# go-rocm Development Guide

## Prerequisites

### Hardware

- AMD GPU with ROCm support. Tested hardware: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM)
- Linux, amd64. The package does not build or run on any other platform

### Operating System

- Ubuntu 24.04 LTS (recommended; Ubuntu 22.04.5 is also supported)
- Kernel 6.10+ recommended for RDNA 3 stability. The homelab currently runs 6.17.0
- The amdgpu kernel driver must be loaded (`/dev/kfd` must be present)

### ROCm

Install ROCm 6.x or later. ROCm 7.2.0 is installed on the homelab:

```bash
sudo apt install rocm-dev rocm-libs rocm-smi

# verify the GPU is detected and check the reported gfx architecture
rocminfo
```

Confirm `/dev/kfd` exists and is accessible to your user. Add yourself to the `render` and `video` groups if needed:

```bash
sudo usermod -aG render,video $USER
```

### llama-server

llama-server must be built from llama.cpp with HIP/ROCm support. The package does not ship or download the binary.

**Build steps** (from the homelab):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1100 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
```

The production binary on the homelab was built from commit `11c325c` (cloned 19 Feb 2026). Install to PATH:

```bash
sudo cp build/bin/llama-server /usr/local/bin/llama-server
llama-server --version
```

Alternatively, set `ROCM_LLAMA_SERVER_PATH` to the full binary path.

**Architecture note**: The RX 7800 XT is physically gfx1100. Earlier documentation from Virgil stated gfx1101; `rocminfo` on the actual hardware confirms gfx1100. Use `-DAMDGPU_TARGETS=gfx1100`. No `HSA_OVERRIDE_GFX_VERSION` override is required.

### Go

Go 1.25.5 or later (as specified in `go.mod`). The module uses Go 1.22+ range-over-integer syntax and Go 1.23 `iter.Seq`.

### go-inference

go-rocm depends on `forge.lthn.ai/core/go-inference`.
The `go.mod` replaces it with a local path (`../go-inference`). The go-inference directory must be present as a sibling of go-rocm:

```
Code/
├── go-rocm/
└── go-inference/
```

If checking out go-rocm independently, run `go work sync` or adjust the `replace` directive.

## Running Tests

### Unit Tests (no GPU required)

The standard test invocation runs unit tests that do not touch GPU hardware:

```bash
go test ./...
```

This covers:

- `server_test.go` — `findLlamaServer`, `freePort`, `serverEnv`, `server.alive()`, dead-server error handling, retry behaviour
- `vram_test.go` — sysfs parsing logic
- `discover_test.go` — model discovery
- `internal/llamacpp/health_test.go` and `client_test.go` — HTTP client and SSE parser
- `internal/gguf/gguf_test.go` — GGUF binary parser

Some unit tests in `server_test.go` carry the `//go:build linux && amd64` constraint and will only run on Linux. They do not require a GPU but do require llama-server to be present in PATH.

### Integration Tests (GPU required)

Integration tests are gated behind the `rocm` build tag:

```bash
go test -tags rocm -v -run TestROCm ./...
```

These tests require:

- `/dev/kfd` present
- `llama-server` in PATH or `ROCM_LLAMA_SERVER_PATH` set
- The test model at `/data/lem/gguf/LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf` (SMB mount from M3)

Each test calls `skipIfNoROCm(t)` and `skipIfNoModel(t)` so they skip cleanly when hardware or the model mount is unavailable.
**Available integration tests:**

| Test | What it verifies |
|------|------------------|
| `TestROCm_LoadAndGenerate` | Full load + Generate, checks architecture from GGUF metadata |
| `TestROCm_Chat` | Multi-turn Chat with chat template applied by llama-server |
| `TestROCm_ContextCancellation` | Context cancel stops iteration mid-stream |
| `TestROCm_GracefulShutdown` | Server survives context cancel; second Generate succeeds |
| `TestROCm_ConcurrentRequests` | Three goroutines calling Generate simultaneously |
| `TestROCm_DiscoverModels` | DiscoverModels returns non-empty result for model directory |

### Benchmarks (GPU required)

```bash
go test -tags rocm -bench=. -benchtime=3x ./...
```

Benchmarks test three models in sequence (Gemma3-4B, Llama3.1-8B, Qwen2.5-7B). They skip if any model file is absent:

| Benchmark | Metric reported |
|-----------|-----------------|
| `BenchmarkDecode` | tok/s for 128-token generation |
| `BenchmarkTTFT` | µs to first token (time to first token) |
| `BenchmarkConcurrent` | aggregate tok/s with 4 goroutines and 4 parallel slots |

Model load time is excluded from benchmark timing via `b.StopTimer()` / `b.StartTimer()`. VRAM usage is logged after each load via `GetVRAMInfo()`.
**Reference results (RX 7800 XT, ROCm 7.2.0, ctx=2048, benchtime=3x):**

Decode speed:

| Model | tok/s | VRAM used |
|-------|-------|-----------|
| Gemma3-4B-Q4_K_M | 102.5 | 4724 MiB |
| Llama-3.1-8B-Q4_K_M | 77.1 | 6482 MiB |
| Qwen-2.5-7B-Q4_K_M | 84.4 | 6149 MiB |

Time to first token:

| Model | TTFT |
|-------|------|
| Gemma3-4B-Q4_K_M | 13.8 ms |
| Llama-3.1-8B-Q4_K_M | 17.1 ms |
| Qwen-2.5-7B-Q4_K_M | 16.8 ms |

Concurrent throughput (4 parallel slots, 4 goroutines, 32 tokens each):

| Model | Aggregate tok/s | vs single-slot |
|-------|-----------------|----------------|
| Gemma3-4B-Q4_K_M | 238.9 | 2.3x |
| Llama-3.1-8B-Q4_K_M | 166.2 | 2.2x |
| Qwen-2.5-7B-Q4_K_M | 178.0 | 2.1x |

## Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `ROCM_LLAMA_SERVER_PATH` | PATH lookup | Explicit path to the llama-server binary |
| `HIP_VISIBLE_DEVICES` | overridden to `0` | go-rocm always sets this to `0` when spawning llama-server |
| `HSA_OVERRIDE_GFX_VERSION` | unset | Not required; the GPU is native gfx1100 |
| `ROCM_MODEL_DIR` | none | Conventional directory for model files (not read by go-rocm itself) |

`HIP_VISIBLE_DEVICES=0` is set unconditionally by `serverEnv()`, overriding any value in the calling process's environment. This masks the Ryzen 9 9950X's iGPU (device 1), which otherwise causes llama-server to crash when it attempts to split tensors across the iGPU and dGPU.

## VRAM Budget

With 16 GB of VRAM on the RX 7800 XT, the following models fit comfortably:

| Model | Quant | VRAM (model) | Context 4K | Total | Fits? |
|-------|-------|--------------|------------|-------|-------|
| Qwen3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Gemma3-4B | Q4_K_M | ~3 GB | ~0.3 GB | ~3.3 GB | Yes |
| Llama3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Qwen3-8B | Q8_0 | ~9 GB | ~0.5 GB | ~9.5 GB | Yes |
| Gemma3-12B | Q4_K_M | ~7.5 GB | ~0.8 GB | ~8.3 GB | Yes |
| Gemma3-27B | Q4_K_M | ~16 GB | ~1.5 GB | ~17.5 GB | Tight |
| Llama3-70B | Q4_K_M | ~40 GB | ~2 GB | ~42 GB | No (partial offload) |

The context cap (`min(model_context_length, 4096)` by default) is essential for models like Gemma3-4B and Llama-3.1-8B, which have a 131072-token native context. Without the cap, the KV cache allocation alone would exhaust VRAM.

## Test Patterns

Tests use `github.com/stretchr/testify/assert` and `require`. The naming convention from the broader go ecosystem applies:

- `_Good` suffix — happy path
- `_Bad` suffix — expected error conditions
- `_Ugly` suffix — panic or edge cases

Integration tests use `skipIfNoROCm(t)` and `skipIfNoModel(t)` guards. Never use `t.Fatal` to skip; always use `t.Skip`.

When writing new unit tests that do not need GPU hardware, do not add the `rocm` build tag. The `linux && amd64` constraint is sufficient for tests that exercise Linux-specific code paths.

## Coding Standards

- **Language**: UK English throughout. Colour, organisation, initialise, behaviour — never American spellings
- **Strict types**: `declare(strict_types=1)` is a PHP convention, but the same spirit applies in Go: use concrete types, and avoid `any` except where an interface demands it
- **Error messages**: lower case, no trailing punctuation, prefixed with the package context: `"rocm: ..."`, `"llamacpp: ..."`, `"gguf: ..."`
- **Formatting**: `gofmt` / `goimports`. No exceptions
- **Licence**: EUPL-1.2.
  All new files must include the licence header if adding a file header comment.

## Conventional Commits

Use the conventional commits format:

```
type(scope): description

feat(server): add GPU layer count override via environment variable
fix(gguf): handle uint64 context_length from v3 producers
test(integration): add DiscoverModels test for SMB mount
docs(architecture): update VRAM budget table
```

Types: `feat`, `fix`, `test`, `docs`, `refactor`, `perf`, `chore`

## Co-Authorship

All commits must include the co-author trailer:

```
Co-Authored-By: Virgil
```

## Adding a New Backend Feature

The typical sequence for a new go-rocm feature:

1. If the feature requires a go-inference interface change (a new `LoadOption`, `GenerateOption`, or `TextModel` method), make that change first in go-inference and coordinate with Virgil (the orchestrator) before implementing the consumer side
2. Write unit tests first; most server and client behaviour is testable without GPU hardware
3. If integration testing on the homelab is needed, use the `//go:build rocm` tag
4. Update `docs/architecture.md` if the data flow or component structure changes
5. Record benchmark results in `docs/history.md` under the relevant phase if performance characteristics change materially