# go-rocm Development Guide

## Prerequisites

### Hardware
- AMD GPU with ROCm support. Tested hardware: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM)
- Linux, amd64. The package does not build or run on any other platform
### Operating System
- Ubuntu 24.04 LTS (recommended; also supported: Ubuntu 22.04.5)
- Kernel 6.10+ recommended for RDNA 3 stability. The homelab currently runs 6.17.0
- The amdgpu kernel driver must be loaded (`/dev/kfd` must be present)
### ROCm
Install ROCm 6.x or later. ROCm 7.2.0 is installed on the homelab:
```shell
sudo apt install rocm-dev rocm-libs
rocm-smi    # verify GPU is detected
rocminfo    # verify gfx architecture
```
Confirm `/dev/kfd` exists and is accessible to your user. Add yourself to the `render` and `video` groups if needed:

```shell
sudo usermod -aG render,video $USER
```
### llama-server
llama-server must be built from llama.cpp with HIP/ROCm support. The package does not ship or download the binary.
Build steps (from the homelab):
```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
```
The production binary on the homelab was built from commit 11c325c (cloned 19 Feb 2026). Install to PATH:
```shell
sudo cp build/bin/llama-server /usr/local/bin/llama-server
llama-server --version
```
Alternatively, set `ROCM_LLAMA_SERVER_PATH` to the full path of the binary.
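The resolution order (explicit environment variable first, then PATH lookup) can be pictured with a small sketch. The helper name and shape here are illustrative, not the actual `findLlamaServer` implementation:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// resolveServer resolves the llama-server binary: an explicit path (from
// ROCM_LLAMA_SERVER_PATH) wins, otherwise PATH is searched. Sketch only;
// the real findLlamaServer may differ.
func resolveServer(explicit string) (string, error) {
	if explicit != "" {
		return explicit, nil
	}
	return exec.LookPath("llama-server")
}

func main() {
	p, err := resolveServer(os.Getenv("ROCM_LLAMA_SERVER_PATH"))
	fmt.Println(p, err)
}
```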
**Architecture note:** The RX 7800 XT is physically gfx1100. Earlier documentation from Virgil stated gfx1101; `rocminfo` on the actual hardware confirms gfx1100. Use `-DAMDGPU_TARGETS=gfx1100`. No `HSA_OVERRIDE_GFX_VERSION` override is required.
### Go
Go 1.25.5 or later (as specified in `go.mod`). The module uses Go 1.22+ range-over-integer syntax and Go 1.23 `iter.Seq`.
### go-inference
go-rocm depends on `forge.lthn.ai/core/go-inference`. The `go.mod` replaces it with a local path (`../go-inference`). The go-inference directory must be present as a sibling of go-rocm:
```
Code/
├── go-rocm/
└── go-inference/
```
If checking out go-rocm independently, run `go work sync` or adjust the `replace` directive.
## Running Tests

### Unit Tests (no GPU required)
The standard test invocation runs unit tests that do not touch GPU hardware:
```shell
go test ./...
```
This covers:
- `server_test.go` — `findLlamaServer`, `freePort`, `serverEnv`, `server.alive()`, dead-server error handling, retry behaviour
- `vram_test.go` — sysfs parsing logic
- `discover_test.go` — model discovery
- `internal/llamacpp/health_test.go` and `client_test.go` — HTTP client and SSE parser
- `internal/gguf/gguf_test.go` — GGUF binary parser
Some unit tests in `server_test.go` carry the `//go:build linux && amd64` constraint and will only run on Linux. They do not require a GPU, but they do require `llama-server` to be present in PATH.
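As an example of what these GPU-free unit tests exercise, the `freePort` pattern can be sketched as follows. The helper name and body here are an assumption about the implementation, not the actual code:

```go
package main

import (
	"fmt"
	"net"
)

// freePortSketch asks the kernel for an ephemeral port by listening on
// port 0 and reading back the assigned address. Illustrative sketch of
// the pattern a freePort helper typically uses.
func freePortSketch() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close() // release the port so llama-server can bind it
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	p, err := freePortSketch()
	fmt.Println(p, err)
}
```

Note the small race inherent in this pattern: the port is released before the child process binds it, so another process could grab it in between. In practice retry logic (as tested in `server_test.go`) covers that window.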
### Integration Tests (GPU required)
Integration tests are gated behind the `rocm` build tag:
```shell
go test -tags rocm -v -run TestROCm ./...
```
These tests require:
- `/dev/kfd` present
- `llama-server` in PATH or `ROCM_LLAMA_SERVER_PATH` set
- The test model at `/data/lem/gguf/LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf` (SMB mount from M3)
Each test calls `skipIfNoROCm(t)` and `skipIfNoModel(t)` so they skip cleanly when hardware or the model mount is unavailable.
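A guard like `skipIfNoROCm(t)` presumably wraps a device-node check with `t.Skip`. A minimal sketch of that check, generalised over the path (the real helper may also verify permissions):

```go
package main

import (
	"fmt"
	"os"
)

// devicePresent reports whether a device node (or any path) exists.
// A skipIfNoROCm(t)-style guard would call this with "/dev/kfd" and
// invoke t.Skip when it returns false. Sketch only.
func devicePresent(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	if !devicePresent("/dev/kfd") {
		fmt.Println("skip: /dev/kfd not present")
		return
	}
	fmt.Println("ROCm device present")
}
```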
Available integration tests:
| Test | What it verifies |
|---|---|
| `TestROCm_LoadAndGenerate` | Full load + Generate, checks architecture from GGUF metadata |
| `TestROCm_Chat` | Multi-turn Chat with chat template applied by llama-server |
| `TestROCm_ContextCancellation` | Context cancel stops iteration mid-stream |
| `TestROCm_GracefulShutdown` | Server survives context cancel; second Generate succeeds |
| `TestROCm_ConcurrentRequests` | Three goroutines calling Generate simultaneously |
| `TestROCm_DiscoverModels` | DiscoverModels returns non-empty result for model directory |
### Benchmarks (GPU required)
```shell
go test -tags rocm -bench=. -benchtime=3x ./...
```
Benchmarks test three models in sequence (Gemma3-4B, Llama3.1-8B, Qwen2.5-7B). They skip if any model file is absent:
| Benchmark | Metric reported |
|---|---|
| `BenchmarkDecode` | tok/s for 128-token generation |
| `BenchmarkTTFT` | µs/first-tok (time to first token) |
| `BenchmarkConcurrent` | aggregate tok/s with 4 goroutines and 4 parallel slots |
Model load time is excluded from benchmark timing via `b.StopTimer()` / `b.StartTimer()`. VRAM usage is logged after each load via `GetVRAMInfo()`.
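The VRAM figures come from amdgpu sysfs counters. A minimal sketch of the parsing step, assuming the counter file holds a single decimal byte count (as `mem_info_vram_used` does); the real `GetVRAMInfo` may differ:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSysfsBytes parses an amdgpu sysfs counter such as
// mem_info_vram_used, which contains one decimal byte count and a
// trailing newline.
func parseSysfsBytes(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// toMiB converts a byte count to whole mebibytes.
func toMiB(b uint64) uint64 { return b / (1024 * 1024) }

func main() {
	b, err := parseSysfsBytes("4953473024\n") // example raw counter value
	fmt.Println(toMiB(b), err)                // prints 4724 <nil>
}
```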
Reference results (RX 7800 XT, ROCm 7.2.0, ctx=2048, benchtime=3x):
Decode speed:
| Model | tok/s | VRAM Used |
|---|---|---|
| Gemma3-4B-Q4_K_M | 102.5 | 4724 MiB |
| Llama-3.1-8B-Q4_K_M | 77.1 | 6482 MiB |
| Qwen-2.5-7B-Q4_K_M | 84.4 | 6149 MiB |
Time to first token:
| Model | TTFT |
|---|---|
| Gemma3-4B-Q4_K_M | 13.8 ms |
| Llama-3.1-8B-Q4_K_M | 17.1 ms |
| Qwen-2.5-7B-Q4_K_M | 16.8 ms |
Concurrent throughput (4 parallel slots, 4 goroutines, 32 tokens each):
| Model | Aggregate tok/s | vs single-slot |
|---|---|---|
| Gemma3-4B-Q4_K_M | 238.9 | 2.3x |
| Llama-3.1-8B-Q4_K_M | 166.2 | 2.2x |
| Qwen-2.5-7B-Q4_K_M | 178.0 | 2.1x |
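The "vs single-slot" column is simply aggregate throughput divided by the decode-table figure, rounded to one decimal place:

```go
package main

import (
	"fmt"
	"math"
)

// speedup divides aggregate concurrent throughput by single-slot
// throughput and rounds to one decimal place.
func speedup(aggregate, single float64) float64 {
	return math.Round(aggregate/single*10) / 10
}

func main() {
	fmt.Println(speedup(238.9, 102.5)) // Gemma3-4B: 2.3
	fmt.Println(speedup(166.2, 77.1))  // Llama-3.1-8B: 2.2
	fmt.Println(speedup(178.0, 84.4))  // Qwen-2.5-7B: 2.1
}
```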
## Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| `ROCM_LLAMA_SERVER_PATH` | PATH lookup | Explicit path to llama-server binary |
| `HIP_VISIBLE_DEVICES` | overridden to `0` | go-rocm always sets this to `0` when spawning llama-server |
| `HSA_OVERRIDE_GFX_VERSION` | unset | Not required; GPU is native gfx1100 |
| `ROCM_MODEL_DIR` | none | Conventional directory for model files (not read by go-rocm itself) |
`HIP_VISIBLE_DEVICES=0` is set unconditionally by `serverEnv()`, overriding any value in the calling process's environment. This masks the Ryzen 9 9950X's iGPU (Device 1), which otherwise causes llama-server to crash when it attempts to split tensors across the iGPU and dGPU.
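A sketch of that override, assuming the child environment is built from the parent's; the function name and exact shape are illustrative, not the real `serverEnv`:

```go
package main

import (
	"fmt"
	"strings"
)

// pinGPU returns a copy of env with any inherited HIP_VISIBLE_DEVICES
// entry dropped and the dGPU pinned as device 0, mirroring serverEnv's
// unconditional override (sketch only).
func pinGPU(env []string) []string {
	out := make([]string, 0, len(env)+1)
	for _, kv := range env {
		if strings.HasPrefix(kv, "HIP_VISIBLE_DEVICES=") {
			continue // discard whatever the caller had set
		}
		out = append(out, kv)
	}
	return append(out, "HIP_VISIBLE_DEVICES=0")
}

func main() {
	fmt.Println(pinGPU([]string{"PATH=/usr/bin", "HIP_VISIBLE_DEVICES=1"}))
}
```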
## VRAM Budget
With 16 GB VRAM on the RX 7800 XT, the following models fit comfortably:
| Model | Quant | VRAM (model) | Context 4K | Total | Fits? |
|---|---|---|---|---|---|
| Qwen3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Gemma3-4B | Q4_K_M | ~3 GB | ~0.3 GB | ~3.3 GB | Yes |
| Llama3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Qwen3-8B | Q8_0 | ~9 GB | ~0.5 GB | ~9.5 GB | Yes |
| Gemma3-12B | Q4_K_M | ~7.5 GB | ~0.8 GB | ~8.3 GB | Yes |
| Gemma3-27B | Q4_K_M | ~16 GB | ~1.5 GB | ~17.5 GB | Tight |
| Llama3-70B | Q4_K_M | ~40 GB | ~2 GB | ~42 GB | No (partial offload) |
The context cap (`min(model_context_length, 4096)` by default) is essential for models like Gemma3-4B and Llama-3.1-8B, which have a 131072-token native context. Without the cap, the KV-cache allocation alone would exhaust VRAM.
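The cap itself is a single comparison; a minimal sketch (names are illustrative):

```go
package main

import "fmt"

// cappedContext applies the default context cap described above:
// min(model_context_length, 4096).
func cappedContext(modelCtx uint64) uint64 {
	const maxCtx = 4096
	if modelCtx < maxCtx {
		return modelCtx
	}
	return maxCtx
}

func main() {
	fmt.Println(cappedContext(131072)) // 131072-token native context -> 4096
	fmt.Println(cappedContext(2048))   // smaller native context kept as-is
}
```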
## Test Patterns
Tests use `github.com/stretchr/testify`'s `assert` and `require` packages. The naming convention from the broader Go ecosystem applies:
- `_Good` suffix — happy path
- `_Bad` suffix — expected error conditions
- `_Ugly` suffix — panic or edge cases
Integration tests use `skipIfNoROCm(t)` and `skipIfNoModel(t)` guards. Never use `t.Fatal` to skip; always use `t.Skip`.
When writing new unit tests that do not need GPU hardware, do not add the `rocm` build tag. The `linux && amd64` constraint is sufficient for tests that exercise Linux-specific code paths.
## Coding Standards
- Language: UK English throughout. Colour, organisation, initialise, behaviour — never American spellings
- Strict types: `declare(strict_types=1)` is a PHP convention, but the Go equivalent applies: use concrete types, avoid `any` except where the interface demands it
- Error messages: lower case, no trailing punctuation, prefixed with the package context: `"rocm: ..."`, `"llamacpp: ..."`, `"gguf: ..."`
- Formatting: `gofmt`/`goimports`. No exceptions
- Licence: EUPL-1.2. All new files must include the licence header if adding a file header comment
## Conventional Commits
Use the conventional commits format:
```
type(scope): description
```

Examples:

```
feat(server): add GPU layer count override via environment variable
fix(gguf): handle uint64 context_length from v3 producers
test(integration): add DiscoverModels test for SMB mount
docs(architecture): update VRAM budget table
```
Types: `feat`, `fix`, `test`, `docs`, `refactor`, `perf`, `chore`
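A commit subject matching this format can be checked with a short regular expression. This is a hypothetical lint helper, not project tooling, and the scope pattern (`[a-z0-9-]+`) is an assumption:

```go
package main

import (
	"fmt"
	"regexp"
)

// commitRe validates the type(scope): description shape with the
// accepted types listed above. The scope character class is assumed.
var commitRe = regexp.MustCompile(`^(feat|fix|test|docs|refactor|perf|chore)(\([a-z0-9-]+\))?: .+$`)

func main() {
	fmt.Println(commitRe.MatchString("fix(gguf): handle uint64 context_length from v3 producers")) // true
	fmt.Println(commitRe.MatchString("Fixed a bug"))                                               // false
}
```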
### Co-Authorship
All commits must include the co-author trailer:
```
Co-Authored-By: Virgil <virgil@lethean.io>
```
## Adding a New Backend Feature
The typical sequence for a new go-rocm feature:
1. If the feature requires a go-inference interface change (a new `LoadOption`, `GenerateOption`, or `TextModel` method), write that change first in go-inference and coordinate with Virgil (the orchestrator) before implementing the consumer side
2. Write unit tests first; most server and client behaviour is testable without GPU hardware
3. If integration testing on the homelab is needed, use the `//go:build rocm` tag
4. Update `docs/architecture.md` if the data flow or component structure changes
5. Record benchmark results in `docs/history.md` under the relevant phase if performance characteristics change materially
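When a go-inference interface change is needed, a functional-option shape like the following is one plausible sketch of a new `LoadOption`. All names and fields here are hypothetical stand-ins, not the actual go-inference API:

```go
package main

import "fmt"

// loadConfig and LoadOption are hypothetical stand-ins for the
// go-inference types; the real interface may differ.
type loadConfig struct {
	gpuLayers int
}

type LoadOption func(*loadConfig)

// WithGPULayers is a hypothetical option overriding the GPU layer count.
func WithGPULayers(n int) LoadOption {
	return func(c *loadConfig) { c.gpuLayers = n }
}

func main() {
	cfg := loadConfig{gpuLayers: -1} // -1: offload all layers (assumed default)
	for _, opt := range []LoadOption{WithGPULayers(28)} {
		opt(&cfg)
	}
	fmt.Println(cfg.gpuLayers)
}
```

The functional-option pattern keeps the exported surface stable: adding a new option is backwards-compatible, which matters when go-rocm and go-inference evolve in separate repositories.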