Snider 7915f7ad3c docs: graduate TODO/FINDINGS into production documentation
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:03:17 +00:00


go-rocm Development Guide

Prerequisites

Hardware

  • AMD GPU with ROCm support. Tested hardware: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM)
  • Linux, amd64. The package does not build or run on any other platform

Operating System

  • Ubuntu 24.04 LTS (recommended; also supported: Ubuntu 22.04.5)
  • Kernel 6.10+ recommended for RDNA 3 stability. The homelab currently runs 6.17.0
  • The amdgpu kernel driver must be loaded (/dev/kfd must be present)

ROCm

Install ROCm 6.x or later. ROCm 7.2.0 is installed on the homelab:

sudo apt install rocm-dev rocm-libs
rocm-smi           # verify GPU is detected
rocminfo           # verify gfx architecture

Confirm /dev/kfd exists and is accessible to your user. Add yourself to the render and video groups if needed:

sudo usermod -aG render,video $USER

llama-server

llama-server must be built from llama.cpp with HIP/ROCm support. The package does not ship or download the binary.

Build steps (from the homelab):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release

cmake --build build --parallel $(nproc) -t llama-server

The production binary on the homelab was built from commit 11c325c (cloned 19 Feb 2026). Install to PATH:

sudo cp build/bin/llama-server /usr/local/bin/llama-server
llama-server --version

Alternatively, set ROCM_LLAMA_SERVER_PATH to the full binary path.

Architecture note: The RX 7800 XT is physically gfx1100. Earlier documentation from Virgil stated gfx1101; rocminfo on the actual hardware confirms gfx1100. Use -DAMDGPU_TARGETS=gfx1100. No HSA_OVERRIDE_GFX_VERSION override is required.

Go

Go 1.25.5 or later (as specified in go.mod). The module uses Go 1.22+ range-over-integer syntax and Go 1.23 iter.Seq.

go-inference

go-rocm depends on forge.lthn.ai/core/go-inference. The go.mod replaces it with a local path (../go-inference). The go-inference directory must be present as a sibling of go-rocm:

Code/
├── go-rocm/
└── go-inference/

If checking out go-rocm independently, either set up a Go workspace covering both modules or adjust the replace directive in go.mod.
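
One way to satisfy the replace directive is a go.work file at the Code/ root (a sketch; assumes go-inference has already been cloned into the sibling directory shown above):

```
go 1.25.5

use (
	./go-rocm
	./go-inference
)
```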

Running Tests

Unit Tests (no GPU required)

The standard test invocation runs unit tests that do not touch GPU hardware:

go test ./...

This covers:

  • server_test.go — findLlamaServer, freePort, serverEnv, server.alive(), dead-server error handling, retry behaviour
  • vram_test.go — sysfs parsing logic
  • discover_test.go — model discovery
  • internal/llamacpp/health_test.go and client_test.go — HTTP client and SSE parser
  • internal/gguf/gguf_test.go — GGUF binary parser

Some unit tests in server_test.go carry the //go:build linux && amd64 constraint and will only run on linux/amd64. They do not require a GPU but do require llama-server to be present in PATH.

Integration Tests (GPU required)

Integration tests are gated behind the rocm build tag:

go test -tags rocm -v -run TestROCm ./...

These tests require:

  • /dev/kfd present
  • llama-server in PATH or ROCM_LLAMA_SERVER_PATH set
  • The test model at /data/lem/gguf/LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf (SMB mount from M3)

Each test calls skipIfNoROCm(t) and skipIfNoModel(t) so they skip cleanly when hardware or the model mount is unavailable.

Available integration tests:

| Test | What it verifies |
| --- | --- |
| TestROCm_LoadAndGenerate | Full load + Generate; checks architecture from GGUF metadata |
| TestROCm_Chat | Multi-turn Chat with chat template applied by llama-server |
| TestROCm_ContextCancellation | Context cancel stops iteration mid-stream |
| TestROCm_GracefulShutdown | Server survives context cancel; second Generate succeeds |
| TestROCm_ConcurrentRequests | Three goroutines calling Generate simultaneously |
| TestROCm_DiscoverModels | DiscoverModels returns non-empty result for model directory |

Benchmarks (GPU required)

go test -tags rocm -bench=. -benchtime=3x ./...

Benchmarks test three models in sequence (Gemma3-4B, Llama3.1-8B, Qwen2.5-7B). They skip if any model file is absent:

| Benchmark | Metric reported |
| --- | --- |
| BenchmarkDecode | tok/s for 128-token generation |
| BenchmarkTTFT | Time to first token (µs) |
| BenchmarkConcurrent | Aggregate tok/s with 4 goroutines and 4 parallel slots |

Model load time is excluded from benchmark timing via b.StopTimer() / b.StartTimer(). VRAM usage is logged after each load via GetVRAMInfo().
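
The timing pattern looks roughly like this (generate is a stand-in for the real GPU-backed call; the actual benchmarks live in the rocm-tagged files):

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// generate stands in for a 128-token Generate call.
func generate(nTokens int) {
	time.Sleep(time.Duration(nTokens) * time.Microsecond)
}

// benchmarkDecode mirrors the pattern described above: the one-off model
// load is bracketed by StopTimer/StartTimer so it is excluded from the
// per-iteration measurement.
func benchmarkDecode(b *testing.B) {
	b.StopTimer()
	time.Sleep(10 * time.Millisecond) // stand-in for model load
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		generate(128)
	}
}

func main() {
	r := testing.Benchmark(benchmarkDecode)
	fmt.Printf("%d iterations, %v/op\n", r.N, time.Duration(r.NsPerOp()))
}
```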

Reference results (RX 7800 XT, ROCm 7.2.0, ctx=2048, benchtime=3x):

Decode speed:

| Model | tok/s | VRAM used |
| --- | --- | --- |
| Gemma3-4B-Q4_K_M | 102.5 | 4724 MiB |
| Llama-3.1-8B-Q4_K_M | 77.1 | 6482 MiB |
| Qwen-2.5-7B-Q4_K_M | 84.4 | 6149 MiB |

Time to first token:

| Model | TTFT |
| --- | --- |
| Gemma3-4B-Q4_K_M | 13.8 ms |
| Llama-3.1-8B-Q4_K_M | 17.1 ms |
| Qwen-2.5-7B-Q4_K_M | 16.8 ms |

Concurrent throughput (4 parallel slots, 4 goroutines, 32 tokens each):

| Model | Aggregate tok/s | vs single-slot |
| --- | --- | --- |
| Gemma3-4B-Q4_K_M | 238.9 | 2.3x |
| Llama-3.1-8B-Q4_K_M | 166.2 | 2.2x |
| Qwen-2.5-7B-Q4_K_M | 178.0 | 2.1x |

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| ROCM_LLAMA_SERVER_PATH | PATH lookup | Explicit path to llama-server binary |
| HIP_VISIBLE_DEVICES | overridden to 0 | go-rocm always sets this to 0 when spawning llama-server |
| HSA_OVERRIDE_GFX_VERSION | unset | Not required; GPU is native gfx1100 |
| ROCM_MODEL_DIR | none | Conventional directory for model files (not read by go-rocm itself) |

HIP_VISIBLE_DEVICES=0 is set unconditionally by serverEnv(), overriding any value in the calling process's environment. This masks the Ryzen 9 9950X's iGPU (Device 1), which otherwise causes llama-server to crash when it attempts to split tensors across the iGPU and dGPU.

VRAM Budget

With 16 GB VRAM on the RX 7800 XT, the following models fit comfortably:

| Model | Quant | VRAM (model) | Context (4K) | Total | Fits? |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Gemma3-4B | Q4_K_M | ~3 GB | ~0.3 GB | ~3.3 GB | Yes |
| Llama3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Qwen3-8B | Q8_0 | ~9 GB | ~0.5 GB | ~9.5 GB | Yes |
| Gemma3-12B | Q4_K_M | ~7.5 GB | ~0.8 GB | ~8.3 GB | Yes |
| Gemma3-27B | Q4_K_M | ~16 GB | ~1.5 GB | ~17.5 GB | Tight |
| Llama3-70B | Q4_K_M | ~40 GB | ~2 GB | ~42 GB | No (partial offload) |

The context cap (min(model_context_length, 4096) by default) is essential for models like Gemma3-4B and Llama-3.1-8B, which have 131072-token native context. Without the cap, the KV cache allocation alone would exhaust VRAM.
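
The cap itself is trivial arithmetic; for Gemma3-4B the 131072-token native context collapses to 4096 (a sketch; the real option name and plumbing may differ):

```go
package main

import "fmt"

// ctxCap applies the default context cap described above:
// min(model_context_length, maxCtx).
func ctxCap(modelCtx, maxCtx int) int {
	if modelCtx < maxCtx {
		return modelCtx
	}
	return maxCtx
}

func main() {
	fmt.Println(ctxCap(131072, 4096)) // Gemma3-4B: capped to 4096
	fmt.Println(ctxCap(2048, 4096))   // small models keep their native context
}
```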

Test Patterns

Tests use github.com/stretchr/testify (assert and require). The naming convention from the broader Go ecosystem applies:

  • _Good suffix — happy path
  • _Bad suffix — expected error conditions
  • _Ugly suffix — panic or edge cases

Integration tests use skipIfNoROCm(t) and skipIfNoModel(t) guards. Never use t.Fatal to skip; always use t.Skip.

When writing new unit tests that do not need GPU hardware, do not add the rocm build tag. The linux && amd64 tag is sufficient for tests that test Linux-specific code paths.

Coding Standards

  • Language: UK English throughout. Colour, organisation, initialise, behaviour — never American spellings
  • Strict types: the spirit of PHP's declare(strict_types=1) applies in Go — use concrete types and avoid any except where an interface demands it
  • Error messages: Lower case, no trailing punctuation. Prefixed with the package context: "rocm: ...", "llamacpp: ...", "gguf: ..."
  • Formatting: gofmt / goimports. No exceptions
  • Licence: EUPL-1.2. Any new file that carries a file header comment must include the licence header

Conventional Commits

Use the conventional commits format:

type(scope): description

feat(server): add GPU layer count override via environment variable
fix(gguf): handle uint64 context_length from v3 producers
test(integration): add DiscoverModels test for SMB mount
docs(architecture): update VRAM budget table

Types: feat, fix, test, docs, refactor, perf, chore

Co-Authorship

All commits must include the co-author trailer:

Co-Authored-By: Virgil <virgil@lethean.io>

Adding a New Backend Feature

The typical sequence for a new go-rocm feature:

  1. If the feature requires a go-inference interface change (new LoadOption, GenerateOption, or TextModel method), write that change first in go-inference and coordinate with Virgil (the orchestrator) before implementing the consumer side
  2. Write unit tests first; most server and client behaviour is testable without GPU hardware
  3. If integration testing on the homelab is needed, use the //go:build rocm tag
  4. Update docs/architecture.md if the data flow or component structure changes
  5. Record benchmark results in docs/history.md under the relevant phase if performance characteristics change materially