Snider 7915f7ad3c docs: graduate TODO/FINDINGS into production documentation
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:03:17 +00:00


go-rocm Development Guide

Prerequisites

Hardware

  • AMD GPU with ROCm support. Tested hardware: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM)
  • Linux, amd64. The package does not build or run on any other platform

Operating System

  • Ubuntu 24.04 LTS (recommended; also supported: Ubuntu 22.04.5)
  • Kernel 6.10+ recommended for RDNA 3 stability. The homelab currently runs 6.17.0
  • The amdgpu kernel driver must be loaded (/dev/kfd must be present)

ROCm

Install ROCm 6.x or later. ROCm 7.2.0 is installed on the homelab:

sudo apt install rocm-dev rocm-libs
rocm-smi           # verify GPU is detected
rocminfo           # verify gfx architecture

Confirm /dev/kfd exists and is accessible to your user. Add yourself to the render and video groups if needed:

sudo usermod -aG render,video $USER

llama-server

llama-server must be built from llama.cpp with HIP/ROCm support. The package does not ship or download the binary.

Build steps (from the homelab):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release

cmake --build build --parallel $(nproc) -t llama-server

The production binary on the homelab was built from commit 11c325c (cloned 19 Feb 2026). Install to PATH:

sudo cp build/bin/llama-server /usr/local/bin/llama-server
llama-server --version

Alternatively, set ROCM_LLAMA_SERVER_PATH to the full binary path.

Architecture note: The RX 7800 XT is physically gfx1100. Earlier documentation from Virgil stated gfx1101; rocminfo on the actual hardware confirms gfx1100. Use -DAMDGPU_TARGETS=gfx1100. No HSA_OVERRIDE_GFX_VERSION override is required.

Go

Go 1.25.5 or later (as specified in go.mod). The module uses Go 1.22+ range-over-integer syntax and Go 1.23 iter.Seq.

go-inference

go-rocm depends on forge.lthn.ai/core/go-inference. The go.mod replaces it with a local path (../go-inference). The go-inference directory must be present as a sibling of go-rocm:

Code/
├── go-rocm/
└── go-inference/

If checking out go-rocm independently, either set up a Go workspace covering both modules or adjust the replace directive in go.mod.
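
One way to satisfy the replace directive is a go.work file at the Code/ root (a sketch; assumes go-inference has already been cloned into the sibling directory shown above):

```
go 1.25.5

use (
	./go-rocm
	./go-inference
)
```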

Running Tests

Unit Tests (no GPU required)

The standard test invocation runs unit tests that do not touch GPU hardware:

go test ./...

This covers:

  • server_test.go — findLlamaServer, freePort, serverEnv, server.alive(), dead-server error handling, retry behaviour
  • vram_test.go — sysfs parsing logic
  • discover_test.go — model discovery
  • internal/llamacpp/health_test.go and client_test.go — HTTP client and SSE parser
  • internal/gguf/gguf_test.go — GGUF binary parser

Some unit tests in server_test.go carry the //go:build linux && amd64 constraint and will only run on linux/amd64. They do not require a GPU but do require llama-server to be present in PATH.

Integration Tests (GPU required)

Integration tests are gated behind the rocm build tag:

go test -tags rocm -v -run TestROCm ./...

These tests require:

  • /dev/kfd present
  • llama-server in PATH or ROCM_LLAMA_SERVER_PATH set
  • The test model at /data/lem/gguf/LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf (SMB mount from M3)

Each test calls skipIfNoROCm(t) and skipIfNoModel(t) so they skip cleanly when hardware or the model mount is unavailable.

Available integration tests:

| Test | What it verifies |
| --- | --- |
| TestROCm_LoadAndGenerate | Full load + Generate; checks architecture from GGUF metadata |
| TestROCm_Chat | Multi-turn Chat with chat template applied by llama-server |
| TestROCm_ContextCancellation | Context cancel stops iteration mid-stream |
| TestROCm_GracefulShutdown | Server survives context cancel; second Generate succeeds |
| TestROCm_ConcurrentRequests | Three goroutines calling Generate simultaneously |
| TestROCm_DiscoverModels | DiscoverModels returns non-empty result for model directory |

Benchmarks (GPU required)

go test -tags rocm -bench=. -benchtime=3x ./...

Benchmarks test three models in sequence (Gemma3-4B, Llama3.1-8B, Qwen2.5-7B). They skip if any model file is absent:

| Benchmark | Metric reported |
| --- | --- |
| BenchmarkDecode | tok/s for 128-token generation |
| BenchmarkTTFT | Time to first token (µs) |
| BenchmarkConcurrent | Aggregate tok/s with 4 goroutines and 4 parallel slots |

Model load time is excluded from benchmark timing via b.StopTimer() / b.StartTimer(). VRAM usage is logged after each load via GetVRAMInfo().
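
The timing pattern looks roughly like this (generate is a stand-in for the real GPU-backed call; the actual benchmarks live in the rocm-tagged files):

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// generate stands in for a 128-token Generate call.
func generate(nTokens int) {
	time.Sleep(time.Duration(nTokens) * time.Microsecond)
}

// benchmarkDecode mirrors the pattern described above: the one-off model
// load is bracketed by StopTimer/StartTimer so it is excluded from the
// per-iteration measurement.
func benchmarkDecode(b *testing.B) {
	b.StopTimer()
	time.Sleep(10 * time.Millisecond) // stand-in for model load
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		generate(128)
	}
}

func main() {
	r := testing.Benchmark(benchmarkDecode)
	fmt.Printf("%d iterations, %v/op\n", r.N, time.Duration(r.NsPerOp()))
}
```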

Reference results (RX 7800 XT, ROCm 7.2.0, ctx=2048, benchtime=3x):

Decode speed:

| Model | tok/s | VRAM used |
| --- | --- | --- |
| Gemma3-4B-Q4_K_M | 102.5 | 4724 MiB |
| Llama-3.1-8B-Q4_K_M | 77.1 | 6482 MiB |
| Qwen-2.5-7B-Q4_K_M | 84.4 | 6149 MiB |

Time to first token:

| Model | TTFT |
| --- | --- |
| Gemma3-4B-Q4_K_M | 13.8 ms |
| Llama-3.1-8B-Q4_K_M | 17.1 ms |
| Qwen-2.5-7B-Q4_K_M | 16.8 ms |

Concurrent throughput (4 parallel slots, 4 goroutines, 32 tokens each):

| Model | Aggregate tok/s | vs single-slot |
| --- | --- | --- |
| Gemma3-4B-Q4_K_M | 238.9 | 2.3x |
| Llama-3.1-8B-Q4_K_M | 166.2 | 2.2x |
| Qwen-2.5-7B-Q4_K_M | 178.0 | 2.1x |

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| ROCM_LLAMA_SERVER_PATH | PATH lookup | Explicit path to llama-server binary |
| HIP_VISIBLE_DEVICES | overridden to 0 | go-rocm always sets this to 0 when spawning llama-server |
| HSA_OVERRIDE_GFX_VERSION | unset | Not required; GPU is native gfx1100 |
| ROCM_MODEL_DIR | none | Conventional directory for model files (not read by go-rocm itself) |

HIP_VISIBLE_DEVICES=0 is set unconditionally by serverEnv(), overriding any value in the calling process's environment. This masks the Ryzen 9 9950X's iGPU (Device 1), which otherwise causes llama-server to crash when it attempts to split tensors across the iGPU and dGPU.

VRAM Budget

With 16 GB VRAM on the RX 7800 XT, the following models fit comfortably:

| Model | Quant | VRAM (model) | Context (4K) | Total | Fits? |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Gemma3-4B | Q4_K_M | ~3 GB | ~0.3 GB | ~3.3 GB | Yes |
| Llama3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Qwen3-8B | Q8_0 | ~9 GB | ~0.5 GB | ~9.5 GB | Yes |
| Gemma3-12B | Q4_K_M | ~7.5 GB | ~0.8 GB | ~8.3 GB | Yes |
| Gemma3-27B | Q4_K_M | ~16 GB | ~1.5 GB | ~17.5 GB | Tight |
| Llama3-70B | Q4_K_M | ~40 GB | ~2 GB | ~42 GB | No (partial offload) |

The context cap (min(model_context_length, 4096) by default) is essential for models like Gemma3-4B and Llama-3.1-8B, which have 131072-token native context. Without the cap, the KV cache allocation alone would exhaust VRAM.
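
The cap itself is trivial arithmetic; for Gemma3-4B the 131072-token native context collapses to 4096 (a sketch; the real option name and plumbing may differ):

```go
package main

import "fmt"

// ctxCap applies the default context cap described above:
// min(model_context_length, maxCtx).
func ctxCap(modelCtx, maxCtx int) int {
	if modelCtx < maxCtx {
		return modelCtx
	}
	return maxCtx
}

func main() {
	fmt.Println(ctxCap(131072, 4096)) // Gemma3-4B: capped to 4096
	fmt.Println(ctxCap(2048, 4096))   // small models keep their native context
}
```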

Test Patterns

Tests use github.com/stretchr/testify (assert and require). The naming convention from the broader Go ecosystem applies:

  • _Good suffix — happy path
  • _Bad suffix — expected error conditions
  • _Ugly suffix — panic or edge cases

Integration tests use skipIfNoROCm(t) and skipIfNoModel(t) guards. Never use t.Fatal to skip; always use t.Skip.

When writing new unit tests that do not need GPU hardware, do not add the rocm build tag. The linux && amd64 tag is sufficient for tests that test Linux-specific code paths.

Coding Standards

  • Language: UK English throughout. Colour, organisation, initialise, behaviour — never American spellings
  • Strict types: the spirit of PHP's declare(strict_types=1) applies in Go — use concrete types and avoid any except where an interface demands it
  • Error messages: Lower case, no trailing punctuation. Prefixed with the package context: "rocm: ...", "llamacpp: ...", "gguf: ..."
  • Formatting: gofmt / goimports. No exceptions
  • Licence: EUPL-1.2. Any new file that carries a file header comment must include the licence header

Conventional Commits

Use the conventional commits format:

type(scope): description

feat(server): add GPU layer count override via environment variable
fix(gguf): handle uint64 context_length from v3 producers
test(integration): add DiscoverModels test for SMB mount
docs(architecture): update VRAM budget table

Types: feat, fix, test, docs, refactor, perf, chore

Co-Authorship

All commits must include the co-author trailer:

Co-Authored-By: Virgil <virgil@lethean.io>

Adding a New Backend Feature

The typical sequence for a new go-rocm feature:

  1. If the feature requires a go-inference interface change (new LoadOption, GenerateOption, or TextModel method), write that change first in go-inference and coordinate with Virgil (the orchestrator) before implementing the consumer side
  2. Write unit tests first; most server and client behaviour is testable without GPU hardware
  3. If integration testing on the homelab is needed, use the //go:build rocm tag
  4. Update docs/architecture.md if the data flow or component structure changes
  5. Record benchmark results in docs/history.md under the relevant phase if performance characteristics change materially