# go-rocm Development Guide

## Prerequisites

### Hardware

- AMD GPU with ROCm support. Tested hardware: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM)
- Linux, amd64. The package does not build or run on any other platform

### Operating System

- Ubuntu 24.04 LTS (recommended; also supported: Ubuntu 22.04.5)
- Kernel 6.10+ recommended for RDNA 3 stability. The homelab currently runs 6.17.0
- The amdgpu kernel driver must be loaded (`/dev/kfd` must be present)
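
A quick way to confirm the driver prerequisite before going further (the output naturally depends on your machine):

```shell
# ROCm compute requires the KFD device node exposed by the amdgpu driver
if [ -e /dev/kfd ]; then
  echo "/dev/kfd: present"
else
  echo "/dev/kfd: missing (is the amdgpu driver loaded?)"
fi
```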

### ROCm

Install ROCm 6.x or later. ROCm 7.2.0 is installed on the homelab:
```bash
sudo apt install rocm-dev rocm-libs
rocm-smi # verify GPU is detected
rocminfo # verify gfx architecture
```
Confirm `/dev/kfd` exists and is accessible to your user. Add yourself to the `render` and `video` groups if needed:
```bash
sudo usermod -aG render,video $USER
```

### llama-server

llama-server must be built from llama.cpp with HIP/ROCm support. The package does not ship or download the binary.
**Build steps** (from the homelab):
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1100 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --parallel $(nproc) -t llama-server
```
The production binary on the homelab was built from commit `11c325c` (cloned 19 Feb 2026). Install to PATH:
```bash
sudo cp build/bin/llama-server /usr/local/bin/llama-server
llama-server --version
```
Alternatively, set `ROCM_LLAMA_SERVER_PATH` to the full binary path.
**Architecture note**: The RX 7800 XT is physically gfx1100. Earlier documentation from Virgil stated gfx1101; `rocminfo` on the actual hardware confirms gfx1100. Use `-DAMDGPU_TARGETS=gfx1100`. No `HSA_OVERRIDE_GFX_VERSION` override is required.

### Go

Go 1.25.5 or later (as specified in `go.mod`). The module uses Go 1.22+ range-over-integer syntax and Go 1.23 `iter.Seq`.

### go-inference

go-rocm depends on `forge.lthn.ai/core/go-inference`. The `go.mod` replaces it with a local path (`../go-inference`). The go-inference directory must be present as a sibling of go-rocm:
```
Code/
├── go-rocm/
└── go-inference/
```
If checking out go-rocm independently, run `go work sync` or adjust the `replace` directive to point at your go-inference checkout.
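
Concretely, the sibling wiring is a `replace` directive of this shape (excerpt only; the corresponding `require` line keeps whatever version `go.mod` currently pins):

```
// go.mod (excerpt)
replace forge.lthn.ai/core/go-inference => ../go-inference
```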

## Running Tests

### Unit Tests (no GPU required)

The standard test invocation runs unit tests that do not touch GPU hardware:
```bash
go test ./...
```
This covers:

- `server_test.go` — `findLlamaServer`, `freePort`, `serverEnv`, `server.alive()`, dead-server error handling, retry behaviour
- `vram_test.go` — sysfs parsing logic
- `discover_test.go` — model discovery
- `internal/llamacpp/health_test.go` and `client_test.go` — HTTP client and SSE parser
- `internal/gguf/gguf_test.go` — GGUF binary parser
Some unit tests in `server_test.go` carry the `//go:build linux && amd64` constraint and therefore only run on linux/amd64. They do not require a GPU, but they do require llama-server to be present in PATH.

### Integration Tests (GPU required)

Integration tests are gated behind the `rocm` build tag:
```bash
go test -tags rocm -v -run TestROCm ./...
```
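
A gated test file simply carries the tag in its constraint line; anything without the tag stays in the default `go test ./...` run:

```go
//go:build rocm

package rocm
```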
These tests require:

- `/dev/kfd` present
- `llama-server` in PATH or `ROCM_LLAMA_SERVER_PATH` set
- The test model at `/data/lem/gguf/LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf` (SMB mount from M3)
Each test calls `skipIfNoROCm(t)` and `skipIfNoModel(t)` so they skip cleanly when hardware or the model mount is unavailable.
**Available integration tests:**
| Test | What it verifies |
|------|-----------------|
| `TestROCm_LoadAndGenerate` | Full load + Generate, checks architecture from GGUF metadata |
| `TestROCm_Chat` | Multi-turn Chat with chat template applied by llama-server |
| `TestROCm_ContextCancellation` | Context cancel stops iteration mid-stream |
| `TestROCm_GracefulShutdown` | Server survives context cancel; second Generate succeeds |
| `TestROCm_ConcurrentRequests` | Three goroutines calling Generate simultaneously |
| `TestROCm_DiscoverModels` | DiscoverModels returns non-empty result for model directory |

### Benchmarks (GPU required)

```bash
go test -tags rocm -bench=. -benchtime=3x ./...
```
Benchmarks test three models in sequence (Gemma3-4B, Llama3.1-8B, Qwen2.5-7B). They skip if any model file is absent:
| Benchmark | Metric reported |
|-----------|----------------|
| `BenchmarkDecode` | tok/s for 128-token generation |
| `BenchmarkTTFT` | µs/first-tok (time to first token) |
| `BenchmarkConcurrent` | tok/s-aggregate with 4 goroutines and 4 parallel slots |
Model load time is excluded from benchmark timing via `b.StopTimer()` / `b.StartTimer()`. VRAM usage is logged after each load via `GetVRAMInfo()`.
**Reference results (RX 7800 XT, ROCm 7.2.0, ctx=2048, benchtime=3x):**
Decode speed:
| Model | tok/s | VRAM Used |
|-------|-------|-----------|
| Gemma3-4B-Q4_K_M | 102.5 | 4724 MiB |
| Llama-3.1-8B-Q4_K_M | 77.1 | 6482 MiB |
| Qwen-2.5-7B-Q4_K_M | 84.4 | 6149 MiB |
Time to first token:
| Model | TTFT |
|-------|------|
| Gemma3-4B-Q4_K_M | 13.8 ms |
| Llama-3.1-8B-Q4_K_M | 17.1 ms |
| Qwen-2.5-7B-Q4_K_M | 16.8 ms |
Concurrent throughput (4 parallel slots, 4 goroutines, 32 tokens each):
| Model | Aggregate tok/s | vs single-slot |
|-------|----------------|---------------|
| Gemma3-4B-Q4_K_M | 238.9 | 2.3x |
| Llama-3.1-8B-Q4_K_M | 166.2 | 2.2x |
| Qwen-2.5-7B-Q4_K_M | 178.0 | 2.1x |

## Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `ROCM_LLAMA_SERVER_PATH` | PATH lookup | Explicit path to llama-server binary |
| `HIP_VISIBLE_DEVICES` | overridden to `0` | go-rocm always sets this to 0 when spawning llama-server |
| `HSA_OVERRIDE_GFX_VERSION` | unset | Not required; GPU is native gfx1100 |
| `ROCM_MODEL_DIR` | none | Conventional directory for model files (not read by go-rocm itself) |
`HIP_VISIBLE_DEVICES=0` is set unconditionally by `serverEnv()`, overriding any value in the calling process's environment. This masks the Ryzen 9 9950X's iGPU (Device 1), which otherwise causes llama-server to crash when it attempts to split tensors across the iGPU and dGPU.

## VRAM Budget

With 16 GB VRAM on the RX 7800 XT, the following models fit comfortably:
| Model | Quant | VRAM (model) | Context 4K | Total | Fits? |
|-------|-------|-------------|-----------|-------|-------|
| Qwen3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Gemma3-4B | Q4_K_M | ~3 GB | ~0.3 GB | ~3.3 GB | Yes |
| Llama3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Qwen3-8B | Q8_0 | ~9 GB | ~0.5 GB | ~9.5 GB | Yes |
| Gemma3-12B | Q4_K_M | ~7.5 GB | ~0.8 GB | ~8.3 GB | Yes |
| Gemma3-27B | Q4_K_M | ~16 GB | ~1.5 GB | ~17.5 GB | Tight |
| Llama3-70B | Q4_K_M | ~40 GB | ~2 GB | ~42 GB | No (partial offload) |
The context cap (`min(model_context_length, 4096)` by default) is essential for models like Gemma3-4B and Llama-3.1-8B, which have 131072-token native context. Without the cap, the KV cache allocation alone would exhaust VRAM.

## Test Patterns

Tests use `github.com/stretchr/testify/assert` and `require`. The naming convention from the broader Go ecosystem applies:
- `_Good` suffix — happy path
- `_Bad` suffix — expected error conditions
- `_Ugly` suffix — panic or edge cases
Integration tests use `skipIfNoROCm(t)` and `skipIfNoModel(t)` guards. Never use `t.Fatal` to skip; always use `t.Skip`.
When writing new unit tests that do not need GPU hardware, do not add the `rocm` build tag. The `linux && amd64` constraint is sufficient for tests that exercise Linux-specific code paths.

## Coding Standards

- **Language**: UK English throughout. Colour, organisation, initialise, behaviour — never American spellings
- **Strict types**: use concrete types and avoid `any` except where an interface demands it (the Go analogue of PHP's `declare(strict_types=1)`)
- **Error messages**: Lower case, no trailing punctuation. Prefixed with the package context: `"rocm: ..."`, `"llamacpp: ..."`, `"gguf: ..."`
- **Formatting**: `gofmt` / `goimports`. No exceptions
- **Licence**: EUPL-1.2. Any new file that carries a header comment must include the licence header

## Conventional Commits

Use the conventional commits format:
```
type(scope): description

feat(server): add GPU layer count override via environment variable
fix(gguf): handle uint64 context_length from v3 producers
test(integration): add DiscoverModels test for SMB mount
docs(architecture): update VRAM budget table
```
Types: `feat`, `fix`, `test`, `docs`, `refactor`, `perf`, `chore`

## Co-Authorship

All commits must include the co-author trailer:
```
Co-Authored-By: Virgil <virgil@lethean.io>
```

## Adding a New Backend Feature

The typical sequence for a new go-rocm feature:
1. If the feature requires a go-inference interface change (new `LoadOption`, `GenerateOption`, or `TextModel` method), write that change first in go-inference and coordinate with Virgil (the orchestrator) before implementing the consumer side
2. Write unit tests first; most server and client behaviour is testable without GPU hardware
3. If integration testing on the homelab is needed, use the `//go:build rocm` tag
4. Update `docs/architecture.md` if the data flow or component structure changes
5. Record benchmark results in `docs/history.md` under the relevant phase if performance characteristics change materially