# go-rocm Development Guide

## Prerequisites

### Hardware

- AMD GPU with ROCm support. Tested hardware: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM)
- Linux, amd64. The package does not build or run on any other platform

### Operating System

- Ubuntu 24.04 LTS (recommended; Ubuntu 22.04.5 is also supported)
- Kernel 6.10+ recommended for RDNA 3 stability. The homelab currently runs 6.17.0
- The amdgpu kernel driver must be loaded (`/dev/kfd` must be present)

### ROCm

Install ROCm 6.x or later. ROCm 7.2.0 is installed on the homelab:

```bash
sudo apt install rocm-dev rocm-libs rocm-smi

# verify the GPU is detected and check the reported gfx architecture
rocminfo
```

Confirm `/dev/kfd` exists and is accessible to your user. Add yourself to the `render` and `video` groups if needed:

```bash
sudo usermod -aG render,video $USER
```

### llama-server

llama-server must be built from llama.cpp with HIP/ROCm support. The package does not ship or download the binary.

**Build steps** (from the homelab):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1100 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
```

The production binary on the homelab was built from commit `11c325c` (cloned 19 Feb 2026). Install to PATH:

```bash
sudo cp build/bin/llama-server /usr/local/bin/llama-server
llama-server --version
```

Alternatively, set `ROCM_LLAMA_SERVER_PATH` to the full binary path.

**Architecture note**: The RX 7800 XT is physically gfx1100. Earlier documentation from Virgil stated gfx1101; `rocminfo` on the actual hardware confirms gfx1100. Use `-DAMDGPU_TARGETS=gfx1100`. No `HSA_OVERRIDE_GFX_VERSION` override is required.

### Go

Go 1.25.5 or later (as specified in `go.mod`). The module uses Go 1.22+ range-over-integer syntax and Go 1.23 `iter.Seq`.

### go-inference

go-rocm depends on `forge.lthn.ai/core/go-inference`.
The `go.mod` replaces it with a local path (`../go-inference`). The go-inference directory must be present as a sibling of go-rocm:

```
Code/
├── go-rocm/
└── go-inference/
```

If checking out go-rocm independently, run `go work sync` or adjust the `replace` directive.

## Running Tests

### Unit Tests (no GPU required)

The standard test invocation runs unit tests that do not touch GPU hardware:

```bash
go test ./...
```

This covers:

- `server_test.go` — `findLlamaServer`, `freePort`, `serverEnv`, `server.alive()`, dead-server error handling, retry behaviour
- `vram_test.go` — sysfs parsing logic
- `discover_test.go` — model discovery
- `internal/llamacpp/health_test.go` and `client_test.go` — HTTP client and SSE parser
- `internal/gguf/gguf_test.go` — GGUF binary parser

Some unit tests in `server_test.go` carry the `//go:build linux && amd64` constraint and will only run on Linux. They do not require a GPU but do require llama-server to be present in PATH.

### Integration Tests (GPU required)

Integration tests are gated behind the `rocm` build tag:

```bash
go test -tags rocm -v -run TestROCm ./...
```

These tests require:

- `/dev/kfd` present
- `llama-server` in PATH or `ROCM_LLAMA_SERVER_PATH` set
- The test model at `/data/lem/gguf/LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf` (SMB mount from M3)

Each test calls `skipIfNoROCm(t)` and `skipIfNoModel(t)` so they skip cleanly when hardware or the model mount is unavailable.
**Available integration tests:**

| Test | What it verifies |
|------|------------------|
| `TestROCm_LoadAndGenerate` | Full load + Generate, checks architecture from GGUF metadata |
| `TestROCm_Chat` | Multi-turn Chat with chat template applied by llama-server |
| `TestROCm_ContextCancellation` | Context cancel stops iteration mid-stream |
| `TestROCm_GracefulShutdown` | Server survives context cancel; second Generate succeeds |
| `TestROCm_ConcurrentRequests` | Three goroutines calling Generate simultaneously |
| `TestROCm_DiscoverModels` | DiscoverModels returns non-empty result for model directory |

### Benchmarks (GPU required)

```bash
go test -tags rocm -bench=. -benchtime=3x ./...
```

Benchmarks test three models in sequence (Gemma3-4B, Llama3.1-8B, Qwen2.5-7B). They skip if any model file is absent:

| Benchmark | Metric reported |
|-----------|-----------------|
| `BenchmarkDecode` | tok/s for 128-token generation |
| `BenchmarkTTFT` | µs to first token (time to first token) |
| `BenchmarkConcurrent` | aggregate tok/s with 4 goroutines and 4 parallel slots |

Model load time is excluded from benchmark timing via `b.StopTimer()` / `b.StartTimer()`. VRAM usage is logged after each load via `GetVRAMInfo()`.
**Reference results (RX 7800 XT, ROCm 7.2.0, ctx=2048, benchtime=3x):**

Decode speed:

| Model | tok/s | VRAM used |
|-------|-------|-----------|
| Gemma3-4B-Q4_K_M | 102.5 | 4724 MiB |
| Llama-3.1-8B-Q4_K_M | 77.1 | 6482 MiB |
| Qwen-2.5-7B-Q4_K_M | 84.4 | 6149 MiB |

Time to first token:

| Model | TTFT |
|-------|------|
| Gemma3-4B-Q4_K_M | 13.8 ms |
| Llama-3.1-8B-Q4_K_M | 17.1 ms |
| Qwen-2.5-7B-Q4_K_M | 16.8 ms |

Concurrent throughput (4 parallel slots, 4 goroutines, 32 tokens each):

| Model | Aggregate tok/s | vs single-slot |
|-------|-----------------|----------------|
| Gemma3-4B-Q4_K_M | 238.9 | 2.3x |
| Llama-3.1-8B-Q4_K_M | 166.2 | 2.2x |
| Qwen-2.5-7B-Q4_K_M | 178.0 | 2.1x |

## Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `ROCM_LLAMA_SERVER_PATH` | PATH lookup | Explicit path to the llama-server binary |
| `HIP_VISIBLE_DEVICES` | overridden to `0` | go-rocm always sets this to `0` when spawning llama-server |
| `HSA_OVERRIDE_GFX_VERSION` | unset | Not required; the GPU is native gfx1100 |
| `ROCM_MODEL_DIR` | none | Conventional directory for model files (not read by go-rocm itself) |

`HIP_VISIBLE_DEVICES=0` is set unconditionally by `serverEnv()`, overriding any value in the calling process's environment. This masks the Ryzen 9 9950X's iGPU (device 1), which otherwise causes llama-server to crash when it attempts to split tensors across the iGPU and dGPU.

## VRAM Budget

With 16 GB of VRAM on the RX 7800 XT, the following models fit comfortably:

| Model | Quant | VRAM (model) | Context 4K | Total | Fits? |
|-------|-------|--------------|------------|-------|-------|
| Qwen3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Gemma3-4B | Q4_K_M | ~3 GB | ~0.3 GB | ~3.3 GB | Yes |
| Llama3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Qwen3-8B | Q8_0 | ~9 GB | ~0.5 GB | ~9.5 GB | Yes |
| Gemma3-12B | Q4_K_M | ~7.5 GB | ~0.8 GB | ~8.3 GB | Yes |
| Gemma3-27B | Q4_K_M | ~16 GB | ~1.5 GB | ~17.5 GB | Tight |
| Llama3-70B | Q4_K_M | ~40 GB | ~2 GB | ~42 GB | No (partial offload) |

The context cap (`min(model_context_length, 4096)` by default) is essential for models like Gemma3-4B and Llama-3.1-8B, which have a 131072-token native context. Without the cap, the KV cache allocation alone would exhaust VRAM.

## Test Patterns

Tests use `github.com/stretchr/testify/assert` and `require`. The naming convention from the broader go ecosystem applies:

- `_Good` suffix — happy path
- `_Bad` suffix — expected error conditions
- `_Ugly` suffix — panic or edge cases

Integration tests use `skipIfNoROCm(t)` and `skipIfNoModel(t)` guards. Never use `t.Fatal` to skip; always use `t.Skip`.

When writing new unit tests that do not need GPU hardware, do not add the `rocm` build tag. The `linux && amd64` constraint is sufficient for tests that exercise Linux-specific code paths.

## Coding Standards

- **Language**: UK English throughout. Colour, organisation, initialise, behaviour — never American spellings
- **Strict types**: `declare(strict_types=1)` is a PHP convention, but the same spirit applies in Go: use concrete types, and avoid `any` except where an interface demands it
- **Error messages**: lower case, no trailing punctuation, prefixed with the package context: `"rocm: ..."`, `"llamacpp: ..."`, `"gguf: ..."`
- **Formatting**: `gofmt` / `goimports`. No exceptions
- **Licence**: EUPL-1.2.
  All new files must include the licence header if adding a file header comment.

## Conventional Commits

Use the conventional commits format:

```
type(scope): description

feat(server): add GPU layer count override via environment variable
fix(gguf): handle uint64 context_length from v3 producers
test(integration): add DiscoverModels test for SMB mount
docs(architecture): update VRAM budget table
```

Types: `feat`, `fix`, `test`, `docs`, `refactor`, `perf`, `chore`

## Co-Authorship

All commits must include the co-author trailer:

```
Co-Authored-By: Virgil
```

## Adding a New Backend Feature

The typical sequence for a new go-rocm feature:

1. If the feature requires a go-inference interface change (a new `LoadOption`, `GenerateOption`, or `TextModel` method), make that change first in go-inference and coordinate with Virgil (the orchestrator) before implementing the consumer side
2. Write unit tests first; most server and client behaviour is testable without GPU hardware
3. If integration testing on the homelab is needed, use the `//go:build rocm` tag
4. Update `docs/architecture.md` if the data flow or component structure changes
5. Record benchmark results in `docs/history.md` under the relevant phase if performance characteristics change materially