diff --git a/CLAUDE.md b/CLAUDE.md
index c4e1efa..41afd1c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -8,95 +8,80 @@ Implements `inference.Backend` and `inference.TextModel` (from `core/go-inference`)
 ## Target Hardware
-- **GPU**: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16GB VRAM) — NOTE: gfx1100 not gfx1101
+- **GPU**: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM) — confirmed gfx1100, not gfx1101
 - **OS**: Ubuntu 24.04 LTS (linux/amd64)
-- **ROCm**: 6.x+ (gfx1100/gfx1101 officially supported)
-- **Kernel**: 6.10+ recommended for RDNA 3 stability
+- **ROCm**: 7.2.0 installed
+- **Kernel**: 6.17.0
 ## Commands
 ```bash
-go test ./... # Run all tests (stubs on non-Linux)
-go test -tags rocm ./... # Run with ROCm integration tests
-
-# On the Linux homelab:
-go test -v -run TestROCm ./... # Full GPU tests
+go test ./... # Unit tests (no GPU required)
+go test -tags rocm ./... # Integration tests + benchmarks (GPU required)
+go test -tags rocm -v -run TestROCm ./... # Full GPU tests only
+go test -tags rocm -bench=. -benchtime=3x ./... # Benchmarks
 ```
 ## Architecture
+See `docs/architecture.md` for full details.
+ ``` -go-rocm (this package) -├── rocm.go Package doc -├── register_rocm.go //go:build linux && amd64 — auto-registers via init() -├── rocm_stub.go //go:build !linux || !amd64 — ROCmAvailable() false -├── backend.go inference.Backend implementation -├── model.go inference.TextModel implementation (TODO) -├── server.go llama-server lifecycle management (TODO) +go-rocm/ +├── backend.go inference.Backend (linux && amd64) +├── model.go inference.TextModel (linux && amd64) +├── server.go llama-server subprocess lifecycle +├── vram.go VRAM monitoring via sysfs +├── discover.go GGUF model discovery +├── register_rocm.go auto-registers via init() (linux && amd64) +├── rocm_stub.go stubs for non-linux/non-amd64 └── internal/ - └── llamacpp/ llama-server HTTP client (TODO) - ├── client.go OpenAI-compatible API client - └── health.go Health check + readiness probe + ├── llamacpp/ llama-server HTTP client + health check + └── gguf/ GGUF v2/v3 binary metadata parser ``` -### How It Works +## Critical: iGPU Crash -1. `LoadModel()` spawns `llama-server` (llama.cpp) as a subprocess -2. llama-server loads the GGUF model onto the AMD GPU via HIP/ROCm -3. `Generate()` / `Chat()` make HTTP requests to llama-server's OpenAI-compatible API -4. Token streaming via SSE (Server-Sent Events) from llama-server -5. `Close()` sends SIGTERM to llama-server, waits for clean exit - -This is the subprocess approach (not CGO). It's simpler, more maintainable, and llama.cpp's server mode is battle-tested. - -### Dependencies - -- `forge.lthn.ai/core/go-inference` — shared TextModel/Backend interfaces -- llama-server binary (external, not Go dependency) built with `-DGGML_HIP=ON` +The Ryzen 9 9950X iGPU appears as ROCm Device 1. llama-server crashes trying to split tensors across it. `serverEnv()` always sets `HIP_VISIBLE_DEVICES=0`. Do not remove or weaken this. 
## Building llama-server with ROCm ```bash -# On the Linux homelab: -sudo apt install rocm-dev rocm-libs # ROCm 6.x - -git clone https://github.com/ggml-org/llama.cpp -cd llama.cpp cmake -B build \ -DGGML_HIP=ON \ -DAMDGPU_TARGETS=gfx1100 \ -DGGML_HIP_ROCWMMA_FATTN=ON \ -DCMAKE_BUILD_TYPE=Release -cmake --build build --parallel -t llama-server - -# Binary at build/bin/llama-server -# Copy to /usr/local/bin/ or set ROCM_LLAMA_SERVER_PATH +cmake --build build --parallel $(nproc) -t llama-server +sudo cp build/bin/llama-server /usr/local/bin/llama-server ``` -### Performance Tip +## Environment Variables -The RX 7800 XT is gfx1101 but the ROCm compiler generates identical code for gfx1100. Setting: -```bash -export HSA_OVERRIDE_GFX_VERSION=11.0.0 -``` -...gives better performance on some ROCm versions. Benchmark both. +| Variable | Default | Purpose | +|----------|---------|---------| +| `ROCM_LLAMA_SERVER_PATH` | PATH lookup | Path to llama-server binary | +| `HIP_VISIBLE_DEVICES` | overridden to `0` | Always forced to 0 — do not rely on ambient value | ## Coding Standards - UK English - Tests: testify assert/require +- Build tags: `linux && amd64` for GPU code, `rocm` for integration tests - Conventional commits - Co-Author: `Co-Authored-By: Virgil ` - Licence: EUPL-1.2 ## Coordination -- **Virgil** (core/go) is the orchestrator — writes tasks here -- **go-mlx Claude** is the sibling — Metal backend on macOS, same interface contract +- **Virgil** (core/go) is the orchestrator — writes tasks and reviews PRs +- **go-mlx** is the sibling — Metal backend on macOS, same interface contract - **go-inference** defines the shared TextModel/Backend interfaces both backends implement - **go-ml** wraps both backends into the scoring engine -## Task Queue +## Documentation -See `TODO.md` for prioritised work. -See `FINDINGS.md` for research notes. 
+- `docs/architecture.md` — component design, data flow, interface contracts +- `docs/development.md` — prerequisites, test commands, benchmarks, coding standards +- `docs/history.md` — completed phases, commit hashes, known limitations +- `docs/plans/` — phase design documents (read-only reference) diff --git a/FINDINGS.md b/FINDINGS.md deleted file mode 100644 index 11b0845..0000000 --- a/FINDINGS.md +++ /dev/null @@ -1,338 +0,0 @@ -# FINDINGS.md — go-rocm Research & Discovery - ---- - -## 2026-02-19: Package Creation (Virgil) - -### Hardware - -- **GPU**: AMD Radeon RX 7800 XT -- **Architecture**: RDNA 3, gfx1101 -- **VRAM**: 16GB GDDR6 -- **Compute Units**: 60 -- **OS**: Linux (Ubuntu, homelab machine) - -### ROCm Support Status - -- gfx1100/gfx1101 officially supported in ROCm 6.x+ -- Supported on Ubuntu 24.04.3 and 22.04.5 -- Kernel 6.10+ recommended for RDNA 3 stability -- `/dev/kfd` device node required (amdgpu kernel driver) - -Sources: -- [ROCm system requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) -- [ROCm compatibility matrix](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) - -### llama.cpp + ROCm - -llama.cpp has mature ROCm/HIP support. 
Build flags: - -```bash -cmake -B build \ - -DGGML_HIP=ON \ - -DAMDGPU_TARGETS=gfx1100 \ - -DGGML_HIP_ROCWMMA_FATTN=ON \ - -DCMAKE_BUILD_TYPE=Release -``` - -Key findings: -- RX 7800 XT is gfx1101, but ROCm compiler generates identical code for gfx1100 -- `HSA_OVERRIDE_GFX_VERSION=11.0.0` may give better performance (benchmark needed) -- rocWMMA flash attention (`-DGGML_HIP_ROCWMMA_FATTN=ON`) available for RDNA 3+ -- Docker images may not support hipBLASLt for gfx1100, falling back to hipBLAS -- llama-server provides OpenAI-compatible API with SSE streaming - -Sources: -- [llama.cpp ROCm build docs](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) -- [llama.cpp ROCm compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html) -- [llama.cpp ROCm install guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html) -- [RX 7800 XT build discussion](https://github.com/ggml-org/llama.cpp/discussions/11572) - -### Design Decision: Subprocess vs CGO - -**Chose subprocess** (llama-server) over direct HIP CGO bindings because: - -1. **Maturity**: llama-server is battle-tested with millions of users. Direct HIP CGO would take months to reach comparable stability. -2. **Model support**: llama.cpp supports 50+ model architectures via GGUF. CGO would start with zero. -3. **Maintenance**: llama.cpp team handles ROCm compatibility. We just build the binary. -4. **Isolation**: GPU crashes in the subprocess don't take down the Go process. -5. **Portability**: Same approach works for NVIDIA (CUDA build), Intel (SYCL build) with minimal code changes. - -Trade-offs: -- Subprocess adds ~50ms latency for first token (process startup + model load) -- Inter-process communication overhead (HTTP vs in-process) -- Can't share GPU memory between Go process and llama-server - -The go-mlx package uses direct CGO because MLX is a C library designed for embedding. 
llama.cpp's primary API is its server mode. - -### VRAM Budget (16GB) - -| Model | Quant | VRAM (model) | Context (4K) | Total | Fits? | -|-------|-------|-------------|-------------|-------|-------| -| Qwen3-8B | Q4_K_M | ~5GB | ~0.5GB | ~5.5GB | Yes | -| Gemma3-4B | Q4_K_M | ~3GB | ~0.3GB | ~3.3GB | Yes | -| Llama3-8B | Q4_K_M | ~5GB | ~0.5GB | ~5.5GB | Yes | -| Qwen3-8B | Q8_0 | ~9GB | ~0.5GB | ~9.5GB | Yes | -| Llama3-70B | Q4_K_M | ~40GB | ~2GB | ~42GB | No (partial offload) | - -16GB VRAM comfortably runs any 8B model in Q4 or Q8 quantisation. 13B models fit in Q4. Larger models need partial GPU offload (GPULayers option). - ---- - -## 2026-02-19: Sibling Architecture (go-mlx comparison) - -| Aspect | go-mlx (macOS) | go-rocm (Linux) | -|--------|---------------|-----------------| -| GPU | Apple Metal (M-series) | AMD ROCm (RDNA 3) | -| Build tag | `darwin && arm64` | `linux && amd64` | -| Approach | Direct CGO (mlx-c) | Subprocess (llama-server) | -| Model format | Safetensors | GGUF | -| Shared interface | `go-inference.TextModel` | `go-inference.TextModel` | -| Memory control | `SetCacheLimit`, `GetActiveMemory` | `rocm-smi` / HIP API | -| Chat templates | Built into model code | llama-server `--chat-template` | - -Both register as `inference.Backend` via build-tagged `init()`. go-ml wraps both transparently. 
- ---- - -## 2026-02-19: Phase 0 Environment Validation (Charon) - -### Actual Hardware (corrected from Virgil's notes) - -- **GPU arch**: gfx1100 (NOT gfx1101 — `rocminfo` confirms) -- **ROCm version**: 7.2.0 (newer than the 6.x minimum) -- **Kernel**: 6.17.0-14-generic -- **`/dev/kfd`**: Present, working -- **HSA_OVERRIDE_GFX_VERSION**: Not needed — native gfx1100 - -### llama-server Build - -- **Source**: llama.cpp commit `11c325c` (cloned 19 Feb 2026) -- **Local build path**: `/home/claude/llama.cpp/build/bin/llama-server` -- **Installed to**: `/usr/local/bin/llama-server` -- **Build command**: - ```bash - cmake -B build \ - -DGGML_HIP=ON \ - -DAMDGPU_TARGETS=gfx1100 \ - -DGGML_HIP_ROCWMMA_FATTN=ON \ - -DCMAKE_BUILD_TYPE=Release - cmake --build build --parallel $(nproc) -t llama-server - ``` - -### Critical: iGPU Crash - -**The Ryzen 9 9950X has an integrated GPU** that ROCm detects as a second device: -- Device 0: AMD Radeon RX 7800 XT (gfx1100) — 16GB VRAM (real) -- Device 1: AMD Radeon Graphics (gfx1100) — reports 100GB free (system RAM, misleading) - -llama-server's auto-fit logic tries to split the model across both devices. Loading tensors to Device 1 (iGPU) causes **`ROCm error: unspecified launch failure`** and crashes with a core dump. - -**Fix**: Set `HIP_VISIBLE_DEVICES=0` to mask the iGPU. The go-rocm package MUST set this env var before spawning llama-server. - -### Baseline Benchmarks — Gemma3-4B-Q4_K_M - -| Metric | Value | -|--------|-------| -| Model | LEK-Gemma3-4B-Q4_K_M (2.66 GiB) | -| VRAM used | ~3.4 GiB of 16 GiB | -| Prefill (prompt) | 396 tok/s (2.5ms/tok) | -| Decode (generation) | 109 tok/s (9.2ms/tok) | -| Time to first token | ~40ms (16 token prompt) | -| Startup time | ~6s (load + warmup) | -| Context window | 4096 (model supports 131072) | -| Flash attention | Auto-enabled | -| Slots | 4 concurrent | - -### GGUF Models Available - -All at `/data/lem/gguf/` (SMB mount from M3): - -| Model | Size | Fits 16GB? 
| -|-------|------|-----------| -| LEK-Gemma3-1B-layered-v2-Q5_K_M | ~0.9G | Yes | -| LEK-Gemma3-1B-layered-v2-Q8_0 | ~1.4G | Yes | -| LEK-Gemma3-4B-Q4_K_M | 2.7G | Yes | -| LEK-Gemma3-12B-Q4_K_M | ~7.5G | Yes | -| LEK-Gemma3-27B-Q4_K_M | ~16G | Tight | -| LEK-Llama-3.1-8B-Q4_K_M | ~5G | Yes | -| LEK-Mistral-7B-v0.3-Q4_K_M | ~4G | Yes | -| LEK-Qwen-2.5-7B-Q4_K_M | ~4G | Yes | - -### Environment Variables for go-rocm - -The server.go implementation MUST set these when spawning: - -```go -cmd.Env = append(os.Environ(), - "HIP_VISIBLE_DEVICES=0", // Critical: mask iGPU to prevent crash -) -``` - -### Model Path Note - -Models are on SMB mount (`/data` = `//10.69.69.108/Data`). For CI/testing, copy a small model locally or use `t.Skip()` when the mount is unavailable. - ---- - -## 2026-02-19: Phase 1 Plan Review — Interface Questions - -### QUESTION: Token.ID not populated by llama-server SSE - -llama-server's OpenAI-compatible streaming API (`/v1/chat/completions`, `/v1/completions`) does not include token IDs in the default SSE response. The `inference.Token` struct has `ID int32` and `Text string` — go-rocm will set `Text` but leave `ID` as 0 for all tokens. - -Token IDs are available via `logprobs: true` in the request, but this adds overhead and requires parsing the `logprobs.tokens` field. - -**Decision needed from Virgil:** Does any consumer (go-ml, go-i18n, go-ai) rely on `Token.ID`? If only `Token.Text` is used downstream, ID=0 is acceptable for Phase 1. If ID is needed, we'll add logprobs parsing. - -**ANSWER (Charon, 19 Feb 2026):** Token.ID = 0 is acceptable for Phase 1. No downstream consumer uses Token.ID today — go-ml's scoring engine and go-i18n both only read Token.Text. If a consumer needs IDs later, add logprobs parsing in Phase 2. Don't over-engineer now. - -### QUESTION: StopTokens type mismatch - -`GenerateConfig.StopTokens` is `[]int32` (token IDs), but llama-server's OpenAI-compatible API expects `"stop"` as `[]string` (text sequences). 
These are fundamentally different — token IDs cannot be mapped to stop strings without a tokeniser. - -Options: -1. Ignore `StopTokens` in go-rocm Phase 1 (no consumer uses it yet) -2. Use llama-server's native `/completion` endpoint which supports `id_slot` stop tokens -3. Add `StopStrings []string` to `GenerateConfig` in go-inference alongside the existing `StopTokens []int32`, let each backend use whichever it supports - -**Decision needed from Virgil:** Which approach? Option 3 would be a go-inference interface change. Option 1 is simplest for now — go-rocm silently ignores StopTokens if set. - -**ANSWER (Charon, 19 Feb 2026):** Option 1 — ignore StopTokens in Phase 1. No consumer uses them yet. The go-inference interface change (Option 3) should come from a real need, not a hypothetical one. YAGNI. - ---- - -## 2026-02-19: Phase 1 Plan Review (Charon) - -### Verdict: Approved - -Design and implementation plan reviewed. The layered architecture (internal/llamacpp → server → model → backend) is correct. 8-task TDD breakdown is solid. Tasks 1-6 unit-testable without GPU, Task 7 needs hardware. - -### Notes for Implementation - -1. **guessModelType() filename parsing** — Pragmatic but fragile. Fine for Phase 1. llama-server's `/props` endpoint returns the actual architecture. Note as a Phase 2 upgrade. - -2. **serverEnv() HIP_VISIBLE_DEVICES override** — Current approach appends `HIP_VISIBLE_DEVICES=0` to `os.Environ()`. If the user already has `HIP_VISIBLE_DEVICES` set, both values exist in the env slice. Last-write-wins behaviour depends on the kernel and is platform-specific. Safer to filter the existing value out first: - - ```go - func serverEnv() []string { - env := os.Environ() - filtered := make([]string, 0, len(env)+1) - for _, e := range env { - if !strings.HasPrefix(e, "HIP_VISIBLE_DEVICES=") { - filtered = append(filtered, e) - } - } - return append(filtered, "HIP_VISIBLE_DEVICES=0") - } - ``` - -3. 
**`//go:build rocm` for integration tests** — Good call. Keeps `go test ./...` fast on machines without GPU. - ---- - -## 2026-02-19: Phase 2 Robustness (Charon) - -### Concurrent Requests - -Tested 3 goroutines calling Generate() simultaneously on the same model (Gemma3-1B, llama-server with default settings). All 3 received output (~0.9s total). llama-server handles concurrency via its slot system — default is 1 slot, so requests are serialised server-side. - -For true parallel inference, use `--parallel N` flag in llama-server (not yet configurable via go-rocm). VRAM cost scales with number of slots and context size. - -### VRAM Monitoring - -Reading sysfs directly (`/sys/class/drm/cardN/device/mem_info_vram_*`) instead of spawning `rocm-smi`. Auto-detects dGPU by selecting the card with the largest VRAM total: -- card0 = iGPU (2GB) — Ryzen 9 9950X integrated -- card1 = dGPU (16GB) — RX 7800 XT - -Note: sysfs reads are non-atomic. Total and Used are read separately, so transient inconsistencies are possible under heavy allocation churn. Free is clamped to prevent uint64 underflow. - -### lastErr Design Limitation - -`rocmModel.lastErr` is a single mutex-protected field shared across all callers. With concurrent Generate/Chat calls, errors can be clobbered (last writer wins). `Err()` is only reliable in single-caller scenarios. This matches the go-inference interface contract (single `Err() error` method), so it's a known limitation, not a bug. Per-call error returns would require an interface change in go-inference. - ---- - -## 2026-02-19: Phase 3 Model Support (Charon) - -### GGUF Metadata Parser - -New `internal/gguf/` package reads GGUF v2/v3 binary headers. Extracts metadata KV pairs without reading tensor data (<1ms per file). Supports all 13 GGUF value types (uint8..float64, string, array, bool). String length capped at 1 MiB to prevent memory exhaustion from malformed files. 
Handles uint64 values for context_length/block_count (some producers use uint64 instead of uint32). - -### Model Inventory - -Discovered models from `/data/lem/gguf/` using GGUF metadata: - -| Model | Architecture | Size | Quant | Context | Blocks | -|-------|-------------|------|-------|---------|--------| -| Gemma3-1B Q5_K_M | gemma3 | 1B | Q5_K_M | 32768 | 26 | -| Gemma3-1B Q8_0 | gemma3 | 1B | Q8_0 | 32768 | 26 | -| Gemma3-4B Q4_K_M | gemma3 | 4B | Q4_K_M | 131072 | 34 | -| Gemma3-12B Q4_K_M | gemma3 | 12B | Q4_K_M | 131072 | 42 | -| Gemma3-27B Q4_K_M | gemma3 | 27B | Q4_K_M | 131072 | 46 | -| Llama-3.1-8B Q4_K_M | llama | 8B | Q4_K_M | 131072 | 32 | -| Mistral-7B-v0.3 Q4_K_M | llama | 7B | Q4_K_M | 32768 | 32 | -| Qwen-2.5-7B Q4_K_M | qwen2 | 7B | Q4_K_M | 32768 | 28 | - -Key observations: -- Mistral-7B-v0.3 reports `general.architecture = "llama"` (correct — Mistral is a Llama architecture variant). Old `guessModelType` returned "mistral", GGUF metadata returns "llama". -- Qwen-2.5-7B reports `general.architecture = "qwen2"` (not "qwen3"). Old `guessModelType` would have returned "qwen" due to filename matching. -- Gemma3-4B/12B/27B have 131072 native context — without auto-capping at 4096, these would exhaust VRAM. - -### Chat Templates - -llama-server reads `tokenizer.chat_template` from the GGUF and applies it automatically on `/v1/chat/completions`. No go-rocm code needed. Verified working with Gemma3 integration tests. - -### Context Window Auto-Detection - -Default context capped at `min(model_context_length, 4096)` when user doesn't specify `inference.WithContextLen(N)`. Without this cap, Llama-3.1 would try to allocate 131072 context (~4GB KV cache), which combined with model weights would not fit in 16GB VRAM for larger models. - ---- - -## 2026-02-19: Phase 4 Performance (Charon) - -### Benchmark Results — RX 7800 XT (gfx1100, ROCm 7.2.0) - -All benchmarks run with `ctx=2048`, `testing.B`, `benchtime=3x`. 
- -#### Decode Speed (128 tokens) - -| Model | tok/s | VRAM Used | -|-------|-------|-----------| -| Gemma3-4B-Q4_K_M | 102.5 | 4724 MiB | -| Llama-3.1-8B-Q4_K_M | 77.1 | 6482 MiB | -| Qwen-2.5-7B-Q4_K_M | 84.4 | 6149 MiB | - -#### Time-to-First-Token - -| Model | TTFT | -|-------|------| -| Gemma3-4B-Q4_K_M | 13.8 ms | -| Llama-3.1-8B-Q4_K_M | 17.1 ms | -| Qwen-2.5-7B-Q4_K_M | 16.8 ms | - -#### Concurrent Throughput (4 parallel slots, 4 goroutines, 32 tokens each) - -| Model | Aggregate tok/s | vs Single | -|-------|----------------|-----------| -| Gemma3-4B-Q4_K_M | 238.9 | 2.3x | -| Llama-3.1-8B-Q4_K_M | 166.2 | 2.2x | -| Qwen-2.5-7B-Q4_K_M | 178.0 | 2.1x | - -Parallel slots give ~2.2x throughput improvement with 4 concurrent requests. Per-request latency increases but aggregate throughput scales well. - -### Flash Attention Comparison - -Compared llama-server built with `-DGGML_HIP_ROCWMMA_FATTN=ON` vs without, at ctx=2048: - -| Model | With FA (tok/s) | Without FA (tok/s) | Difference | -|-------|----------------|-------------------|------------| -| Gemma3-4B | 102.5 | 107.2 | -4.4% | -| Llama-3.1-8B | 77.1 | 77.7 | -0.9% | -| Qwen-2.5-7B | 84.4 | 84.4 | 0% | - -**Conclusion:** Flash attention shows no benefit at ctx=2048. rocWMMA flash attention is designed for large context windows where the KV cache becomes a bottleneck. At 2048 context, standard attention is faster (or equal). Flash attention benefits would appear at ctx=8192+ where the quadratic attention cost dominates. Keeping FA enabled is harmless — it auto-activates only when beneficial. - -### Parallel Slots - -Added `ParallelSlots int` to go-inference's `LoadConfig` and `WithParallelSlots(n int) LoadOption`. go-rocm passes `--parallel N` to llama-server. Each slot allocates its own KV cache, so VRAM usage scales with `parallelSlots * contextLen`. With 4 slots at ctx=2048, VRAM overhead is modest (~200 MiB extra for Gemma3-4B). 
diff --git a/TODO.md b/TODO.md deleted file mode 100644 index 4991427..0000000 --- a/TODO.md +++ /dev/null @@ -1,82 +0,0 @@ -# TODO.md — go-rocm Task Queue - -Dispatched from core/go orchestration. Pick up tasks in order. - ---- - -## Phase 0: Environment Setup (on Linux homelab) - -- [x] **Install ROCm 6.x** — ROCm 7.2.0 already installed. `rocm-smi` shows RX 7800 XT (gfx1100). Kernel 6.17.0. (Charon, 19 Feb 2026) -- [x] **Build llama-server with HIP** — Built from llama.cpp `11c325c`. Installed to `/usr/local/bin/llama-server`. (Charon, 19 Feb 2026) -- [x] **Test manual inference** — Gemma3-4B-Q4_K_M: 109 tok/s decode, 396 tok/s prefill. See FINDINGS.md for full results. (Charon, 19 Feb 2026) -- [x] **HSA_OVERRIDE_GFX_VERSION benchmark** — N/A: GPU is actually gfx1100 (not gfx1101 as Virgil noted). No override needed. (Charon, 19 Feb 2026) - -### Critical Discovery: iGPU Crash - -The Ryzen 9 9950X iGPU shows up as ROCm Device 1, reports 100GB free (system RAM), and crashes llama-server when it tries to split tensors across devices. **`HIP_VISIBLE_DEVICES=0` is REQUIRED** when spawning llama-server. See FINDINGS.md for details. - -## Phase 1: Core Implementation - -- [x] **GPU detection** — `Available()` checks `/dev/kfd` + `findLlamaServer()`. Commit `1d8d65f`. (19 Feb 2026) -- [x] **Server lifecycle** — `server.go`: spawn, health poll (100ms/60s timeout), SIGTERM/SIGKILL shutdown. `serverEnv()` filters HIP_VISIBLE_DEVICES. Commit `9aa7f62`. (19 Feb 2026) -- [x] **HTTP client** — `internal/llamacpp/`: health check, SSE parser, ChatComplete + Complete with `iter.Seq[string]`. Commits `3c75677`, `def3167`. (19 Feb 2026) -- [x] **TextModel implementation** — `model.go`: wraps llamacpp client, maps inference types, mutex-protected Err(). Commit `a8c4947`. (19 Feb 2026) -- [x] **Integration test** — 3 tests (Generate, Chat, ContextCancellation) with Gemma3-1B on RX 7800 XT. All pass. Commit `0e68d71`. 
(19 Feb 2026) - -## Phase 2: Robustness - -- [x] **Server crash recovery** — `server.alive()` detects process exit; Generate/Chat return error immediately if dead. Commits `2c4966e`, `c07f37a`. (Charon, 19 Feb 2026) -- [x] **Port conflict handling** — `startServer()` retries up to 3 times with new port on process exit. Only retries on exit, not timeout. Commits `c50a8e9`, `b7342ec`. (Charon, 19 Feb 2026) -- [x] **Graceful shutdown** — Already worked in Phase 1. Integration test confirms server survives context cancellation and generates again. Commit `a6e647c`. (Charon, 19 Feb 2026) -- [x] **Memory monitoring** — `GetVRAMInfo()` reads sysfs, auto-detects dGPU by largest VRAM. Uint64 underflow guard on Free. Commits `501de83`, `954c570`. (Charon, 19 Feb 2026) -- [x] **Concurrent requests** — 3 goroutines calling Generate() simultaneously all get output. llama-server serialises via 1 slot (default). Commit `a6e647c`. (Charon, 19 Feb 2026) - -## Phase 3: Model Support - -- [x] **GGUF metadata parser** — `internal/gguf/` reads GGUF v2/v3 binary headers. Extracts architecture, name, file type, size label, context length, block count. String length limits for malformed input protection. Commit `c7c9389`. (Charon, 19 Feb 2026) -- [x] **GGUF model discovery** — `DiscoverModels(dir)` scans directory for `.gguf` files, parses metadata via GGUF parser, returns `[]ModelInfo`. Commit `af23565`. (Charon, 19 Feb 2026) -- [x] **LoadModel enrichment** — Replaced `guessModelType` with GGUF metadata for real architecture. Auto-caps context at 4096 when user doesn't specify. Commit `2c77f6f`. (Charon, 19 Feb 2026) -- [x] **Chat templates** — llama-server reads `tokenizer.chat_template` from GGUF natively on `/v1/chat/completions`. No go-rocm code needed. Verified with Gemma3 integration test. (Charon, 19 Feb 2026) -- [x] **Context window sizing** — Auto-detected from GGUF metadata. Default caps at `min(model_context_length, 4096)` to prevent VRAM exhaustion. 
(Charon, 19 Feb 2026) - -## Phase 4: Performance - -- [x] **Benchmark suite** — Go testing.B benchmarks: BenchmarkDecode, BenchmarkTTFT, BenchmarkConcurrent across Gemma3-4B, Llama3.1-8B, Qwen2.5-7B. Commit `870ee23`. (Charon, 19 Feb 2026) -- [x] **Flash attention** — Compared with/without `-DGGML_HIP_ROCWMMA_FATTN=ON`. No significant difference at ctx=2048. Benefits expected at larger context windows. See FINDINGS.md. (Charon, 19 Feb 2026) -- [x] **Batch inference** — Added `ParallelSlots` to go-inference LoadConfig. `--parallel N` passed to llama-server. 4-slot concurrent: Gemma3-4B 239 tok/s aggregate (vs 103 single). Commits `3719734` (go-inference), `72120bb` (go-rocm). (Charon, 19 Feb 2026) - -## Phase 5: Alternative Backends - -- [ ] **Direct HIP/CGO** — Evaluate whether direct HIP CGO bindings (like go-mlx does for Metal) would be worth the effort. Only if llama-server subprocess becomes a bottleneck. -- [ ] **vLLM backend** — vLLM supports ROCm and has better batching. Could be an alternative subprocess backend for high-throughput scenarios. 
- ---- - -## Model Inventory (on Linux homelab) - -Download to `/data/models/` (or wherever the homelab stores data): -- [ ] Qwen3-8B-Q4_K_M.gguf (~5GB, fits 16GB VRAM with room for context) -- [ ] Gemma3-4B-Q4_K_M.gguf (~3GB) -- [ ] Llama-3.1-8B-Q4_K_M.gguf (~5GB) - -## Environment Variables - -| Variable | Default | Purpose | -|----------|---------|---------| -| `ROCM_LLAMA_SERVER_PATH` | `llama-server` (PATH lookup) | Path to llama-server binary | -| `HIP_VISIBLE_DEVICES` | `0` (MUST set) | Mask iGPU — Ryzen 9 iGPU crashes llama-server | -| `HSA_OVERRIDE_GFX_VERSION` | unset | Not needed (GPU is native gfx1100) | -| `ROCM_MODEL_DIR` | none | Default directory for model discovery | - -## Upstream Dependencies - -- **go-inference** defines the TextModel/Backend interfaces this package implements -- **go-ml** will wrap this backend (Virgil creates backend_rocm.go when the API is ready) -- **go-i18n** may use this for batch classification on Linux (Phase 4) - -## Workflow - -1. Virgil in core/go writes tasks here after research -2. This repo's session (on Linux homelab) picks up tasks in phase order -3. Mark `[x]` when done, note commit hash -4. New discoveries → add tasks, flag in FINDINGS.md diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..278a949 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,197 @@ +# go-rocm Architecture + +## Overview + +go-rocm provides AMD ROCm GPU inference for Linux by managing llama-server as a subprocess. It implements the `inference.Backend` and `inference.TextModel` interfaces from go-inference, making the AMD GPU available to the broader Go ML ecosystem (go-ml, go-ai, go-i18n) without any CGO in the package itself. + +Module path: `forge.lthn.ai/core/go-rocm` + +## Design Choice: Subprocess over CGO + +The package uses llama-server (from llama.cpp) as a managed subprocess rather than direct HIP CGO bindings. 
This decision was deliberate: + +- llama-server supports 50+ model architectures via GGUF without any additional work in this package +- llama.cpp's ROCm/HIP compatibility is maintained by the llama.cpp team +- GPU crashes in the subprocess cannot take down the host Go process +- The same subprocess pattern works for NVIDIA (CUDA build) and Intel (SYCL build) with minimal code changes + +The trade-offs are modest: a small HTTP overhead compared to in-process function calls, and an additional ~50ms latency during model load while the server process starts. For inference workloads these costs are negligible. + +The sibling package go-mlx takes the CGO approach because MLX is a C library designed for embedding. llama.cpp's primary supported interface is its server mode. + +## Package Structure + +``` +go-rocm/ +├── rocm.go Package doc and exported types (VRAMInfo, ModelInfo) +├── register_rocm.go linux && amd64 — auto-registers via init() +├── rocm_stub.go !linux || !amd64 — safe stubs for cross-compilation +├── backend.go inference.Backend implementation +├── model.go inference.TextModel implementation +├── server.go llama-server lifecycle management +├── vram.go VRAM monitoring via sysfs +├── discover.go GGUF model discovery +└── internal/ + ├── llamacpp/ + │ ├── health.go HTTP client and health check + │ └── client.go OpenAI-compatible streaming client + └── gguf/ + └── gguf.go GGUF v2/v3 binary metadata parser +``` + +## Build Tags + +The package uses build constraints to ensure correctness across platforms: + +- `//go:build linux && amd64` — all GPU-touching code: `backend.go`, `model.go`, `server.go`, `vram.go`, `register_rocm.go` +- `//go:build !linux || !amd64` — `rocm_stub.go` provides `ROCmAvailable() bool { return false }` and a `GetVRAMInfo()` that returns an error, allowing the package to compile everywhere +- `//go:build rocm` — integration tests and benchmarks, gated behind an explicit tag to keep `go test ./...` fast on machines without GPU hardware +- 
`discover.go` has no build constraint; GGUF file parsing is portable + +## Auto-Registration + +On Linux/amd64, `register_rocm.go` calls `inference.Register(&rocmBackend{})` in an `init()` function. Any program that blank-imports go-rocm gets the backend automatically: + +```go +import _ "forge.lthn.ai/core/go-rocm" +``` + +The backend is then available to `inference.LoadModel()` from go-inference, which iterates registered backends and calls `Available()` on each to select one. + +## Inference Flow + +### 1. Availability Check + +`rocmBackend.Available()` verifies two conditions: + +- `/dev/kfd` exists — confirms the amdgpu kernel driver is loaded and ROCm is functional +- `findLlamaServer()` succeeds — checks `ROCM_LLAMA_SERVER_PATH` env var first, then PATH + +If either check fails, `Available()` returns false and the backend is skipped. + +### 2. Model Loading + +`LoadModel(path, opts...)` orchestrates the full startup sequence: + +1. Calls `findLlamaServer()` to locate the binary +2. Calls `gguf.ReadMetadata(path)` to extract the model's native context length and architecture without loading tensors +3. Applies the context length cap: `min(model_context_length, 4096)` when the caller has not specified a context length explicitly. This prevents VRAM exhaustion on models with 128K+ native context +4. Calls `startServer()` with the resolved parameters +5. Returns a `*rocmModel` wrapping the running server + +### 3. Server Lifecycle + +`startServer()` in `server.go` manages the subprocess: + +**Port selection**: `freePort()` asks the kernel for an available TCP port by listening on `127.0.0.1:0` and recording the assigned port before closing the listener. + +**Environment preparation**: `serverEnv()` copies the current process environment, strips any existing `HIP_VISIBLE_DEVICES` entry (even if the operator has set it to something else), and appends `HIP_VISIBLE_DEVICES=0`. 
This is critical: the Ryzen 9 9950X's integrated GPU appears as ROCm Device 1 and reports approximately 100 GB free (it is using system RAM). Without masking, llama-server's auto-fit logic splits tensors across both devices and crashes with `ROCm error: unspecified launch failure`. + +**Process start**: `exec.Command` spawns llama-server with: +``` +--model +--host 127.0.0.1 +--port +--n-gpu-layers (999 by default = all layers on GPU) +--ctx-size (when specified) +--parallel (when ParallelSlots > 0) +``` + +**Readiness polling**: `waitReady()` polls `GET /health` every 100ms with a 60-second deadline. It selects across three channels simultaneously: the context deadline, the `exited` channel (process died before becoming ready), and the ticker. Model load time is typically 6–10 seconds for a 4–8B model. + +**Retry on port conflict**: If the process exits during startup (exit before the health check passes), `startServer()` retries up to 3 times with a freshly selected port. Timeouts are not retried — a stuck server is a different failure mode. + +**Shutdown**: `server.stop()` sends SIGTERM and waits up to 5 seconds for a clean exit. If the process has not exited after 5 seconds, it sends SIGKILL and waits for the channel to close. + +### 4. Token Streaming + +`rocmModel.Generate()` maps to `/v1/completions`. `rocmModel.Chat()` maps to `/v1/chat/completions`. Both: + +1. Check `server.alive()` by reading from the `exited` channel non-blockingly. If the server has died, an error is recorded in `lastErr` and an empty iterator is returned immediately +2. Build the request struct with sampling parameters (temperature, top-k, top-p, repeat penalty, max tokens) +3. Call the appropriate client method, which returns `(iter.Seq[string], func() error)` +4. Wrap the chunk iterator into an `iter.Seq[inference.Token]`, setting `Token.Text` from each chunk and leaving `Token.ID` as zero (llama-server's OpenAI-compatible streaming API does not return token IDs) +5. 
After the iterator completes, call the error function and store any error in `lastErr` under the mutex + +The SSE parser in `internal/llamacpp/client.go` uses a `bufio.Scanner` to read `data: ` prefixed lines, stops at `[DONE]`, and propagates scan errors via a pointer. Response bodies are closed exactly once via `sync.Once`. + +### 5. Chat Templates + +llama-server reads `tokenizer.chat_template` from the GGUF file and applies it automatically on the `/v1/chat/completions` endpoint. go-rocm does not implement any template logic. + +## GGUF Metadata Parser + +`internal/gguf/` is a standalone binary metadata reader. It supports GGUF v2 (uint32 tensor/KV counts) and v3 (uint64 counts). + +The parser reads the file header sequentially: + +1. Magic number validation (`0x46554747`, the ASCII string "GGUF" in little-endian) +2. Version field (2 or 3; others return an error) +3. Tensor count and KV count (width depends on version) +4. All KV pairs in sequence + +For each KV pair, the key string is read first, then the value type, then the value. Interesting keys are: +- `general.architecture` — architecture identifier (e.g. `gemma3`, `llama`, `qwen2`) +- `general.name` — human-readable model name +- `general.file_type` — GGML quantisation type code +- `general.size_label` — parameter count label (e.g. `1B`, `8B`) +- Any key with suffix `.context_length` +- Any key with suffix `.block_count` + +Architecture-specific keys like `llama.context_length` are collected into candidate maps and resolved after the architecture is known. Uninteresting keys are skipped without allocation. + +String values are capped at 1 MiB to prevent memory exhaustion from malformed files. `uint64` values for context length and block count are downcast to `uint32` when they fit (some producers write uint64 for these fields). + +The parser reads only the header, not tensor data. Parsing a 5 GB model file takes under 1 ms. 
+ +## VRAM Monitoring + +`GetVRAMInfo()` reads `mem_info_vram_total` and `mem_info_vram_used` from sysfs (`/sys/class/drm/cardN/device/`). It identifies the discrete GPU by selecting the card with the largest VRAM total, which correctly distinguishes the RX 7800 XT (16 GB) from the Ryzen iGPU (2 GB) without hardcoding card numbers. + +`Free` is computed as `Total - Used` with a guard against uint64 underflow: if `Used > Total` due to a non-atomic sysfs read during heavy allocation, `Free` is clamped to zero. + +## Model Discovery + +`DiscoverModels(dir)` globs for `*.gguf` files in a directory, calls `gguf.ReadMetadata()` on each, and returns a `[]ModelInfo` slice. Files that fail to parse are silently skipped. + +## go-inference Interface Contract + +The package implements two interfaces from `forge.lthn.ai/core/go-inference`: + +**inference.Backend**: +- `Name() string` — returns `"rocm"` +- `Available() bool` — /dev/kfd + llama-server present +- `LoadModel(path string, opts ...LoadOption) (TextModel, error)` — spawns llama-server + +**inference.TextModel**: +- `Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]` +- `Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]` +- `ModelType() string` — GGUF architecture string +- `Err() error` — last error from Generate/Chat, mutex-protected +- `Close() error` — SIGTERM/SIGKILL shutdown + +Known limitation: `Err()` is a single shared field. With concurrent Generate/Chat calls on the same model, errors from simultaneous callers can overwrite each other (last writer wins). This is a known constraint of the go-inference interface design, not a bug in this package. + +`StopTokens []int32` from `GenerateConfig` is ignored. llama-server's OpenAI-compatible API accepts stop sequences as strings, not token IDs, and mapping between them requires a tokeniser. No current consumer of go-rocm uses StopTokens. 
+ +## Concurrency and Parallel Slots + +llama-server serialises concurrent requests through its slot system. With the default of one slot, simultaneous calls to `Generate()` on the same model are queued server-side. Aggregate throughput still scales because the GPU is not idle during serialised requests. + +`inference.WithParallelSlots(n)` passes `--parallel N` to llama-server, enabling true parallel inference across N context slots. Each slot maintains its own KV cache, so VRAM usage scales with `parallelSlots * contextLen`. With 4 slots at ctx=2048 on the RX 7800 XT, the additional VRAM cost is approximately 200 MiB for Gemma3-4B. + +## go-inference Ecosystem Position + +``` +go-inference — shared TextModel/Backend interfaces (no deps) + | +go-rocm — AMD ROCm backend (this package) +go-mlx — Apple Metal backend (macOS, CGO, Safetensors) + | +go-ml — scoring engine, wraps both backends transparently + | +go-ai — MCP server + facade, imports go-ml +go-i18n — grammar engine, may use for batch classification +``` + +go-rocm registers itself automatically. go-ml selects the appropriate backend at runtime based on `Available()`. diff --git a/docs/development.md b/docs/development.md new file mode 100644 index 0000000..b04895e --- /dev/null +++ b/docs/development.md @@ -0,0 +1,243 @@ +# go-rocm Development Guide + +## Prerequisites + +### Hardware + +- AMD GPU with ROCm support. Tested hardware: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM) +- Linux, amd64. The package does not build or run on any other platform + +### Operating System + +- Ubuntu 24.04 LTS (recommended; also supported: Ubuntu 22.04.5) +- Kernel 6.10+ recommended for RDNA 3 stability. The homelab currently runs 6.17.0 +- The amdgpu kernel driver must be loaded (`/dev/kfd` must be present) + +### ROCm + +Install ROCm 6.x or later. 
ROCm 7.2.0 is installed on the homelab: + +```bash +sudo apt install rocm-dev rocm-libs +rocm-smi # verify GPU is detected +rocminfo # verify gfx architecture +``` + +Confirm `/dev/kfd` exists and is accessible to your user. Add yourself to the `render` and `video` groups if needed: + +```bash +sudo usermod -aG render,video $USER +``` + +### llama-server + +llama-server must be built from llama.cpp with HIP/ROCm support. The package does not ship or download the binary. + +**Build steps** (from the homelab): + +```bash +git clone https://github.com/ggml-org/llama.cpp +cd llama.cpp + +cmake -B build \ + -DGGML_HIP=ON \ + -DAMDGPU_TARGETS=gfx1100 \ + -DGGML_HIP_ROCWMMA_FATTN=ON \ + -DCMAKE_BUILD_TYPE=Release + +cmake --build build --parallel $(nproc) -t llama-server +``` + +The production binary on the homelab was built from commit `11c325c` (cloned 19 Feb 2026). Install to PATH: + +```bash +sudo cp build/bin/llama-server /usr/local/bin/llama-server +llama-server --version +``` + +Alternatively, set `ROCM_LLAMA_SERVER_PATH` to the full binary path. + +**Architecture note**: The RX 7800 XT is physically gfx1100. Earlier documentation from Virgil stated gfx1101; `rocminfo` on the actual hardware confirms gfx1100. Use `-DAMDGPU_TARGETS=gfx1100`. No `HSA_OVERRIDE_GFX_VERSION` override is required. + +### Go + +Go 1.25.5 or later (as specified in `go.mod`). The module uses Go 1.22+ range-over-integer syntax and Go 1.23 `iter.Seq`. + +### go-inference + +go-rocm depends on `forge.lthn.ai/core/go-inference`. The `go.mod` replaces it with a local path (`../go-inference`). The go-inference directory must be present as a sibling of go-rocm: + +``` +Code/ +├── go-rocm/ +└── go-inference/ +``` + +If checking out go-rocm independently: `go work sync` or adjust the `replace` directive. + +## Running Tests + +### Unit Tests (no GPU required) + +The standard test invocation runs unit tests that do not touch GPU hardware: + +```bash +go test ./... 
+``` + +This covers: +- `server_test.go` — `findLlamaServer`, `freePort`, `serverEnv`, `server.alive()`, dead-server error handling, retry behaviour +- `vram_test.go` — sysfs parsing logic +- `discover_test.go` — model discovery +- `internal/llamacpp/health_test.go` and `client_test.go` — HTTP client and SSE parser +- `internal/gguf/gguf_test.go` — GGUF binary parser + +Some unit tests in `server_test.go` have the `//go:build linux && amd64` constraint and will only run on Linux. They do not require a GPU but do require llama-server to be present in PATH. + +### Integration Tests (GPU required) + +Integration tests are gated behind the `rocm` build tag: + +```bash +go test -tags rocm -v -run TestROCm ./... +``` + +These tests require: +- `/dev/kfd` present +- `llama-server` in PATH or `ROCM_LLAMA_SERVER_PATH` set +- The test model at `/data/lem/gguf/LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf` (SMB mount from M3) + +Each test calls `skipIfNoROCm(t)` and `skipIfNoModel(t)` so they skip cleanly when hardware or the model mount is unavailable. + +**Available integration tests:** + +| Test | What it verifies | +|------|-----------------| +| `TestROCm_LoadAndGenerate` | Full load + Generate, checks architecture from GGUF metadata | +| `TestROCm_Chat` | Multi-turn Chat with chat template applied by llama-server | +| `TestROCm_ContextCancellation` | Context cancel stops iteration mid-stream | +| `TestROCm_GracefulShutdown` | Server survives context cancel; second Generate succeeds | +| `TestROCm_ConcurrentRequests` | Three goroutines calling Generate simultaneously | +| `TestROCm_DiscoverModels` | DiscoverModels returns non-empty result for model directory | + +### Benchmarks (GPU required) + +```bash +go test -tags rocm -bench=. -benchtime=3x ./... +``` + +Benchmarks test three models in sequence (Gemma3-4B, Llama3.1-8B, Qwen2.5-7B). 
They skip if any model file is absent: + +| Benchmark | Metric reported | +|-----------|----------------| +| `BenchmarkDecode` | tok/s for 128-token generation | +| `BenchmarkTTFT` | µs/first-tok (time to first token) | +| `BenchmarkConcurrent` | tok/s-aggregate with 4 goroutines and 4 parallel slots | + +Model load time is excluded from benchmark timing via `b.StopTimer()` / `b.StartTimer()`. VRAM usage is logged after each load via `GetVRAMInfo()`. + +**Reference results (RX 7800 XT, ROCm 7.2.0, ctx=2048, benchtime=3x):** + +Decode speed: + +| Model | tok/s | VRAM Used | +|-------|-------|-----------| +| Gemma3-4B-Q4_K_M | 102.5 | 4724 MiB | +| Llama-3.1-8B-Q4_K_M | 77.1 | 6482 MiB | +| Qwen-2.5-7B-Q4_K_M | 84.4 | 6149 MiB | + +Time to first token: + +| Model | TTFT | +|-------|------| +| Gemma3-4B-Q4_K_M | 13.8 ms | +| Llama-3.1-8B-Q4_K_M | 17.1 ms | +| Qwen-2.5-7B-Q4_K_M | 16.8 ms | + +Concurrent throughput (4 parallel slots, 4 goroutines, 32 tokens each): + +| Model | Aggregate tok/s | vs single-slot | +|-------|----------------|---------------| +| Gemma3-4B-Q4_K_M | 238.9 | 2.3x | +| Llama-3.1-8B-Q4_K_M | 166.2 | 2.2x | +| Qwen-2.5-7B-Q4_K_M | 178.0 | 2.1x | + +## Environment Variables + +| Variable | Default | Purpose | +|----------|---------|---------| +| `ROCM_LLAMA_SERVER_PATH` | PATH lookup | Explicit path to llama-server binary | +| `HIP_VISIBLE_DEVICES` | overridden to `0` | go-rocm always sets this to 0 when spawning llama-server | +| `HSA_OVERRIDE_GFX_VERSION` | unset | Not required; GPU is native gfx1100 | +| `ROCM_MODEL_DIR` | none | Conventional directory for model files (not read by go-rocm itself) | + +`HIP_VISIBLE_DEVICES=0` is set unconditionally by `serverEnv()`, overriding any value in the calling process's environment. This masks the Ryzen 9 9950X's iGPU (Device 1), which otherwise causes llama-server to crash when it attempts to split tensors across the iGPU and dGPU. 
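The per-context VRAM figures in the budget below follow from standard KV-cache arithmetic. A sketch, using assumed Llama-3.1-8B-style shape parameters (32 layers, 8 KV heads, head dimension 128, f16 cache — these numbers are illustrative, not read from the GGUF):

```go
package main

import "fmt"

// kvCacheBytes estimates KV-cache size: 2 tensors (K and V) per layer,
// ctx positions, kvHeads*headDim values per position, bytesPerElem each.
func kvCacheBytes(layers, ctx, kvHeads, headDim, bytesPerElem int) int {
	return 2 * layers * ctx * kvHeads * headDim * bytesPerElem
}

func main() {
	// Assumed 8B-class shape with an f16 KV cache at the default 4096 cap.
	b := kvCacheBytes(32, 4096, 8, 128, 2)
	fmt.Printf("%.2f GiB\n", float64(b)/(1<<30)) // 0.50 GiB
}
```

At the 131072-token native context the same arithmetic gives roughly 16 GiB for the cache alone, which is why the default cap matters.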
## VRAM Budget

With 16 GB VRAM on the RX 7800 XT, the following models fit comfortably:

| Model | Quant | VRAM (model) | Context 4K | Total | Fits? |
|-------|-------|-------------|-----------|-------|-------|
| Qwen3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Gemma3-4B | Q4_K_M | ~3 GB | ~0.3 GB | ~3.3 GB | Yes |
| Llama3-8B | Q4_K_M | ~5 GB | ~0.5 GB | ~5.5 GB | Yes |
| Qwen3-8B | Q8_0 | ~9 GB | ~0.5 GB | ~9.5 GB | Yes |
| Gemma3-12B | Q4_K_M | ~7.5 GB | ~0.8 GB | ~8.3 GB | Yes |
| Gemma3-27B | Q4_K_M | ~16 GB | ~1.5 GB | ~17.5 GB | Tight |
| Llama3-70B | Q4_K_M | ~40 GB | ~2 GB | ~42 GB | No (partial offload) |

The context cap (`min(model_context_length, 4096)` by default) is essential for models like Gemma3-4B and Llama-3.1-8B, which have 131072-token native context. Without the cap, the KV cache allocation alone would exhaust VRAM.

## Test Patterns

Tests use `github.com/stretchr/testify/assert` and `require`. The naming convention from the broader Go ecosystem applies:

- `_Good` suffix — happy path
- `_Bad` suffix — expected error conditions
- `_Ugly` suffix — panic or edge cases

Integration tests use `skipIfNoROCm(t)` and `skipIfNoModel(t)` guards. Never use `t.Fatal` to skip; always use `t.Skip`.

When writing new unit tests that do not need GPU hardware, do not add the `rocm` build tag. The `linux && amd64` constraint is sufficient for tests that exercise Linux-specific code paths.

## Coding Standards

- **Language**: UK English throughout. Colour, organisation, initialise, behaviour — never American spellings
- **Strict types**: use concrete types throughout; avoid `any` except where an interface demands it
- **Error messages**: lower case, no trailing punctuation, prefixed with the package context: `"rocm: ..."`, `"llamacpp: ..."`, `"gguf: ..."`
- **Formatting**: `gofmt` / `goimports`. No exceptions
- **Licence**: EUPL-1.2.
All new files must include the licence header if adding a file header comment + +## Conventional Commits + +Use the conventional commits format: + +``` +type(scope): description + +feat(server): add GPU layer count override via environment variable +fix(gguf): handle uint64 context_length from v3 producers +test(integration): add DiscoverModels test for SMB mount +docs(architecture): update VRAM budget table +``` + +Types: `feat`, `fix`, `test`, `docs`, `refactor`, `perf`, `chore` + +## Co-Authorship + +All commits must include the co-author trailer: + +``` +Co-Authored-By: Virgil +``` + +## Adding a New Backend Feature + +The typical sequence for a new go-rocm feature: + +1. If the feature requires a go-inference interface change (new `LoadOption`, `GenerateOption`, or `TextModel` method), write that change first in go-inference and coordinate with Virgil (the orchestrator) before implementing the consumer side +2. Write unit tests first; most server and client behaviour is testable without GPU hardware +3. If integration testing on the homelab is needed, use the `//go:build rocm` tag +4. Update `docs/architecture.md` if the data flow or component structure changes +5. Record benchmark results in `docs/history.md` under the relevant phase if performance characteristics change materially diff --git a/docs/history.md b/docs/history.md new file mode 100644 index 0000000..02505d1 --- /dev/null +++ b/docs/history.md @@ -0,0 +1,154 @@ +# go-rocm Project History + +## Origin + +go-rocm was created on 19 February 2026 by Virgil (orchestrator) as the AMD GPU backend for the go-inference ecosystem. The sibling package go-mlx provides the same interface on macOS using Apple Metal and direct CGO; go-rocm targets the Linux homelab's AMD Radeon RX 7800 XT. 
The package was built by Charon (test coverage and build agent, running on the Linux homelab) in a single day across five phases (Phase 0 through Phase 4): environment validation, core implementation, robustness, model support, and performance tuning.

---

## Phase 0: Environment Validation (19 Feb 2026)

**Purpose**: Confirm the homelab hardware, ROCm installation, and llama.cpp build before writing any Go code.

**Findings:**

- GPU architecture confirmed as gfx1100 via `rocminfo`. Virgil's initial notes stated gfx1101; the physical hardware is gfx1100. No `HSA_OVERRIDE_GFX_VERSION` override is required.
- ROCm version: 7.2.0 (minimum required is 6.x).
- Kernel: 6.17.0-14-generic.
- llama.cpp built from commit `11c325c` with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON`. Binary installed to `/usr/local/bin/llama-server`.

**Critical discovery: iGPU crash**

The Ryzen 9 9950X has an integrated GPU that ROCm detects as a second device:

- Device 0: RX 7800 XT (gfx1100), 16 GB VRAM
- Device 1: Radeon Graphics iGPU (gfx1100), reports ~100 GB free (system RAM)

llama-server's auto-fit logic splits the model across both devices. Loading tensors to Device 1 triggers `ROCm error: unspecified launch failure` and a core dump. The fix is `HIP_VISIBLE_DEVICES=0`, which must be set unconditionally when spawning llama-server.

**Baseline benchmark (Gemma3-4B-Q4_K_M):**

| Metric | Value |
|--------|-------|
| Prefill speed | 396 tok/s |
| Decode speed | 109 tok/s |
| Time to first token | ~40 ms (16-token prompt) |
| Server startup | ~6 s |
| VRAM used | ~3.4 GB of 16 GB |

---

## Phase 1: Core Implementation (19 Feb 2026)

**Commits**: `1d8d65f`, `9aa7f62`, `3c75677`, `def3167`, `a8c4947`, `0e68d71`

**GPU detection** (`1d8d65f`): `Available()` checks `/dev/kfd` and `findLlamaServer()`. Returns false if either is absent. `findLlamaServer()` checks `ROCM_LLAMA_SERVER_PATH` env var first, then PATH.
+ +**Server lifecycle** (`9aa7f62`): `server.go` implements `startServer()`, `waitReady()`, and `stop()`. Health polling at 100ms intervals with a 60-second startup timeout. Graceful shutdown sends SIGTERM, waits 5 seconds, then SIGKILL. `serverEnv()` filters and overrides `HIP_VISIBLE_DEVICES` using a filter-then-append pattern to avoid duplicate env var entries (last-write-wins is platform-specific). + +**HTTP client** (`3c75677`, `def3167`): `internal/llamacpp/` provides `Client` with `Health()`, `ChatComplete()`, and `Complete()`. Both completion methods return `(iter.Seq[string], func() error)`. The SSE parser reads `data: ` prefixed lines from the response body using a `bufio.Scanner`, stops at `[DONE]`, and propagates I/O errors via a pointer. + +**TextModel implementation** (`a8c4947`): `model.go` wraps the server and client. `Generate()` calls `/v1/completions`; `Chat()` calls `/v1/chat/completions`. Both check `server.alive()` before dispatching and record errors in `lastErr` under a mutex. + +**Integration tests** (`0e68d71`): `TestROCm_LoadAndGenerate`, `TestROCm_Chat`, `TestROCm_ContextCancellation` all pass on the RX 7800 XT using Gemma3-1B. Tests gated behind `//go:build rocm`. + +**Design decisions recorded in FINDINGS.md:** + +- `Token.ID` left as zero; llama-server's streaming API does not return token IDs. No downstream consumer uses the ID field. +- `StopTokens []int32` silently ignored; the llama-server API expects stop sequences as strings, not token IDs. YAGNI. + +--- + +## Phase 2: Robustness (19 Feb 2026) + +**Commits**: `2c4966e`, `c07f37a`, `c50a8e9`, `b7342ec`, `a6e647c`, `501de83`, `954c570` + +**Server crash recovery** (`2c4966e`, `c07f37a`): `server.alive()` reads from the `exited` channel non-blockingly. `Generate()` and `Chat()` return an empty iterator immediately if the server has died, recording the exit error in `lastErr`. 
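The non-blocking liveness check can be sketched with a hypothetical minimal type (not the real `server` struct):

```go
package main

import "fmt"

// server sketches the exited-channel pattern: the channel is closed by
// the goroutine that waits on the llama-server process, so a closed
// channel means the process has exited.
type server struct {
	exited chan struct{}
}

// alive performs a non-blocking read: a closed channel yields
// immediately, an open one falls through to the default case.
func (s *server) alive() bool {
	select {
	case <-s.exited:
		return false
	default:
		return true
	}
}

func main() {
	s := &server{exited: make(chan struct{})}
	fmt.Println(s.alive()) // true: process still running
	close(s.exited)        // simulate process exit
	fmt.Println(s.alive()) // false: channel closed on exit
}
```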
+ +**Port conflict handling** (`c50a8e9`, `b7342ec`): `startServer()` retries up to 3 times with a fresh port on process exit during startup. Timeouts are not retried (a stuck server is a distinct failure mode from a port conflict). + +**Graceful shutdown** (`a6e647c`): Integration test `TestROCm_GracefulShutdown` confirms the server survives a mid-stream context cancel and accepts subsequent Generate calls. Already worked from Phase 1; integration test added to prevent regression. + +**VRAM monitoring** (`501de83`, `954c570`): `GetVRAMInfo()` reads sysfs (`/sys/class/drm/cardN/device/mem_info_vram_*`). Selects the dGPU by highest total VRAM, correctly distinguishing the RX 7800 XT (16 GB) from the Ryzen iGPU (2 GB) without hardcoding card numbers. Uint64 underflow guard: `Free` is clamped to zero if `Used > Total` due to non-atomic sysfs reads. + +**Concurrent requests** (`a6e647c`): Three goroutines calling `Generate()` simultaneously all receive output. llama-server serialises via its default single-slot configuration. No Go-level locking needed on the model for concurrent Generate calls. + +**Known limitation recorded**: `Err()` is a single shared field. Concurrent callers can overwrite each other's errors. This matches the go-inference interface contract and is not a bug. + +--- + +## Phase 3: Model Support (19 Feb 2026) + +**Commits**: `c7c9389`, `af23565`, `2c77f6f` + +**GGUF metadata parser** (`c7c9389`): `internal/gguf/` reads GGUF v2/v3 binary headers. Extracts architecture, name, file type, size label, context length, and block count without reading tensor data. Supports all 13 GGUF value type codes. String length capped at 1 MiB. Handles uint64 values for context_length/block_count (some producers use uint64 instead of uint32). Architecture-specific keys are collected as candidates and resolved after `general.architecture` is known, handling the case where architecture-specific keys appear before the architecture key in the KV stream. 
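The candidate-map resolution can be sketched as follows; the key naming is taken from the GGUF convention described above, the helper itself is illustrative:

```go
package main

import "fmt"

// resolveContextLength looks up an architecture-specific key (e.g.
// "llama.context_length") collected during the KV scan, once
// general.architecture is known. This handles files where
// architecture-specific keys precede the architecture key.
func resolveContextLength(arch string, candidates map[string]uint32) (uint32, bool) {
	v, ok := candidates[arch+".context_length"]
	return v, ok
}

func main() {
	// Candidates collected while scanning, before the architecture was known.
	candidates := map[string]uint32{
		"llama.context_length":  131072,
		"gemma3.context_length": 32768,
	}
	ctx, ok := resolveContextLength("llama", candidates)
	fmt.Println(ctx, ok) // 131072 true
}
```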
+ +**Model discovery** (`af23565`): `DiscoverModels(dir)` globs for `*.gguf` files, parses each via the GGUF parser, and returns `[]ModelInfo`. Unparseable files are skipped silently. + +**LoadModel enrichment** (`2c77f6f`): Replaced filename-based architecture guessing with GGUF metadata. `meta.Architecture` is now set from `general.architecture`, which is more accurate: Mistral-7B-v0.3 correctly reports `"llama"` (not `"mistral"`), and Qwen-2.5-7B correctly reports `"qwen2"`. Context auto-capped at `min(model_context_length, 4096)` when the caller does not specify a context length, preventing VRAM exhaustion on models with 128K+ native context (Gemma3-4B/12B/27B and Llama-3.1-8B all have 131072-token native context). + +**Chat templates**: Confirmed that llama-server reads `tokenizer.chat_template` from GGUF and applies it on `/v1/chat/completions`. No go-rocm code required. + +**Model inventory discovered** (at `/data/lem/gguf/`): + +| Model | Architecture | Quant | Context | +|-------|-------------|-------|---------| +| Gemma3-1B-layered-v2 | gemma3 | Q5_K_M / Q8_0 | 32768 | +| Gemma3-4B | gemma3 | Q4_K_M | 131072 | +| Gemma3-12B | gemma3 | Q4_K_M | 131072 | +| Gemma3-27B | gemma3 | Q4_K_M | 131072 | +| Llama-3.1-8B | llama | Q4_K_M | 131072 | +| Mistral-7B-v0.3 | llama | Q4_K_M | 32768 | +| Qwen-2.5-7B | qwen2 | Q4_K_M | 32768 | + +--- + +## Phase 4: Performance (19 Feb 2026) + +**Commits**: `870ee23` (benchmarks), `3719734` (go-inference: ParallelSlots), `72120bb` (go-rocm: --parallel support) + +**Benchmark suite** (`870ee23`): Three benchmarks gated behind `//go:build rocm`: +- `BenchmarkDecode` — 128-token generation, reports tok/s +- `BenchmarkTTFT` — single-token generation, reports µs/first-tok +- `BenchmarkConcurrent` — 4 goroutines, 4 parallel slots, reports tok/s-aggregate + +All three run across Gemma3-4B, Llama3.1-8B, and Qwen2.5-7B. Model load time is excluded via `b.StopTimer()` / `b.StartTimer()`. 
**Flash attention comparison**: llama-server was benchmarked built with and without `-DGGML_HIP_ROCWMMA_FATTN=ON` at ctx=2048. No significant difference (≤4.4% variation, within noise). rocWMMA flash attention is designed for large context windows where the KV cache dominates. At ctx=2048, standard attention is as fast or faster. Flash attention auto-activates only when beneficial and does not degrade performance at small context sizes. The flag remains enabled in the build configuration.

**Parallel slots** (`3719734`, `72120bb`): `ParallelSlots int` added to go-inference's `LoadConfig`. `inference.WithParallelSlots(n)` passes `--parallel N` to llama-server. Aggregate throughput with 4 slots at ctx=2048:

| Model | Single-slot tok/s | 4-slot aggregate tok/s | Ratio |
|-------|------------------|----------------------|-------|
| Gemma3-4B-Q4_K_M | 102.5 | 238.9 | 2.3x |
| Llama-3.1-8B-Q4_K_M | 77.1 | 166.2 | 2.2x |
| Qwen-2.5-7B-Q4_K_M | 84.4 | 178.0 | 2.1x |

---

## Known Limitations

**Token IDs**: `inference.Token.ID` is always zero. llama-server's OpenAI-compatible streaming API does not return token IDs. Adding token IDs would require `logprobs: true` in the request and additional parsing overhead. No current consumer uses token IDs.

**StopTokens**: `GenerateConfig.StopTokens []int32` is ignored. llama-server's `/v1/completions` and `/v1/chat/completions` endpoints accept stop sequences as strings (`"stop": [...]`), not token IDs. Mapping between them requires a tokeniser that is not available in this package. No current consumer uses StopTokens.

**Err() concurrency**: `rocmModel.Err()` returns the last error from any Generate/Chat call. With multiple concurrent callers, errors can be overwritten. The single `Err() error` method is a go-inference interface constraint, not a go-rocm decision. Per-call error returns would require an interface change in go-inference.
+ +**VRAM reads are non-atomic**: `GetVRAMInfo()` reads `mem_info_vram_total` and `mem_info_vram_used` in two separate sysfs reads. Under heavy VRAM allocation churn, transient inconsistency is possible. `Free` is clamped to zero to prevent uint64 underflow. + +**Model directory**: Models are on an SMB mount (`/data` = `//10.69.69.108/Data`). Integration tests and benchmarks skip when the mount is unavailable. For offline testing, copy a small model (the 1B Q5_K_M is approximately 0.9 GB) to a local path and update the `testModel` constant in `rocm_integration_test.go`. + +**Single-model-per-server**: Each `rocmModel` owns exactly one llama-server subprocess. Loading multiple models simultaneously requires multiple `LoadModel` calls, each consuming its own VRAM share. There is no shared server or model-switching mechanism. + +--- + +## Future Considerations + +**Direct HIP CGO** (Phase 5, unscheduled): Direct HIP CGO bindings would eliminate the HTTP overhead and process boundary. Only worth pursuing if the subprocess approach becomes a measurable bottleneck. Estimated cost: months of implementation to match llama.cpp's model support breadth. + +**vLLM backend** (Phase 5, unscheduled): vLLM supports ROCm and provides better batching semantics for high-throughput scenarios. Would be a parallel subprocess backend alongside llama-server, selectable via configuration. + +**Model-switching**: The current design loads one model per server instance. A pool-based approach could share llama-server instances across model loads, though this would require llama-server to support hot-swapping models (it does not currently). + +**go-i18n integration**: go-i18n may use go-rocm for batch text classification on the Linux homelab once Phase 2 of go-i18n is unblocked. The `WithParallelSlots` option makes the backend well-suited for batch workloads.