docs: initial wiki — environment, architecture, interface contract, models, fleet context
Complete knowledge base for the dedicated go-rocm domain expert:

- Environment: hardware specs, iGPU crash trap, llama-server setup
- Architecture: subprocess design, file layout, coding standards
- Interface-Contract: go-inference types to implement with field mappings
- Models: GGUF inventory, VRAM budget, baseline benchmarks
- Fleet-Context: role, dependencies, communication patterns

Charon, 19 Feb 2026
commit cfc5aacefd — 6 changed files with 411 additions and 0 deletions
**Architecture.md** — new file (+74 lines)
# Architecture

## How It Works

```
Go process (go-rocm)
│
├─ Available()   Checks: /dev/kfd + rocm-smi + llama-server binary
│
├─ LoadModel()   Spawns llama-server subprocess
│    │
│    ├─ HIP_VISIBLE_DEVICES=0 (mask iGPU)
│    ├─ --model /path/to/model.gguf
│    ├─ --port <random free port>
│    ├─ --n-gpu-layers <from LoadConfig.GPULayers>
│    ├─ --ctx-size <from LoadConfig.ContextLen>
│    └─ Waits for /health endpoint → returns TextModel
│
├─ Generate()    POST /v1/completions (SSE streaming)
├─ Chat()        POST /v1/chat/completions (SSE streaming)
│    └─ Parses: data: {"choices":[{"delta":{"content":"..."}}]}
│
└─ Close()       SIGTERM → wait for clean exit
```
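
The spawn path in the diagram can be sketched in Go. This is a minimal sketch, not the implementation: `freePort` and `serverCmd` are hypothetical names, and the real `server.go` will also start the process and poll `/health`.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
)

// freePort reserves an ephemeral port by binding to :0 and releasing it,
// matching the "--port <random free port>" step above.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// serverCmd builds (but does not start) the llama-server invocation.
func serverCmd(modelPath string, port, gpuLayers, ctxSize int) *exec.Cmd {
	cmd := exec.Command("llama-server",
		"--model", modelPath,
		"--port", fmt.Sprint(port),
		"--n-gpu-layers", fmt.Sprint(gpuLayers),
		"--ctx-size", fmt.Sprint(ctxSize),
	)
	// Critical: mask the iGPU, or every model load crashes (see Environment).
	cmd.Env = append(os.Environ(), "HIP_VISIBLE_DEVICES=0")
	return cmd
}

func main() {
	port, err := freePort()
	if err != nil {
		panic(err)
	}
	cmd := serverCmd("/path/to/model.gguf", port, -1, 4096)
	fmt.Println(cmd.Args[0], cmd.Env[len(cmd.Env)-1])
}
```

Binding to `:0` and releasing leaves a small race window before llama-server rebinds; in practice it is a common, good-enough pattern for picking a free local port.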

## File Layout

```
go-rocm/
├── rocm.go              Package doc (exists)
├── register_rocm.go     //go:build linux && amd64 — init() registers backend (exists)
├── rocm_stub.go         //go:build !linux || !amd64 — ROCmAvailable() false (exists)
├── backend.go           inference.Backend: Name(), Available(), LoadModel() (exists, stubs)
├── model.go             inference.TextModel: Generate(), Chat(), Err(), Close() (TODO Phase 1)
├── server.go            llama-server lifecycle: spawn, health poll, shutdown (TODO Phase 1)
├── go.mod               Module with go-inference dependency (exists)
├── CLAUDE.md            Build/arch/standards contract (exists)
├── TODO.md              Phased task queue (exists)
├── FINDINGS.md          Research notes (exists)
└── internal/
    └── llamacpp/
        ├── client.go    OpenAI-compatible HTTP client with SSE parsing (TODO Phase 1)
        └── health.go    Health check + readiness probe (TODO Phase 1)
```
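
The SSE parsing that `internal/llamacpp/client.go` will need can be sketched as a pure function. A sketch only — `parseSSELine` and the struct are illustrative names, and the real client must also handle `data: [DONE]`, `finish_reason`, and multi-choice responses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sseChunk mirrors the delta payload llama-server streams on
// /v1/chat/completions (only the fields we read).
type sseChunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// parseSSELine extracts the content delta from one "data: {...}" line.
// Returns false for non-data lines, the [DONE] sentinel, or bad JSON.
func parseSSELine(line string) (string, bool) {
	payload, ok := strings.CutPrefix(line, "data: ")
	if !ok || payload == "[DONE]" {
		return "", false
	}
	var c sseChunk
	if err := json.Unmarshal([]byte(payload), &c); err != nil || len(c.Choices) == 0 {
		return "", false
	}
	return c.Choices[0].Delta.Content, true
}

func main() {
	text, ok := parseSSELine(`data: {"choices":[{"delta":{"content":"Hi"}}]}`)
	fmt.Println(text, ok) // Hi true
}
```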

## Design Decisions

### Subprocess over CGO

llama-server (subprocess) was chosen over direct HIP CGO bindings because:

1. **Maturity**: llama-server has millions of users; CGO bindings would start from scratch.
2. **Model support**: 50+ architectures via GGUF; CGO would start with zero.
3. **Isolation**: a GPU crash in the subprocess doesn't kill the Go process.
4. **Maintenance**: the llama.cpp team handles ROCm compatibility; we just build.
5. **Portability**: the same pattern works for CUDA (NVIDIA) and SYCL (Intel).

Trade-offs:

- Extra first-token latency (~50ms per request), plus a one-off subprocess startup and model load
- HTTP overhead vs in-process calls
- Can't share GPU memory between Go and llama-server

### Sibling: go-mlx

go-mlx (macOS) uses direct CGO because MLX is designed for embedding. Both implement the same `go-inference.TextModel` interface, so consumers can import either (or both) transparently.

## Coding Standards

- UK English
- Tests: `testify` assert/require
- Conventional commits: `feat:`, `fix:`, `test:`, `docs:`
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
- Licence: EUPL-1.2
- Build tags: `//go:build linux && amd64` for GPU code, stub for other platforms
**Environment.md** — new file (+87 lines)
# Environment

Validated by Charon, 19 Feb 2026.

## Hardware

| Component | Detail |
|-----------|--------|
| **Machine** | snider-linux (10.69.69.165) |
| **CPU** | AMD Ryzen 9 9950X (32 threads @ 5.7GHz) |
| **GPU (dGPU)** | AMD Radeon RX 7800 XT — gfx1100, RDNA 3, 16GB GDDR6, 60 CUs |
| **GPU (iGPU)** | AMD Radeon Graphics — integrated in CPU |
| **OS** | Ubuntu (kernel 6.17.0-14-generic) |
| **ROCm** | 7.2.0 |
| **Go** | 1.25.6 |

## iGPU Crash — Critical

The Ryzen 9 9950X iGPU is detected by ROCm as Device 1. It reports ~100GB of free memory (actually system RAM, not real VRAM). llama-server's auto-fit logic tries to split model tensors across both devices, causing:

```
ROCm error: unspecified launch failure
  in function ggml_backend_cuda_buffer_set_tensor
```

This crashes with a core dump every time.

**Fix**: always set `HIP_VISIBLE_DEVICES=0` before spawning llama-server:

```go
cmd.Env = append(os.Environ(), "HIP_VISIBLE_DEVICES=0")
```

Without this, **every model load will crash**. This must be in server.go.

## llama-server

| Detail | Value |
|--------|-------|
| **Binary** | `/usr/local/bin/llama-server` |
| **Source** | `/home/claude/llama.cpp/` (commit `11c325c`) |
| **Build flags** | `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release` |
| **Version output** | `ggml_cuda_init: found 2 ROCm devices` (with `HIP_VISIBLE_DEVICES=0`: 1 device) |

### Rebuild

```bash
cd /home/claude/llama.cpp
git pull
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1100 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
sudo cp build/bin/llama-server /usr/local/bin/llama-server
```

### Manual Test

```bash
HIP_VISIBLE_DEVICES=0 llama-server \
  --model /data/lem/gguf/LEK-Gemma3-4B-Q4_K_M.gguf \
  --port 8199 --n-gpu-layers 99 --ctx-size 4096

# In another terminal:
curl -s http://localhost:8199/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":32,"stream":false}' | jq .
```

## ROCm Verification

```bash
rocm-smi        # GPU status, temp, VRAM usage
rocminfo        # Detailed GPU info (confirms gfx1100)
ls -l /dev/kfd  # Must exist (ROCm kernel driver)
```

## Environment Variables

| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `HIP_VISIBLE_DEVICES` | **YES** | Must set to `0` | Masks iGPU to prevent crash |
| `ROCM_LLAMA_SERVER_PATH` | No | `llama-server` (PATH) | Custom binary path |
| `ROCM_MODEL_DIR` | No | none | Default model directory |
| `HSA_OVERRIDE_GFX_VERSION` | No | unset | Not needed (native gfx1100) |
**Fleet-Context.md** — new file (+52 lines)
# Fleet Context

How this repo fits into the wider Core Go agent fleet.

## Your Role

You are the **dedicated go-rocm domain expert**. You own this repo end-to-end.

## The Contract

| File | You Read | You Write |
|------|----------|-----------|
| `CLAUDE.md` | Yes — build commands, architecture, standards | No (Charon/Virgil maintain) |
| `TODO.md` | Yes — pick up tasks in phase order | Yes — mark `[x]` when done |
| `FINDINGS.md` | Yes — previous discoveries | Yes — add new findings |

## Who's Who

| Agent | Where | What They Do |
|-------|-------|--------------|
| **Virgil** | M3 Ultra (macOS) | Framework orchestrator, owns core/go, manages go-inference interfaces |
| **Charon** | snider-linux (this machine) | Linux orchestrator, environment setup, plan review |
| **go-mlx Claude** | M3 Ultra (macOS) | Sibling backend — Metal GPU inference, same TextModel interface |
| **go-i18n Claude** | M3 Ultra (macOS) | Consumer — will use TextModel for batch classification |
| **You** | snider-linux (this machine) | go-rocm implementation |

## Dependencies

```
go-inference (shared interfaces)   ← Virgil manages
        ↑
go-rocm (you implement this)
        ↑
go-ml (wraps both backends)        ← Virgil creates backend_rocm.go when your API is ready
        ↑
go-ai (MCP hub) / go-i18n (classification)
```

## Communication

- **New findings**: write to `FINDINGS.md` in this repo
- **Fleet-wide knowledge**: push to the `core/go-agentic` wiki
- **Cross-repo blockers**: note in `TODO.md` with "Blocked on [package] [phase]"
- **Questions for Virgil**: note in `FINDINGS.md` with a "QUESTION:" prefix

## What You Don't Need to Know

- How go-mlx works internally (different approach, same interface)
- How the fleet delegation pattern works (just follow TODO.md)
- Infrastructure details (Charon handles that)

Focus on implementing the TODO.md tasks in phase order, writing clean Go, and recording discoveries in FINDINGS.md.
**Home.md** — new file (+29 lines)
# go-rocm

AMD ROCm GPU inference backend for Linux. Implements `inference.Backend` and `inference.TextModel` from `core/go-inference` using llama.cpp's server mode with HIP/ROCm.

## Quick Links

- [[Environment]] — Hardware, ROCm, llama-server setup (validated 19 Feb 2026)
- [[Architecture]] — How it works, design decisions, file layout
- [[Interface-Contract]] — go-inference types this package must implement
- [[Models]] — Available GGUF models and VRAM budget
- [[Fleet-Context]] — How this repo fits in the wider agent fleet

## Status

| Phase | Status | Notes |
|-------|--------|-------|
| Phase 0: Environment | Done (Charon, 19 Feb) | ROCm 7.2, llama-server built, baselines recorded |
| Phase 1: Core Implementation | Pending | GPU detection, server lifecycle, HTTP client, TextModel |
| Phase 2: Robustness | Pending | Crash recovery, graceful shutdown, VRAM monitoring |
| Phase 3: Model Support | Pending | GGUF discovery, chat templates, context sizing |
| Phase 4: Performance | Pending | Benchmarks, flash attention, batch inference |

## Module

```
forge.lthn.ai/core/go-rocm
```

Depends on: `forge.lthn.ai/core/go-inference` (shared interfaces, zero deps)
**Interface-Contract.md** — new file (+103 lines)
# Interface Contract

go-rocm must implement these interfaces from `forge.lthn.ai/core/go-inference`.

## Backend

```go
type Backend interface {
	Name() string // Return "rocm"
	LoadModel(path string, opts ...LoadOption) (TextModel, error)
	Available() bool // Check hardware + binary
}
```

### Available() checks

1. `/dev/kfd` exists (ROCm kernel driver loaded)
2. `rocm-smi` detects a GPU (optional — may be slow)
3. llama-server binary findable (PATH or `ROCM_LLAMA_SERVER_PATH` env)

## TextModel

```go
type TextModel interface {
	Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
	Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
	ModelType() string // e.g. "gemma3", "qwen3", "llama3"
	Err() error        // Check after iterator stops
	Close() error      // SIGTERM llama-server, wait for exit
}
```

### Key behaviours

- `Generate()` and `Chat()` return `iter.Seq[Token]` — Go 1.23+ range-over-func
- `iter.Seq` cannot carry errors — consumers must check `Err()` after the loop
- `context.Context` enables cancellation (close the SSE stream, don't kill the server)
- `Close()` sends SIGTERM to the llama-server subprocess and waits for a clean exit
- `ModelType()` should parse from GGUF metadata or the llama-server `/props` endpoint

## Token

```go
type Token struct {
	ID   int32
	Text string
}
```

## Message

```go
type Message struct {
	Role    string // "system", "user", "assistant"
	Content string
}
```

## GenerateConfig (via options)

```go
type GenerateConfig struct {
	MaxTokens     int     // Default: 256
	Temperature   float32 // Default: 0.0 (greedy)
	TopK          int
	TopP          float32
	StopTokens    []int32
	RepeatPenalty float32
}
```

Map these to llama-server's OpenAI-compatible API fields:

- `MaxTokens` → `max_tokens`
- `Temperature` → `temperature`
- `TopK` → `top_k` (llama.cpp extension)
- `TopP` → `top_p`
- `RepeatPenalty` → `repeat_penalty` (llama.cpp extension)

## LoadConfig (via options)

```go
type LoadConfig struct {
	Backend    string // "rocm" (or empty for auto)
	ContextLen int    // → --ctx-size (0 = model default)
	GPULayers  int    // → --n-gpu-layers (-1 = all)
}
```

## Registration

Already done in `register_rocm.go`:

```go
//go:build linux && amd64

func init() {
	inference.Register(&rocmBackend{})
}
```

## Source

The full interface code is at `/home/claude/Code/core/go-inference/inference.go` and `options.go`.
**Models.md** — new file (+66 lines)
# Models

## VRAM Budget

The RX 7800 XT has 16GB GDDR6. Budget:

```
Total VRAM:           16,384 MiB
Model:                varies (see table)
KV cache:             ~80-500 MiB (depends on context length + layers)
Compute buffer:       ~500 MiB
Overhead:             ~200 MiB
─────────────────────────────
Available for model:  ~15,000 MiB
```
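
The budget above reduces to a quick arithmetic check. The constants mirror this page's estimates; `fitsVRAM` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// fitsVRAM applies the budget: model weights + KV cache + compute buffer
// + overhead must stay within total VRAM. All values are in MiB.
func fitsVRAM(modelMiB, kvCacheMiB int) bool {
	const (
		totalMiB    = 16384
		computeMiB  = 500
		overheadMiB = 200
	)
	return modelMiB+kvCacheMiB+computeMiB+overheadMiB <= totalMiB
}

func main() {
	fmt.Println(fitsVRAM(2765, 500))  // Gemma3-4B Q4 (~2.7G): true
	fmt.Println(fitsVRAM(16384, 500)) // Gemma3-27B Q4 (~16G): false
}
```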

## Available Models

All at `/data/lem/gguf/` (SMB mount from M3 Ultra at `//10.69.69.108/Data`).

| Model | File | Size | VRAM (est) | Fits? | Notes |
|-------|------|------|------------|-------|-------|
| Gemma3-1B Q5 | LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf | ~0.9G | ~1.5G | Yes | Smallest, good for quick tests |
| Gemma3-1B Q8 | LEK-Gemma3-1B-layered-v2-Q8_0.gguf | ~1.4G | ~2G | Yes | Higher quality 1B |
| Gemma3-4B Q4 | LEK-Gemma3-4B-Q4_K_M.gguf | 2.7G | ~3.4G | Yes | **Baseline tested: 109 tok/s** |
| Gemma3-12B Q4 | LEK-Gemma3-12B-Q4_K_M.gguf | ~7.5G | ~9G | Yes | Good quality/speed balance |
| Gemma3-27B Q4 | LEK-Gemma3-27B-Q4_K_M.gguf | ~16G | ~18G | No | Estimate exceeds 16GB — needs reduced context or partial CPU offload |
| Llama-3.1-8B Q4 | LEK-Llama-3.1-8B-Q4_K_M.gguf | ~5G | ~6G | Yes | |
| Mistral-7B Q4 | LEK-Mistral-7B-v0.3-Q4_K_M.gguf | ~4G | ~5G | Yes | |
| Qwen-2.5-7B Q4 | LEK-Qwen-2.5-7B-Q4_K_M.gguf | ~4G | ~5G | Yes | |

## Recommended Test Model

Use **LEK-Gemma3-4B-Q4_K_M.gguf** for development/testing:

- Small enough to load fast (~6s)
- Leaves plenty of VRAM headroom
- Has a validated baseline: 396 tok/s prefill, 109 tok/s decode
- Chat template works natively in llama-server

For integration tests that need to be fast, use **LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf**.

## SMB Mount Note

Models are on an SMB mount from the M3 Ultra. If the mount is unavailable:

```go
if _, err := os.Stat(modelPath); os.IsNotExist(err) {
	t.Skip("model not available (SMB mount down?)")
}
```

For CI, models would need to be copied locally or tests skipped.

## Baseline Benchmarks

Gemma3-4B-Q4_K_M on RX 7800 XT (Charon, 19 Feb 2026):

| Metric | Value |
|--------|-------|
| Prefill | 396 tok/s (2.5ms/tok) |
| Decode | 109 tok/s (9.2ms/tok) |
| Time to first token | ~40ms (16-token prompt) |
| Startup | ~6s (load + warmup) |
| Context | 4096 (model supports 131072) |
| Flash attention | Auto-enabled |
| Concurrent slots | 4 |