docs: initial wiki — environment, architecture, interface contract, models, fleet context

Complete knowledge base for the dedicated go-rocm domain expert:
- Environment: hardware specs, iGPU crash trap, llama-server setup
- Architecture: subprocess design, file layout, coding standards
- Interface-Contract: go-inference types to implement with field mappings
- Models: GGUF inventory, VRAM budget, baseline benchmarks
- Fleet-Context: role, dependencies, communication patterns

Charon, 19 Feb 2026
Claude 2026-02-19 20:07:58 +00:00
commit cfc5aacefd
6 changed files with 411 additions and 0 deletions

Architecture.md (new file, 74 lines)
# Architecture
## How It Works
```
Go process (go-rocm)
├─ Available()   checks /dev/kfd + rocm-smi + llama-server binary
├─ LoadModel()   spawns llama-server subprocess
│    ├─ HIP_VISIBLE_DEVICES=0 (mask iGPU)
│    ├─ --model /path/to/model.gguf
│    ├─ --port <random free port>
│    ├─ --n-gpu-layers <from LoadConfig.GPULayers>
│    ├─ --ctx-size <from LoadConfig.ContextLen>
│    └─ waits for /health endpoint → returns TextModel
├─ Generate()    POST /v1/completions (SSE streaming)
├─ Chat()        POST /v1/chat/completions (SSE streaming)
│    └─ parses: data: {"choices":[{"delta":{"content":"..."}}]}
└─ Close()       SIGTERM → wait for clean exit
```
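The spawn step above can be sketched in Go. This is a minimal sketch for server.go, not the repo's actual API: `buildServerCmd` is an illustrative helper name, and the flag values are the ones documented in this wiki.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildServerCmd assembles the llama-server invocation sketched above.
// The flags are llama-server's own; the helper name is an assumption.
func buildServerCmd(modelPath string, port, gpuLayers, ctxSize int) *exec.Cmd {
	cmd := exec.Command("llama-server",
		"--model", modelPath,
		"--port", fmt.Sprintf("%d", port),
		"--n-gpu-layers", fmt.Sprintf("%d", gpuLayers),
		"--ctx-size", fmt.Sprintf("%d", ctxSize),
	)
	// Mask the iGPU: without this every model load crashes (see Environment).
	cmd.Env = append(os.Environ(), "HIP_VISIBLE_DEVICES=0")
	return cmd
}

func main() {
	cmd := buildServerCmd("/data/lem/gguf/LEK-Gemma3-4B-Q4_K_M.gguf", 8199, 99, 4096)
	fmt.Println(cmd.Args)
}
```

After `cmd.Start()`, server.go would poll the `/health` endpoint until ready, then hand the base URL to the TextModel.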
## File Layout
```
go-rocm/
├── rocm.go              Package doc (exists)
├── register_rocm.go     //go:build linux && amd64 — init() registers backend (exists)
├── rocm_stub.go         //go:build !linux || !amd64 — ROCmAvailable() false (exists)
├── backend.go           inference.Backend: Name(), Available(), LoadModel() (exists, stubs)
├── model.go             inference.TextModel: Generate(), Chat(), Err(), Close() (TODO Phase 1)
├── server.go            llama-server lifecycle: spawn, health poll, shutdown (TODO Phase 1)
├── go.mod               Module with go-inference dependency (exists)
├── CLAUDE.md            Build/arch/standards contract (exists)
├── TODO.md              Phased task queue (exists)
├── FINDINGS.md          Research notes (exists)
└── internal/
    └── llamacpp/
        ├── client.go    OpenAI-compatible HTTP client with SSE parsing (TODO Phase 1)
        └── health.go    Health check + readiness probe (TODO Phase 1)
```
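The SSE parsing that client.go will need (TODO Phase 1) can be sketched as below. The `chunk` struct and `parseSSE` name are assumptions for illustration; the wire format is the OpenAI-compatible `data:` framing shown in the diagram above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// chunk mirrors only the fields needed from each SSE event.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// parseSSE extracts delta content from an OpenAI-style SSE stream.
func parseSSE(stream string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue // skip blank keep-alive lines and comments
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			break // end-of-stream sentinel
		}
		var c chunk
		if err := json.Unmarshal([]byte(payload), &c); err != nil {
			continue // tolerate malformed events
		}
		for _, ch := range c.Choices {
			out = append(out, ch.Delta.Content)
		}
	}
	return out
}

func main() {
	fmt.Println(parseSSE("data: {\"choices\":[{\"delta\":{\"content\":\"Hi\"}}]}\ndata: [DONE]\n")) // prints [Hi]
}
```

The real client would read the HTTP response body incrementally and yield each content fragment as a Token rather than collecting a slice.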
## Design Decisions
### Subprocess over CGO
llama-server (subprocess) was chosen over direct HIP CGO bindings because:
1. **Maturity**: llama-server has millions of users. CGO bindings would start from scratch.
2. **Model support**: 50+ architectures via GGUF. CGO would start with zero.
3. **Isolation**: GPU crash in subprocess doesn't kill the Go process.
4. **Maintenance**: llama.cpp team handles ROCm compat. We just build.
5. **Portability**: Same pattern works for CUDA (NVIDIA) and SYCL (Intel).
Trade-offs:
- ~50ms latency for first token (subprocess startup + model load)
- HTTP overhead vs in-process calls
- Can't share GPU memory between Go and llama-server
### Sibling: go-mlx
go-mlx (macOS) uses direct CGO because MLX is designed for embedding. Both implement the same `go-inference.TextModel` interface, allowing consumers to import either (or both) transparently.
## Coding Standards
- UK English
- Tests: `testify` assert/require
- Conventional commits: `feat:`, `fix:`, `test:`, `docs:`
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
- Licence: EUPL-1.2
- Build tags: `//go:build linux && amd64` for GPU code, stub for other platforms

Environment.md (new file, 87 lines)
# Environment
Validated by Charon, 19 Feb 2026.
## Hardware
| Component | Detail |
|-----------|--------|
| **Machine** | snider-linux (10.69.69.165) |
| **CPU** | AMD Ryzen 9 9950X (32 threads @ 5.7GHz) |
| **GPU (dGPU)** | AMD Radeon RX 7800 XT — gfx1100, RDNA 3, 16GB GDDR6, 60 CUs |
| **GPU (iGPU)** | AMD Radeon Graphics — gfx1100, integrated in CPU |
| **OS** | Ubuntu (Kernel 6.17.0-14-generic) |
| **ROCm** | 7.2.0 |
| **Go** | 1.25.6 |
## iGPU Crash — Critical
The Ryzen 9 9950X iGPU is detected by ROCm as Device 1. It reports ~100GB free memory (actually system RAM, not real VRAM). llama-server's auto-fit logic tries to split model tensors across both devices, causing:
```
ROCm error: unspecified launch failure
in function ggml_backend_cuda_buffer_set_tensor
```
This crashes with a core dump every time.
**Fix**: Always set `HIP_VISIBLE_DEVICES=0` before spawning llama-server:
```go
cmd.Env = append(os.Environ(), "HIP_VISIBLE_DEVICES=0")
```
Without this, **every model load will crash**. This must be in server.go.
## llama-server
| Detail | Value |
|--------|-------|
| **Binary** | `/usr/local/bin/llama-server` |
| **Source** | `/home/claude/llama.cpp/` (commit `11c325c`) |
| **Build flags** | `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release` |
| **Version output** | `ggml_cuda_init: found 2 ROCm devices` (with HIP_VISIBLE_DEVICES=0: 1 device) |
### Rebuild
```bash
cd /home/claude/llama.cpp
git pull
cmake -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1100 \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
sudo cp build/bin/llama-server /usr/local/bin/llama-server
```
### Manual Test
```bash
HIP_VISIBLE_DEVICES=0 llama-server \
--model /data/lem/gguf/LEK-Gemma3-4B-Q4_K_M.gguf \
--port 8199 --n-gpu-layers 99 --ctx-size 4096
# In another terminal:
curl -s http://localhost:8199/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":32,"stream":false}' | jq .
```
## ROCm Verification
```bash
rocm-smi          # GPU status, temp, VRAM usage
rocminfo          # Detailed GPU info (confirms gfx1100)
ls /dev/kfd       # Must exist (ROCm kernel driver)
```
## Environment Variables
| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `HIP_VISIBLE_DEVICES` | **YES** | Must set to `0` | Masks iGPU to prevent crash |
| `ROCM_LLAMA_SERVER_PATH` | No | `llama-server` (PATH) | Custom binary path |
| `ROCM_MODEL_DIR` | No | none | Default model directory |
| `HSA_OVERRIDE_GFX_VERSION` | No | unset | Not needed (native gfx1100) |

Fleet-Context.md (new file, 52 lines)
# Fleet Context
How this repo fits in the wider Core Go agent fleet.
## Your Role
You are the **dedicated go-rocm domain expert**. You own this repo end-to-end.
## The Contract
| File | You Read | You Write |
|------|----------|-----------|
| `CLAUDE.md` | Yes — build commands, architecture, standards | No (Charon/Virgil maintains) |
| `TODO.md` | Yes — pick up tasks in phase order | Yes — mark `[x]` when done |
| `FINDINGS.md` | Yes — previous discoveries | Yes — add new findings |
## Who's Who
| Agent | Where | What They Do |
|-------|-------|-------------|
| **Virgil** | M3 Ultra (macOS) | Framework orchestrator, owns core/go, manages go-inference interfaces |
| **Charon** | snider-linux (this machine) | Linux orchestrator, environment setup, plan review |
| **go-mlx Claude** | M3 Ultra (macOS) | Sibling backend — Metal GPU inference, same TextModel interface |
| **go-i18n Claude** | M3 Ultra (macOS) | Consumer — will use TextModel for batch classification |
| **You** | snider-linux (this machine) | go-rocm implementation |
## Dependencies
```
go-inference (shared interfaces)          ← Virgil manages
  └─ go-rocm (you implement this)
       └─ go-ml (wraps both backends)     ← Virgil creates backend_rocm.go when your API is ready
            └─ go-ai (MCP hub) / go-i18n (classification)
```
## Communication
- **New findings**: Write to `FINDINGS.md` in this repo
- **Fleet-wide knowledge**: Push to the `core/go-agentic` wiki
- **Cross-repo blockers**: Note in `TODO.md` with "Blocked on [package] [phase]"
- **Questions for Virgil**: Note in `FINDINGS.md` with "QUESTION:" prefix
## What You Don't Need to Know
- How go-mlx works internally (different approach, same interface)
- How the fleet delegation pattern works (just follow TODO.md)
- Infrastructure details (Charon handles that)
Focus on implementing the TODO.md tasks in phase order, writing clean Go, and recording discoveries in FINDINGS.md.

Home.md (new file, 29 lines)
# go-rocm
AMD ROCm GPU inference backend for Linux. Implements `inference.Backend` and `inference.TextModel` from `core/go-inference` using llama.cpp's server mode with HIP/ROCm.
## Quick Links
- [[Environment]] — Hardware, ROCm, llama-server setup (validated 19 Feb 2026)
- [[Architecture]] — How it works, design decisions, file layout
- [[Interface-Contract]] — go-inference types this package must implement
- [[Models]] — Available GGUF models and VRAM budget
- [[Fleet-Context]] — How this repo fits in the wider agent fleet
## Status
| Phase | Status | Notes |
|-------|--------|-------|
| Phase 0: Environment | Done (Charon, 19 Feb) | ROCm 7.2, llama-server built, baselines recorded |
| Phase 1: Core Implementation | Pending | GPU detection, server lifecycle, HTTP client, TextModel |
| Phase 2: Robustness | Pending | Crash recovery, graceful shutdown, VRAM monitoring |
| Phase 3: Model Support | Pending | GGUF discovery, chat templates, context sizing |
| Phase 4: Performance | Pending | Benchmarks, flash attention, batch inference |
## Module
```
forge.lthn.ai/core/go-rocm
```
Depends on: `forge.lthn.ai/core/go-inference` (shared interfaces, zero deps)

Interface-Contract.md (new file, 103 lines)
# Interface Contract
go-rocm must implement these interfaces from `forge.lthn.ai/core/go-inference`.
## Backend
```go
type Backend interface {
Name() string // Return "rocm"
LoadModel(path string, opts ...LoadOption) (TextModel, error)
Available() bool // Check hardware + binary
}
```
### Available() checks
1. `/dev/kfd` exists (ROCm kernel driver loaded)
2. `rocm-smi` detects a GPU (optional — may be slow)
3. llama-server binary findable (PATH or `ROCM_LLAMA_SERVER_PATH` env)
## TextModel
```go
type TextModel interface {
Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
ModelType() string // e.g. "gemma3", "qwen3", "llama3"
Err() error // Check after iterator stops
Close() error // SIGTERM llama-server, wait for exit
}
```
### Key behaviours
- `Generate()` and `Chat()` return `iter.Seq[Token]` — Go 1.23+ range-over-func
- `iter.Seq` cannot carry errors — consumers must check `Err()` after the loop
- `context.Context` enables cancellation (close SSE stream, don't kill server)
- `Close()` sends SIGTERM to llama-server subprocess and waits for clean exit
- `ModelType()` should parse from GGUF metadata or llama-server `/props` endpoint
## Token
```go
type Token struct {
ID int32
Text string
}
```
## Message
```go
type Message struct {
Role string // "system", "user", "assistant"
Content string
}
```
## GenerateConfig (via options)
```go
type GenerateConfig struct {
MaxTokens int // Default: 256
Temperature float32 // Default: 0.0 (greedy)
TopK int
TopP float32
StopTokens []int32
RepeatPenalty float32
}
```
Map these to llama-server's OpenAI-compatible API fields:
- `MaxTokens` → `max_tokens`
- `Temperature` → `temperature`
- `TopK` → `top_k` (llama.cpp extension)
- `TopP` → `top_p`
- `RepeatPenalty` → `repeat_penalty` (llama.cpp extension)
## LoadConfig (via options)
```go
type LoadConfig struct {
Backend string // "rocm" (or empty for auto)
ContextLen int // → --ctx-size (0 = model default)
GPULayers int // → --n-gpu-layers (-1 = all)
}
```
## Registration
Already done in `register_rocm.go`:
```go
//go:build linux && amd64

package rocm // package name assumed; see rocm.go for the actual package doc

func init() {
	inference.Register(&rocmBackend{})
}
```
## Source
The full interface code is at `/home/claude/Code/core/go-inference/inference.go` and `options.go`.

Models.md (new file, 66 lines)
# Models
## VRAM Budget
The RX 7800 XT has 16GB GDDR6. Budget:
```
Total VRAM: 16,384 MiB
Model: varies (see table)
KV cache: ~80-500 MiB (depends on context length + layers)
Compute buffer: ~500 MiB
Overhead: ~200 MiB
─────────────────────────────
Available for model: ~15,000 MiB
```
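The budget arithmetic above can be captured in a rough helper; the constants mirror these estimates and `fitsInVRAM` is an illustrative name, not a real allocator check:

```go
package main

import "fmt"

// fitsInVRAM applies the budget above: model + KV cache + fixed
// compute buffer and overhead must fit in the RX 7800 XT's 16GB.
func fitsInVRAM(modelMiB, kvCacheMiB int) bool {
	const totalMiB = 16384  // RX 7800 XT GDDR6
	const computeMiB = 500  // compute buffer estimate
	const overheadMiB = 200 // driver/runtime overhead estimate
	return modelMiB+kvCacheMiB+computeMiB+overheadMiB <= totalMiB
}

func main() {
	fmt.Println(fitsInVRAM(3400, 500))  // Gemma3-4B Q4: prints true
	fmt.Println(fitsInVRAM(16000, 500)) // Gemma3-27B Q4: prints false
}
```

This matches the table below: the 27B Q4 only fits if its KV cache is squeezed well below these estimates, hence "Tight".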
## Available Models
All at `/data/lem/gguf/` (SMB mount from M3 Ultra at `//10.69.69.108/Data`).
| Model | File | Size | VRAM (est) | Fits? | Notes |
|-------|------|------|------------|-------|-------|
| Gemma3-1B Q5 | LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf | ~0.9G | ~1.5G | Yes | Smallest, good for quick tests |
| Gemma3-1B Q8 | LEK-Gemma3-1B-layered-v2-Q8_0.gguf | ~1.4G | ~2G | Yes | Higher quality 1B |
| Gemma3-4B Q4 | LEK-Gemma3-4B-Q4_K_M.gguf | 2.7G | ~3.4G | Yes | **Baseline tested: 109 tok/s** |
| Gemma3-12B Q4 | LEK-Gemma3-12B-Q4_K_M.gguf | ~7.5G | ~9G | Yes | Good quality/speed balance |
| Gemma3-27B Q4 | LEK-Gemma3-27B-Q4_K_M.gguf | ~16G | ~18G | Tight | May need reduced context |
| Llama-3.1-8B Q4 | LEK-Llama-3.1-8B-Q4_K_M.gguf | ~5G | ~6G | Yes | |
| Mistral-7B Q4 | LEK-Mistral-7B-v0.3-Q4_K_M.gguf | ~4G | ~5G | Yes | |
| Qwen-2.5-7B Q4 | LEK-Qwen-2.5-7B-Q4_K_M.gguf | ~4G | ~5G | Yes | |
## Recommended Test Model
Use **LEK-Gemma3-4B-Q4_K_M.gguf** for development/testing:
- Small enough to load fast (~6s)
- Leaves plenty of VRAM headroom
- Has validated baseline: 396 tok/s prefill, 109 tok/s decode
- Chat template works natively in llama-server
For integration tests that need to be fast, use **LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf**.
## SMB Mount Note
Models are on an SMB mount from the M3 Ultra. If the mount is unavailable:
```go
if _, err := os.Stat(modelPath); os.IsNotExist(err) {
t.Skip("model not available (SMB mount down?)")
}
```
For CI, models would need to be copied locally or tests skipped.
## Baseline Benchmarks
Gemma3-4B-Q4_K_M on RX 7800 XT (Charon, 19 Feb 2026):
| Metric | Value |
|--------|-------|
| Prefill | 396 tok/s (2.5ms/tok) |
| Decode | 109 tok/s (9.2ms/tok) |
| Time to first token | ~40ms (16 token prompt) |
| Startup | ~6s (load + warmup) |
| Context | 4096 (model supports 131072) |
| Flash attention | Auto-enabled |
| Concurrent slots | 4 |