commit cfc5aacefdae75950711e133c1c0ad1c4a84b3a7
Author: Claude
Date:   Thu Feb 19 20:07:58 2026 +0000

    docs: initial wiki — environment, architecture, interface contract, models, fleet context

    Complete knowledge base for the dedicated go-rocm domain expert:
    - Environment: hardware specs, iGPU crash trap, llama-server setup
    - Architecture: subprocess design, file layout, coding standards
    - Interface-Contract: go-inference types to implement with field mappings
    - Models: GGUF inventory, VRAM budget, baseline benchmarks
    - Fleet-Context: role, dependencies, communication patterns

    Charon, 19 Feb 2026

diff --git a/Architecture.md b/Architecture.md
new file mode 100644
index 0000000..f8213f7
--- /dev/null
+++ b/Architecture.md
@@ -0,0 +1,74 @@
+# Architecture
+
+## How It Works
+
+```
+Go process (go-rocm)
+ │
+ ├─ Available()   Checks: /dev/kfd + rocm-smi + llama-server binary
+ │
+ ├─ LoadModel()   Spawns llama-server subprocess
+ │    │
+ │    ├─ HIP_VISIBLE_DEVICES=0 (mask iGPU)
+ │    ├─ --model /path/to/model.gguf
+ │    ├─ --port
+ │    ├─ --n-gpu-layers
+ │    ├─ --ctx-size
+ │    └─ Waits for /health endpoint → returns TextModel
+ │
+ ├─ Generate()    POST /v1/completions (SSE streaming)
+ ├─ Chat()        POST /v1/chat/completions (SSE streaming)
+ │    └─ Parses: data: {"choices":[{"delta":{"content":"..."}}]}
+ │
+ └─ Close()       SIGTERM → wait for clean exit
+```
+
+## File Layout
+
+```
+go-rocm/
+├── rocm.go            Package doc (exists)
+├── register_rocm.go   //go:build linux && amd64 — init() registers backend (exists)
+├── rocm_stub.go       //go:build !linux || !amd64 — ROCmAvailable() false (exists)
+├── backend.go         inference.Backend: Name(), Available(), LoadModel() (exists, stubs)
+├── model.go           inference.TextModel: Generate(), Chat(), Err(), Close() (TODO Phase 1)
+├── server.go          llama-server lifecycle: spawn, health poll, shutdown (TODO Phase 1)
+├── go.mod             Module with go-inference dependency (exists)
+├── CLAUDE.md          Build/arch/standards contract (exists)
+├── TODO.md            Phased task queue (exists)
+├── FINDINGS.md        Research notes (exists)
+└── internal/
+    └── llamacpp/
+        ├── client.go  OpenAI-compatible HTTP client with SSE parsing (TODO Phase 1)
+        └── health.go  Health check + readiness probe (TODO Phase 1)
+```
+
+## Design Decisions
+
+### Subprocess over CGO
+
+llama-server (subprocess) was chosen over direct HIP CGO bindings because:
+
+1. **Maturity**: llama-server has millions of users. CGO bindings would start from scratch.
+2. **Model support**: 50+ architectures via GGUF. CGO would start with zero.
+3. **Isolation**: a GPU crash in the subprocess doesn't kill the Go process.
+4. **Maintenance**: the llama.cpp team handles ROCm compatibility. We just build.
+5. **Portability**: the same pattern works for CUDA (NVIDIA) and SYCL (Intel).
+
+Trade-offs:
+- Latency: ~6s one-off subprocess spawn + model load before the first request; ~40ms time to first token thereafter
+- HTTP overhead vs in-process calls
+- Can't share GPU memory between Go and llama-server
+
+### Sibling: go-mlx
+
+go-mlx (macOS) uses direct CGO because MLX is designed for embedding. Both implement the same `go-inference.TextModel` interface, allowing consumers to import either (or both) transparently.
+
+## Coding Standards
+
+- UK English
+- Tests: `testify` assert/require
+- Conventional commits: `feat:`, `fix:`, `test:`, `docs:`
+- Co-Author: `Co-Authored-By: Virgil `
+- Licence: EUPL-1.2
+- Build tags: `//go:build linux && amd64` for GPU code, stub for other platforms

diff --git a/Environment.md b/Environment.md
new file mode 100644
index 0000000..bc76f62
--- /dev/null
+++ b/Environment.md
@@ -0,0 +1,87 @@
+# Environment
+
+Validated by Charon, 19 Feb 2026.
+
+## Hardware
+
+| Component | Detail |
+|-----------|--------|
+| **Machine** | snider-linux (10.69.69.165) |
+| **CPU** | AMD Ryzen 9 9950X (32 threads @ 5.7GHz) |
+| **GPU (dGPU)** | AMD Radeon RX 7800 XT — gfx1101, RDNA 3, 16GB GDDR6, 60 CUs |
+| **GPU (iGPU)** | AMD Radeon Graphics — gfx1036, integrated in CPU |
+| **OS** | Ubuntu (Kernel 6.17.0-14-generic) |
+| **ROCm** | 7.2.0 |
+| **Go** | 1.25.6 |
+
+## iGPU Crash — Critical
+
+The Ryzen 9 9950X iGPU is detected by ROCm as Device 1. It reports ~100GB of free memory (actually shared system RAM, not real VRAM). llama-server's auto-fit logic tries to split model tensors across both devices, causing:
+
+```
+ROCm error: unspecified launch failure
+  in function ggml_backend_cuda_buffer_set_tensor
+```
+
+This crashes with a core dump every time.
+
+**Fix**: Always set `HIP_VISIBLE_DEVICES=0` before spawning llama-server:
+
+```go
+cmd.Env = append(os.Environ(), "HIP_VISIBLE_DEVICES=0")
+```
+
+Without this, **every model load will crash**. This must be in server.go.
+
+## llama-server
+
+| Detail | Value |
+|--------|-------|
+| **Binary** | `/usr/local/bin/llama-server` |
+| **Source** | `/home/claude/llama.cpp/` (commit `11c325c`) |
+| **Build flags** | `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1101 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release` |
+| **Version output** | `ggml_cuda_init: found 2 ROCm devices` (with HIP_VISIBLE_DEVICES=0: 1 device) |
+
+### Rebuild
+
+```bash
+cd /home/claude/llama.cpp
+git pull
+cmake -B build \
+  -DGGML_HIP=ON \
+  -DAMDGPU_TARGETS=gfx1101 \
+  -DGGML_HIP_ROCWMMA_FATTN=ON \
+  -DCMAKE_BUILD_TYPE=Release
+cmake --build build --parallel $(nproc) -t llama-server
+sudo cp build/bin/llama-server /usr/local/bin/llama-server
+```
+
+### Manual Test
+
+```bash
+HIP_VISIBLE_DEVICES=0 llama-server \
+  --model /data/lem/gguf/LEK-Gemma3-4B-Q4_K_M.gguf \
+  --port 8199 --n-gpu-layers 99 --ctx-size 4096
+
+# In another terminal:
+curl -s http://localhost:8199/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":32,"stream":false}' | jq .
+```
+
+## ROCm Verification
+
+```bash
+rocm-smi     # GPU status, temp, VRAM usage
+rocminfo     # Detailed GPU info (confirms gfx1101)
+ls /dev/kfd  # Must exist (ROCm kernel driver)
+```
+
+## Environment Variables
+
+| Variable | Required | Default | Purpose |
+|----------|----------|---------|---------|
+| `HIP_VISIBLE_DEVICES` | **Yes** | Must be set to `0` | Masks the iGPU to prevent the crash above |
+| `ROCM_LLAMA_SERVER_PATH` | No | `llama-server` (PATH) | Custom binary path |
+| `ROCM_MODEL_DIR` | No | none | Default model directory |
+| `HSA_OVERRIDE_GFX_VERSION` | No | unset | Not needed (gfx1101 is supported natively) |

diff --git a/Fleet-Context.md b/Fleet-Context.md
new file mode 100644
index 0000000..2ab48b7
--- /dev/null
+++ b/Fleet-Context.md
@@ -0,0 +1,52 @@
+# Fleet Context
+
+How this repo fits into the wider Core Go agent fleet.
+
+## Your Role
+
+You are the **dedicated go-rocm domain expert**.
+You own this repo end-to-end.
+
+## The Contract
+
+| File | You Read | You Write |
+|------|----------|-----------|
+| `CLAUDE.md` | Yes — build commands, architecture, standards | No (Charon/Virgil maintains) |
+| `TODO.md` | Yes — pick up tasks in phase order | Yes — mark `[x]` when done |
+| `FINDINGS.md` | Yes — previous discoveries | Yes — add new findings |
+
+## Who's Who
+
+| Agent | Where | What They Do |
+|-------|-------|--------------|
+| **Virgil** | M3 Ultra (macOS) | Framework orchestrator, owns core/go, manages go-inference interfaces |
+| **Charon** | snider-linux (this machine) | Linux orchestrator, environment setup, plan review |
+| **go-mlx Claude** | M3 Ultra (macOS) | Sibling backend — Metal GPU inference, same TextModel interface |
+| **go-i18n Claude** | M3 Ultra (macOS) | Consumer — will use TextModel for batch classification |
+| **You** | snider-linux (this machine) | go-rocm implementation |
+
+## Dependencies
+
+```
+go-inference (shared interfaces)   ← Virgil manages
+    ↑
+go-rocm (you implement this)
+    ↑
+go-ml (wraps both backends)        ← Virgil creates backend_rocm.go when your API is ready
+    ↑
+go-ai (MCP hub) / go-i18n (classification)
+```
+
+## Communication
+
+- **New findings**: Write to `FINDINGS.md` in this repo
+- **Fleet-wide knowledge**: Push to the `core/go-agentic` wiki
+- **Cross-repo blockers**: Note in `TODO.md` with "Blocked on [package] [phase]"
+- **Questions for Virgil**: Note in `FINDINGS.md` with a "QUESTION:" prefix
+
+## What You Don't Need to Know
+
+- How go-mlx works internally (different approach, same interface)
+- How the fleet delegation pattern works (just follow TODO.md)
+- Infrastructure details (Charon handles that)
+
+Focus on implementing the TODO.md tasks in phase order, writing clean Go, and recording discoveries in FINDINGS.md.

diff --git a/Home.md b/Home.md
new file mode 100644
index 0000000..c827086
--- /dev/null
+++ b/Home.md
@@ -0,0 +1,29 @@
+# go-rocm
+
+AMD ROCm GPU inference backend for Linux.
+Implements `inference.Backend` and `inference.TextModel` from `core/go-inference` using llama.cpp's server mode with HIP/ROCm.
+
+## Quick Links
+
+- [[Environment]] — Hardware, ROCm, llama-server setup (validated 19 Feb 2026)
+- [[Architecture]] — How it works, design decisions, file layout
+- [[Interface-Contract]] — go-inference types this package must implement
+- [[Models]] — Available GGUF models and VRAM budget
+- [[Fleet-Context]] — How this repo fits into the wider agent fleet
+
+## Status
+
+| Phase | Status | Notes |
+|-------|--------|-------|
+| Phase 0: Environment | Done (Charon, 19 Feb) | ROCm 7.2, llama-server built, baselines recorded |
+| Phase 1: Core Implementation | Pending | GPU detection, server lifecycle, HTTP client, TextModel |
+| Phase 2: Robustness | Pending | Crash recovery, graceful shutdown, VRAM monitoring |
+| Phase 3: Model Support | Pending | GGUF discovery, chat templates, context sizing |
+| Phase 4: Performance | Pending | Benchmarks, flash attention, batch inference |
+
+## Module
+
+```
+forge.lthn.ai/core/go-rocm
+```
+
+Depends on: `forge.lthn.ai/core/go-inference` (shared interfaces, zero deps)

diff --git a/Interface-Contract.md b/Interface-Contract.md
new file mode 100644
index 0000000..2c14bb8
--- /dev/null
+++ b/Interface-Contract.md
@@ -0,0 +1,103 @@
+# Interface Contract
+
+go-rocm must implement these interfaces from `forge.lthn.ai/core/go-inference`.
+
+## Backend
+
+```go
+type Backend interface {
+    Name() string // Return "rocm"
+    LoadModel(path string, opts ...LoadOption) (TextModel, error)
+    Available() bool // Check hardware + binary
+}
+```
+
+### Available() checks
+
+1. `/dev/kfd` exists (ROCm kernel driver loaded)
+2. `rocm-smi` detects a GPU (optional — may be slow)
+3. llama-server binary is findable (PATH or the `ROCM_LLAMA_SERVER_PATH` env var)
+
+## TextModel
+
+```go
+type TextModel interface {
+    Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
+    Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
+    ModelType() string // e.g. "gemma3", "qwen3", "llama3"
+    Err() error        // Check after the iterator stops
+    Close() error      // SIGTERM llama-server, wait for exit
+}
+```
+
+### Key behaviours
+
+- `Generate()` and `Chat()` return `iter.Seq[Token]` — Go 1.23+ range-over-func
+- `iter.Seq` cannot carry errors — consumers must check `Err()` after the loop
+- `context.Context` enables cancellation (close the SSE stream, don't kill the server)
+- `Close()` sends SIGTERM to the llama-server subprocess and waits for a clean exit
+- `ModelType()` should be parsed from GGUF metadata or llama-server's `/props` endpoint
+
+## Token
+
+```go
+type Token struct {
+    ID   int32
+    Text string
+}
+```
+
+## Message
+
+```go
+type Message struct {
+    Role    string // "system", "user", "assistant"
+    Content string
+}
+```
+
+## GenerateConfig (via options)
+
+```go
+type GenerateConfig struct {
+    MaxTokens     int     // Default: 256
+    Temperature   float32 // Default: 0.0 (greedy)
+    TopK          int
+    TopP          float32
+    StopTokens    []int32
+    RepeatPenalty float32
+}
+```
+
+Map these to llama-server's OpenAI-compatible API fields:
+
+- `MaxTokens` → `max_tokens`
+- `Temperature` → `temperature`
+- `TopK` → `top_k` (llama.cpp extension)
+- `TopP` → `top_p`
+- `RepeatPenalty` → `repeat_penalty` (llama.cpp extension)
+
+## LoadConfig (via options)
+
+```go
+type LoadConfig struct {
+    Backend    string // "rocm" (or empty for auto)
+    ContextLen int    // → --ctx-size (0 = model default)
+    GPULayers  int    // → --n-gpu-layers (-1 = all)
+}
+```
+
+## Registration
+
+Already done in `register_rocm.go`:
+
+```go
+//go:build linux && amd64
+
+func init() {
+    inference.Register(&rocmBackend{})
+}
+```
+
+## Source
+
+The full interface code is at
+`/home/claude/Code/core/go-inference/inference.go` and `options.go`.

diff --git a/Models.md b/Models.md
new file mode 100644
index 0000000..e7e8c62
--- /dev/null
+++ b/Models.md
@@ -0,0 +1,66 @@
+# Models
+
+## VRAM Budget
+
+The RX 7800 XT has 16GB GDDR6. Budget:
+
+```
+Total VRAM:           16,384 MiB
+Model:                varies (see table)
+KV cache:             ~80-500 MiB (depends on context length + layers)
+Compute buffer:       ~500 MiB
+Overhead:             ~200 MiB
+─────────────────────────────
+Available for model:  ~15,000 MiB
+```
+
+## Available Models
+
+All at `/data/lem/gguf/` (SMB mount from the M3 Ultra at `//10.69.69.108/Data`).
+
+| Model | File | Size | VRAM (est) | Fits? | Notes |
+|-------|------|------|------------|-------|-------|
+| Gemma3-1B Q5 | LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf | ~0.9G | ~1.5G | Yes | Smallest, good for quick tests |
+| Gemma3-1B Q8 | LEK-Gemma3-1B-layered-v2-Q8_0.gguf | ~1.4G | ~2G | Yes | Higher quality 1B |
+| Gemma3-4B Q4 | LEK-Gemma3-4B-Q4_K_M.gguf | 2.7G | ~3.4G | Yes | **Baseline tested: 109 tok/s** |
+| Gemma3-12B Q4 | LEK-Gemma3-12B-Q4_K_M.gguf | ~7.5G | ~9G | Yes | Good quality/speed balance |
+| Gemma3-27B Q4 | LEK-Gemma3-27B-Q4_K_M.gguf | ~16G | ~18G | No | Exceeds 16GB — needs partial CPU offload and/or reduced context |
+| Llama-3.1-8B Q4 | LEK-Llama-3.1-8B-Q4_K_M.gguf | ~5G | ~6G | Yes | |
+| Mistral-7B Q4 | LEK-Mistral-7B-v0.3-Q4_K_M.gguf | ~4G | ~5G | Yes | |
+| Qwen-2.5-7B Q4 | LEK-Qwen-2.5-7B-Q4_K_M.gguf | ~4G | ~5G | Yes | |
+
+## Recommended Test Model
+
+Use **LEK-Gemma3-4B-Q4_K_M.gguf** for development/testing:
+
+- Small enough to load fast (~6s)
+- Leaves plenty of VRAM headroom
+- Has a validated baseline: 396 tok/s prefill, 109 tok/s decode
+- Chat template works natively in llama-server
+
+For integration tests that need to be fast, use **LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf**.
+
+## SMB Mount Note
+
+Models are on an SMB mount from the M3 Ultra.
+If the mount is unavailable:
+
+```go
+if _, err := os.Stat(modelPath); os.IsNotExist(err) {
+    t.Skip("model not available (SMB mount down?)")
+}
+```
+
+For CI, models would need to be copied locally or tests skipped.
+
+## Baseline Benchmarks
+
+Gemma3-4B-Q4_K_M on the RX 7800 XT (Charon, 19 Feb 2026):
+
+| Metric | Value |
+|--------|-------|
+| Prefill | 396 tok/s (2.5ms/tok) |
+| Decode | 109 tok/s (9.2ms/tok) |
+| Time to first token | ~40ms (16-token prompt) |
+| Startup | ~6s (load + warmup) |
+| Context | 4096 (model supports 131072) |
+| Flash attention | Auto-enabled |
+| Concurrent slots | 4 |