docs: initial wiki — environment, architecture, interface contract, models, fleet context
Complete knowledge base for the dedicated go-rocm domain expert:

- Environment: hardware specs, iGPU crash trap, llama-server setup
- Architecture: subprocess design, file layout, coding standards
- Interface-Contract: go-inference types to implement with field mappings
- Models: GGUF inventory, VRAM budget, baseline benchmarks
- Fleet-Context: role, dependencies, communication patterns

Charon, 19 Feb 2026
commit cfc5aacefd — 6 changed files with 411 additions and 0 deletions
**Architecture.md** — new file (+74 lines)
# Architecture

## How It Works

```
Go process (go-rocm)
│
├─ Available()   Checks: /dev/kfd + rocm-smi + llama-server binary
│
├─ LoadModel()   Spawns llama-server subprocess
│    │
│    ├─ HIP_VISIBLE_DEVICES=0 (mask iGPU)
│    ├─ --model /path/to/model.gguf
│    ├─ --port <random free port>
│    ├─ --n-gpu-layers <from LoadConfig.GPULayers>
│    ├─ --ctx-size <from LoadConfig.ContextLen>
│    └─ Waits for /health endpoint → returns TextModel
│
├─ Generate()    POST /v1/completions (SSE streaming)
├─ Chat()        POST /v1/chat/completions (SSE streaming)
│    └─ Parses: data: {"choices":[{"delta":{"content":"..."}}]}
│
└─ Close()       SIGTERM → wait for clean exit
```
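
The spawn path in the diagram can be sketched in Go. This is a minimal sketch, not the implementation: `freePort` and `serverCmd` are hypothetical names, and the real `server.go` will also start the process and poll `/health`.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
)

// freePort reserves an ephemeral port by binding to :0 and releasing it,
// matching the "--port <random free port>" step above.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// serverCmd builds (but does not start) the llama-server invocation.
func serverCmd(modelPath string, port, gpuLayers, ctxSize int) *exec.Cmd {
	cmd := exec.Command("llama-server",
		"--model", modelPath,
		"--port", fmt.Sprint(port),
		"--n-gpu-layers", fmt.Sprint(gpuLayers),
		"--ctx-size", fmt.Sprint(ctxSize),
	)
	// Critical: mask the iGPU, or every model load crashes (see Environment).
	cmd.Env = append(os.Environ(), "HIP_VISIBLE_DEVICES=0")
	return cmd
}

func main() {
	port, err := freePort()
	if err != nil {
		panic(err)
	}
	cmd := serverCmd("/path/to/model.gguf", port, -1, 4096)
	fmt.Println(cmd.Args[0], cmd.Env[len(cmd.Env)-1])
}
```

Binding to `:0` and releasing leaves a small race window before llama-server rebinds; in practice it is a common, good-enough pattern for picking a free local port.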

## File Layout

```
go-rocm/
├── rocm.go              Package doc (exists)
├── register_rocm.go     //go:build linux && amd64 — init() registers backend (exists)
├── rocm_stub.go         //go:build !linux || !amd64 — ROCmAvailable() false (exists)
├── backend.go           inference.Backend: Name(), Available(), LoadModel() (exists, stubs)
├── model.go             inference.TextModel: Generate(), Chat(), Err(), Close() (TODO Phase 1)
├── server.go            llama-server lifecycle: spawn, health poll, shutdown (TODO Phase 1)
├── go.mod               Module with go-inference dependency (exists)
├── CLAUDE.md            Build/arch/standards contract (exists)
├── TODO.md              Phased task queue (exists)
├── FINDINGS.md          Research notes (exists)
└── internal/
    └── llamacpp/
        ├── client.go    OpenAI-compatible HTTP client with SSE parsing (TODO Phase 1)
        └── health.go    Health check + readiness probe (TODO Phase 1)
```
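
The SSE parsing that `internal/llamacpp/client.go` will need can be sketched as a pure function. A sketch only — `parseSSELine` and the struct are illustrative names, and the real client must also handle `data: [DONE]`, `finish_reason`, and multi-choice responses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sseChunk mirrors the delta payload llama-server streams on
// /v1/chat/completions (only the fields we read).
type sseChunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// parseSSELine extracts the content delta from one "data: {...}" line.
// Returns false for non-data lines, the [DONE] sentinel, or bad JSON.
func parseSSELine(line string) (string, bool) {
	payload, ok := strings.CutPrefix(line, "data: ")
	if !ok || payload == "[DONE]" {
		return "", false
	}
	var c sseChunk
	if err := json.Unmarshal([]byte(payload), &c); err != nil || len(c.Choices) == 0 {
		return "", false
	}
	return c.Choices[0].Delta.Content, true
}

func main() {
	text, ok := parseSSELine(`data: {"choices":[{"delta":{"content":"Hi"}}]}`)
	fmt.Println(text, ok) // Hi true
}
```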

## Design Decisions

### Subprocess over CGO

llama-server (subprocess) was chosen over direct HIP CGO bindings because:

1. **Maturity**: llama-server has millions of users; CGO bindings would start from scratch.
2. **Model support**: 50+ architectures via GGUF; CGO would start with zero.
3. **Isolation**: a GPU crash in the subprocess doesn't kill the Go process.
4. **Maintenance**: the llama.cpp team handles ROCm compatibility; we just build.
5. **Portability**: the same pattern works for CUDA (NVIDIA) and SYCL (Intel).

Trade-offs:

- Extra first-token latency (~50ms per request), plus a one-off subprocess startup and model load
- HTTP overhead vs in-process calls
- Can't share GPU memory between Go and llama-server

### Sibling: go-mlx

go-mlx (macOS) uses direct CGO because MLX is designed for embedding. Both implement the same `go-inference.TextModel` interface, so consumers can import either (or both) transparently.

## Coding Standards

- UK English
- Tests: `testify` assert/require
- Conventional commits: `feat:`, `fix:`, `test:`, `docs:`
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
- Licence: EUPL-1.2
- Build tags: `//go:build linux && amd64` for GPU code, stub for other platforms
**Environment.md** — new file (+87 lines)
# Environment

Validated by Charon, 19 Feb 2026.

## Hardware

| Component | Detail |
|-----------|--------|
| **Machine** | snider-linux (10.69.69.165) |
| **CPU** | AMD Ryzen 9 9950X (32 threads @ 5.7GHz) |
| **GPU (dGPU)** | AMD Radeon RX 7800 XT — gfx1100, RDNA 3, 16GB GDDR6, 60 CUs |
| **GPU (iGPU)** | AMD Radeon Graphics — integrated in CPU |
| **OS** | Ubuntu (kernel 6.17.0-14-generic) |
| **ROCm** | 7.2.0 |
| **Go** | 1.25.6 |

## iGPU Crash — Critical

The Ryzen 9 9950X iGPU is detected by ROCm as Device 1. It reports ~100GB of free memory (actually system RAM, not real VRAM). llama-server's auto-fit logic tries to split model tensors across both devices, causing:

```
ROCm error: unspecified launch failure
  in function ggml_backend_cuda_buffer_set_tensor
```

This crashes with a core dump every time.

**Fix**: always set `HIP_VISIBLE_DEVICES=0` before spawning llama-server:

```go
cmd.Env = append(os.Environ(), "HIP_VISIBLE_DEVICES=0")
```

Without this, **every model load will crash**. This must be in server.go.

## llama-server

| Detail | Value |
|--------|-------|
| **Binary** | `/usr/local/bin/llama-server` |
| **Source** | `/home/claude/llama.cpp/` (commit `11c325c`) |
| **Build flags** | `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release` |
| **Version output** | `ggml_cuda_init: found 2 ROCm devices` (with `HIP_VISIBLE_DEVICES=0`: 1 device) |

### Rebuild

```bash
cd /home/claude/llama.cpp
git pull
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1100 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
sudo cp build/bin/llama-server /usr/local/bin/llama-server
```

### Manual Test

```bash
HIP_VISIBLE_DEVICES=0 llama-server \
  --model /data/lem/gguf/LEK-Gemma3-4B-Q4_K_M.gguf \
  --port 8199 --n-gpu-layers 99 --ctx-size 4096

# In another terminal:
curl -s http://localhost:8199/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":32,"stream":false}' | jq .
```

## ROCm Verification

```bash
rocm-smi        # GPU status, temp, VRAM usage
rocminfo        # Detailed GPU info (confirms gfx1100)
ls -l /dev/kfd  # Must exist (ROCm kernel driver)
```

## Environment Variables

| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `HIP_VISIBLE_DEVICES` | **YES** | Must set to `0` | Masks iGPU to prevent crash |
| `ROCM_LLAMA_SERVER_PATH` | No | `llama-server` (PATH) | Custom binary path |
| `ROCM_MODEL_DIR` | No | none | Default model directory |
| `HSA_OVERRIDE_GFX_VERSION` | No | unset | Not needed (native gfx1100) |
**Fleet-Context.md** — new file (+52 lines)
# Fleet Context

How this repo fits into the wider Core Go agent fleet.

## Your Role

You are the **dedicated go-rocm domain expert**. You own this repo end-to-end.

## The Contract

| File | You Read | You Write |
|------|----------|-----------|
| `CLAUDE.md` | Yes — build commands, architecture, standards | No (Charon/Virgil maintain) |
| `TODO.md` | Yes — pick up tasks in phase order | Yes — mark `[x]` when done |
| `FINDINGS.md` | Yes — previous discoveries | Yes — add new findings |

## Who's Who

| Agent | Where | What They Do |
|-------|-------|--------------|
| **Virgil** | M3 Ultra (macOS) | Framework orchestrator, owns core/go, manages go-inference interfaces |
| **Charon** | snider-linux (this machine) | Linux orchestrator, environment setup, plan review |
| **go-mlx Claude** | M3 Ultra (macOS) | Sibling backend — Metal GPU inference, same TextModel interface |
| **go-i18n Claude** | M3 Ultra (macOS) | Consumer — will use TextModel for batch classification |
| **You** | snider-linux (this machine) | go-rocm implementation |

## Dependencies

```
go-inference (shared interfaces)   ← Virgil manages
        ↑
go-rocm (you implement this)
        ↑
go-ml (wraps both backends)        ← Virgil creates backend_rocm.go when your API is ready
        ↑
go-ai (MCP hub) / go-i18n (classification)
```

## Communication

- **New findings**: write to `FINDINGS.md` in this repo
- **Fleet-wide knowledge**: push to the `core/go-agentic` wiki
- **Cross-repo blockers**: note in `TODO.md` with "Blocked on [package] [phase]"
- **Questions for Virgil**: note in `FINDINGS.md` with a "QUESTION:" prefix

## What You Don't Need to Know

- How go-mlx works internally (different approach, same interface)
- How the fleet delegation pattern works (just follow TODO.md)
- Infrastructure details (Charon handles that)

Focus on implementing the TODO.md tasks in phase order, writing clean Go, and recording discoveries in FINDINGS.md.
**Home.md** — new file (+29 lines)
# go-rocm

AMD ROCm GPU inference backend for Linux. Implements `inference.Backend` and `inference.TextModel` from `core/go-inference` using llama.cpp's server mode with HIP/ROCm.

## Quick Links

- [[Environment]] — Hardware, ROCm, llama-server setup (validated 19 Feb 2026)
- [[Architecture]] — How it works, design decisions, file layout
- [[Interface-Contract]] — go-inference types this package must implement
- [[Models]] — Available GGUF models and VRAM budget
- [[Fleet-Context]] — How this repo fits in the wider agent fleet

## Status

| Phase | Status | Notes |
|-------|--------|-------|
| Phase 0: Environment | Done (Charon, 19 Feb) | ROCm 7.2, llama-server built, baselines recorded |
| Phase 1: Core Implementation | Pending | GPU detection, server lifecycle, HTTP client, TextModel |
| Phase 2: Robustness | Pending | Crash recovery, graceful shutdown, VRAM monitoring |
| Phase 3: Model Support | Pending | GGUF discovery, chat templates, context sizing |
| Phase 4: Performance | Pending | Benchmarks, flash attention, batch inference |

## Module

```
forge.lthn.ai/core/go-rocm
```

Depends on: `forge.lthn.ai/core/go-inference` (shared interfaces, zero deps)
**Interface-Contract.md** — new file (+103 lines)
# Interface Contract

go-rocm must implement these interfaces from `forge.lthn.ai/core/go-inference`.

## Backend

```go
type Backend interface {
	Name() string // Return "rocm"
	LoadModel(path string, opts ...LoadOption) (TextModel, error)
	Available() bool // Check hardware + binary
}
```

### Available() checks

1. `/dev/kfd` exists (ROCm kernel driver loaded)
2. `rocm-smi` detects a GPU (optional — may be slow)
3. llama-server binary findable (PATH or `ROCM_LLAMA_SERVER_PATH` env)

## TextModel

```go
type TextModel interface {
	Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
	Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
	ModelType() string // e.g. "gemma3", "qwen3", "llama3"
	Err() error        // Check after iterator stops
	Close() error      // SIGTERM llama-server, wait for exit
}
```

### Key behaviours

- `Generate()` and `Chat()` return `iter.Seq[Token]` — Go 1.23+ range-over-func
- `iter.Seq` cannot carry errors — consumers must check `Err()` after the loop
- `context.Context` enables cancellation (close the SSE stream, don't kill the server)
- `Close()` sends SIGTERM to the llama-server subprocess and waits for a clean exit
- `ModelType()` should parse from GGUF metadata or the llama-server `/props` endpoint

## Token

```go
type Token struct {
	ID   int32
	Text string
}
```

## Message

```go
type Message struct {
	Role    string // "system", "user", "assistant"
	Content string
}
```

## GenerateConfig (via options)

```go
type GenerateConfig struct {
	MaxTokens     int     // Default: 256
	Temperature   float32 // Default: 0.0 (greedy)
	TopK          int
	TopP          float32
	StopTokens    []int32
	RepeatPenalty float32
}
```

Map these to llama-server's OpenAI-compatible API fields:

- `MaxTokens` → `max_tokens`
- `Temperature` → `temperature`
- `TopK` → `top_k` (llama.cpp extension)
- `TopP` → `top_p`
- `RepeatPenalty` → `repeat_penalty` (llama.cpp extension)

## LoadConfig (via options)

```go
type LoadConfig struct {
	Backend    string // "rocm" (or empty for auto)
	ContextLen int    // → --ctx-size (0 = model default)
	GPULayers  int    // → --n-gpu-layers (-1 = all)
}
```

## Registration

Already done in `register_rocm.go`:

```go
//go:build linux && amd64

func init() {
	inference.Register(&rocmBackend{})
}
```

## Source

The full interface code is at `/home/claude/Code/core/go-inference/inference.go` and `options.go`.
**Models.md** — new file (+66 lines)
# Models

## VRAM Budget

The RX 7800 XT has 16GB GDDR6. Budget:

```
Total VRAM:           16,384 MiB
Model:                varies (see table)
KV cache:             ~80-500 MiB (depends on context length + layers)
Compute buffer:       ~500 MiB
Overhead:             ~200 MiB
─────────────────────────────
Available for model:  ~15,000 MiB
```
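
The budget above reduces to a quick arithmetic check. The constants mirror this page's estimates; `fitsVRAM` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// fitsVRAM applies the budget: model weights + KV cache + compute buffer
// + overhead must stay within total VRAM. All values are in MiB.
func fitsVRAM(modelMiB, kvCacheMiB int) bool {
	const (
		totalMiB    = 16384
		computeMiB  = 500
		overheadMiB = 200
	)
	return modelMiB+kvCacheMiB+computeMiB+overheadMiB <= totalMiB
}

func main() {
	fmt.Println(fitsVRAM(2765, 500))  // Gemma3-4B Q4 (~2.7G): true
	fmt.Println(fitsVRAM(16384, 500)) // Gemma3-27B Q4 (~16G): false
}
```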

## Available Models

All at `/data/lem/gguf/` (SMB mount from M3 Ultra at `//10.69.69.108/Data`).

| Model | File | Size | VRAM (est) | Fits? | Notes |
|-------|------|------|------------|-------|-------|
| Gemma3-1B Q5 | LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf | ~0.9G | ~1.5G | Yes | Smallest, good for quick tests |
| Gemma3-1B Q8 | LEK-Gemma3-1B-layered-v2-Q8_0.gguf | ~1.4G | ~2G | Yes | Higher quality 1B |
| Gemma3-4B Q4 | LEK-Gemma3-4B-Q4_K_M.gguf | 2.7G | ~3.4G | Yes | **Baseline tested: 109 tok/s** |
| Gemma3-12B Q4 | LEK-Gemma3-12B-Q4_K_M.gguf | ~7.5G | ~9G | Yes | Good quality/speed balance |
| Gemma3-27B Q4 | LEK-Gemma3-27B-Q4_K_M.gguf | ~16G | ~18G | No | Estimate exceeds 16GB — needs reduced context or partial CPU offload |
| Llama-3.1-8B Q4 | LEK-Llama-3.1-8B-Q4_K_M.gguf | ~5G | ~6G | Yes | |
| Mistral-7B Q4 | LEK-Mistral-7B-v0.3-Q4_K_M.gguf | ~4G | ~5G | Yes | |
| Qwen-2.5-7B Q4 | LEK-Qwen-2.5-7B-Q4_K_M.gguf | ~4G | ~5G | Yes | |

## Recommended Test Model

Use **LEK-Gemma3-4B-Q4_K_M.gguf** for development/testing:

- Small enough to load fast (~6s)
- Leaves plenty of VRAM headroom
- Has a validated baseline: 396 tok/s prefill, 109 tok/s decode
- Chat template works natively in llama-server

For integration tests that need to be fast, use **LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf**.

## SMB Mount Note

Models are on an SMB mount from the M3 Ultra. If the mount is unavailable:

```go
if _, err := os.Stat(modelPath); os.IsNotExist(err) {
	t.Skip("model not available (SMB mount down?)")
}
```

For CI, models would need to be copied locally or tests skipped.

## Baseline Benchmarks

Gemma3-4B-Q4_K_M on RX 7800 XT (Charon, 19 Feb 2026):

| Metric | Value |
|--------|-------|
| Prefill | 396 tok/s (2.5ms/tok) |
| Decode | 109 tok/s (9.2ms/tok) |
| Time to first token | ~40ms (16-token prompt) |
| Startup | ~6s (load + warmup) |
| Context | 4096 (model supports 131072) |
| Flash attention | Auto-enabled |
| Concurrent slots | 4 |