Architecture
Claude edited this page 2026-02-19 20:07:58 +00:00
How It Works
Go process (go-rocm)
│
├─ Available() Checks: /dev/kfd + rocm-smi + llama-server binary
│
├─ LoadModel() Spawns llama-server subprocess
│ │
│ ├─ HIP_VISIBLE_DEVICES=0 (mask iGPU)
│ ├─ --model /path/to/model.gguf
│ ├─ --port <random free port>
│ ├─ --n-gpu-layers <from LoadConfig.GPULayers>
│ ├─ --ctx-size <from LoadConfig.ContextLen>
│ └─ Waits for /health endpoint → returns TextModel
│
├─ Generate() POST /v1/completions (SSE streaming)
├─ Chat() POST /v1/chat/completions (SSE streaming)
│ └─ Parses: data: {"choices":[{"delta":{"content":"..."}}]}
│
└─ Close() SIGTERM → wait for clean exit
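The LoadModel() flow above can be sketched in Go. This is a minimal illustration, not the actual go-rocm code: `LoadConfig`'s field names are taken from the diagram, while `freePort`, `spawn`, and `waitHealthy` are hypothetical helper names.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os/exec"
	"time"
)

// LoadConfig mirrors the fields referenced in the diagram (assumed shape).
type LoadConfig struct {
	ModelPath  string
	GPULayers  int
	ContextLen int
}

// freePort asks the kernel for an unused TCP port by binding to :0.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// spawn starts llama-server with the flags from the diagram.
func spawn(ctx context.Context, cfg LoadConfig) (*exec.Cmd, int, error) {
	port, err := freePort()
	if err != nil {
		return nil, 0, err
	}
	cmd := exec.CommandContext(ctx, "llama-server",
		"--model", cfg.ModelPath,
		"--port", fmt.Sprint(port),
		"--n-gpu-layers", fmt.Sprint(cfg.GPULayers),
		"--ctx-size", fmt.Sprint(cfg.ContextLen),
	)
	// Mask the iGPU so llama-server only sees the discrete card.
	cmd.Env = append(cmd.Environ(), "HIP_VISIBLE_DEVICES=0")
	if err := cmd.Start(); err != nil {
		return nil, 0, err
	}
	return cmd, port, nil
}

// waitHealthy polls /health until the server is ready or the deadline passes.
func waitHealthy(port int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	url := fmt.Sprintf("http://127.0.0.1:%d/health", port)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("llama-server not healthy after %s", timeout)
}

func main() {
	port, err := freePort()
	if err != nil {
		panic(err)
	}
	fmt.Println("would spawn llama-server on port", port)
}
```

Binding to port 0 and reading back the assigned port is one way to implement the "random free port" step; there is a small race window between releasing and reusing the port, which is usually acceptable for local subprocess startup.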
File Layout
go-rocm/
├── rocm.go Package doc (exists)
├── register_rocm.go //go:build linux && amd64 — init() registers backend (exists)
├── rocm_stub.go //go:build !linux || !amd64 — ROCmAvailable() false (exists)
├── backend.go inference.Backend: Name(), Available(), LoadModel() (exists, stubs)
├── model.go inference.TextModel: Generate(), Chat(), Err(), Close() (TODO Phase 1)
├── server.go llama-server lifecycle: spawn, health poll, shutdown (TODO Phase 1)
├── go.mod Module with go-inference dependency (exists)
├── CLAUDE.md Build/arch/standards contract (exists)
├── TODO.md Phased task queue (exists)
├── FINDINGS.md Research notes (exists)
└── internal/
└── llamacpp/
├── client.go OpenAI-compatible HTTP client with SSE parsing (TODO Phase 1)
└── health.go Health check + readiness probe (TODO Phase 1)
Design Decisions
Subprocess over CGO
llama-server (subprocess) was chosen over direct HIP CGO bindings because:
- Maturity: llama-server has millions of users. CGO bindings would start from scratch.
- Model support: 50+ architectures via GGUF. CGO would start with zero.
- Isolation: GPU crash in subprocess doesn't kill the Go process.
- Maintenance: llama.cpp team handles ROCm compat. We just build.
- Portability: Same pattern works for CUDA (NVIDIA) and SYCL (Intel).
Trade-offs:
- ~50ms latency for first token (subprocess startup + model load)
- HTTP overhead vs in-process calls
- Can't share GPU memory between Go and llama-server
Sibling: go-mlx
go-mlx (macOS) uses direct CGO because MLX is designed for embedding. Both implement the same go-inference.TextModel interface, allowing consumers to import either (or both) transparently.
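For orientation, the shared contract might look roughly like this. The method names (Generate, Chat, Err, Close) come from this page; the parameter and return types below are assumptions, and the real signatures live in go-inference.

```go
package main

import (
	"context"
	"fmt"
)

// Message is an assumed chat-message shape.
type Message struct {
	Role    string
	Content string
}

// TextModel is a sketch of the interface both go-rocm and go-mlx implement.
type TextModel interface {
	// Generate streams completion tokens for a raw prompt.
	Generate(ctx context.Context, prompt string) (<-chan string, error)
	// Chat streams assistant tokens for a conversation.
	Chat(ctx context.Context, msgs []Message) (<-chan string, error)
	// Err reports any terminal error from the last stream.
	Err() error
	// Close shuts the backend down (SIGTERM for the subprocess backend).
	Close() error
}

func main() {
	var m TextModel // nil until a concrete backend is loaded
	fmt.Println(m == nil)
}
```

Because consumers depend only on this interface, a caller can switch between the ROCm and MLX backends without changing call sites.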
Coding Standards
- UK English
- Tests: testify assert/require
- Conventional commits: feat:, fix:, test:, docs:
- Co-Author: Co-Authored-By: Virgil <virgil@lethean.io>
- Licence: EUPL-1.2
- Build tags: //go:build linux && amd64 for GPU code, stub for other platforms
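As a sketch, the stub half of that pattern (rocm_stub.go in the file layout above) looks like this; the complementary GPU file carries `//go:build linux && amd64` instead, so exactly one of the two compiles on any platform. This is an illustrative skeleton, not the repository's actual file contents.

```go
//go:build !linux || !amd64

// Package rocm: stub compiled on every platform except Linux/amd64.
package rocm

// ROCmAvailable always reports false where ROCm cannot be present,
// letting cross-platform callers compile without build-tag checks.
func ROCmAvailable() bool { return false }
```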