Claude edited this page 2026-02-19 20:07:58 +00:00

Architecture

How It Works

Go process (go-rocm)
  │
  ├─ Available()     Checks: /dev/kfd + rocm-smi + llama-server binary
  │
  ├─ LoadModel()     Spawns llama-server subprocess
  │    │
  │    ├─ HIP_VISIBLE_DEVICES=0 (mask iGPU)
  │    ├─ --model /path/to/model.gguf
  │    ├─ --port <random free port>
  │    ├─ --n-gpu-layers <from LoadConfig.GPULayers>
  │    ├─ --ctx-size <from LoadConfig.ContextLen>
  │    └─ Waits for /health endpoint → returns TextModel
  │
  ├─ Generate()      POST /v1/completions (SSE streaming)
  ├─ Chat()          POST /v1/chat/completions (SSE streaming)
  │    └─ Parses: data: {"choices":[{"delta":{"content":"..."}}]}
  │
  └─ Close()         SIGTERM → wait for clean exit

File Layout

go-rocm/
├── rocm.go              Package doc (exists)
├── register_rocm.go     //go:build linux && amd64 — init() registers backend (exists)
├── rocm_stub.go         //go:build !linux || !amd64 — ROCmAvailable() false (exists)
├── backend.go           inference.Backend: Name(), Available(), LoadModel() (exists, stubs)
├── model.go             inference.TextModel: Generate(), Chat(), Err(), Close() (TODO Phase 1)
├── server.go            llama-server lifecycle: spawn, health poll, shutdown (TODO Phase 1)
├── go.mod               Module with go-inference dependency (exists)
├── CLAUDE.md            Build/arch/standards contract (exists)
├── TODO.md              Phased task queue (exists)
├── FINDINGS.md          Research notes (exists)
└── internal/
    └── llamacpp/
        ├── client.go    OpenAI-compatible HTTP client with SSE parsing (TODO Phase 1)
        └── health.go    Health check + readiness probe (TODO Phase 1)

Design Decisions

Subprocess over CGO

llama-server (subprocess) was chosen over direct HIP CGO bindings because:

  1. Maturity: llama-server has millions of users. CGO bindings would start from scratch.
  2. Model support: 50+ architectures via GGUF. CGO would start with zero.
  3. Isolation: GPU crash in subprocess doesn't kill the Go process.
  4. Maintenance: llama.cpp team handles ROCm compat. We just build.
  5. Portability: Same pattern works for CUDA (NVIDIA) and SYCL (Intel).

Trade-offs:

  • Added first-token latency (~50ms subprocess startup, plus model load time)
  • HTTP overhead vs in-process calls
  • Can't share GPU memory between Go and llama-server

Sibling: go-mlx

go-mlx (macOS) uses direct CGO because MLX is designed for embedding. Both implement the same go-inference.TextModel interface, allowing consumers to import either (or both) transparently.

Coding Standards

  • UK English
  • Tests: testify assert/require
  • Conventional commits: feat:, fix:, test:, docs:
  • Co-Author: Co-Authored-By: Virgil <virgil@lethean.io>
  • Licence: EUPL-1.2
  • Build tags: //go:build linux && amd64 for GPU code, stub for other platforms
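The build-tag convention in the last bullet pairs a gated file with a stub, as in register_rocm.go and rocm_stub.go. The sketch below shows the stub side; the constraints are quoted in a comment (rather than applied) so the snippet compiles on any platform, and the function body is illustrative rather than the repo's actual code.

```go
package main

import "fmt"

// Stub pattern from rocm_stub.go: that file carries the constraint
//
//	//go:build !linux || !amd64
//
// while register_rocm.go carries //go:build linux && amd64 plus an init()
// that registers the real backend. Exactly one of the two files is
// compiled into any given build.

// ROCmAvailable reports whether the ROCm backend can be used; the stub
// always answers false so non-Linux/amd64 builds still compile and link.
func ROCmAvailable() bool {
	return false
}

func main() {
	fmt.Println(ROCmAvailable()) // false
}
```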