feat: scaffold go-rocm AMD GPU inference package
Implements inference.Backend via llama-server subprocess (llama.cpp + HIP/ROCm).
Targets RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM).

Includes:
- Backend registration with build tags (linux/amd64)
- Stub backend.go with llama-server lifecycle outline
- CLAUDE.md with build instructions for llama.cpp + ROCm
- TODO.md with 5-phase task queue
- FINDINGS.md with hardware specs, VRAM budget, design rationale

Co-Authored-By: Virgil <virgil@lethean.io>
parent 252e28e81e
commit aa42cff417
8 changed files with 350 additions and 0 deletions
102
CLAUDE.md
Normal file
# CLAUDE.md

## What This Is

AMD ROCm GPU inference for Linux. Module: `forge.lthn.ai/core/go-rocm`

Implements `inference.Backend` and `inference.TextModel` (from `core/go-inference`) using llama.cpp compiled with HIP/ROCm. Targets AMD RDNA 3+ GPUs.

## Target Hardware

- **GPU**: AMD Radeon RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM)
- **OS**: Ubuntu 24.04 LTS (linux/amd64)
- **ROCm**: 6.x+ (gfx1100/gfx1101 officially supported)
- **Kernel**: 6.10+ recommended for RDNA 3 stability

## Commands

```bash
go test ./...                    # Run all tests (stubs on non-Linux)
go test -tags rocm ./...         # Run with ROCm integration tests

# On the Linux homelab:
go test -v -run TestROCm ./...   # Full GPU tests
```

## Architecture

```
go-rocm (this package)
├── rocm.go              Package doc
├── register_rocm.go     //go:build linux && amd64 — auto-registers via init()
├── rocm_stub.go         //go:build !linux || !amd64 — ROCmAvailable() false
├── backend.go           inference.Backend implementation
├── model.go             inference.TextModel implementation (TODO)
├── server.go            llama-server lifecycle management (TODO)
└── internal/
    └── llamacpp/        llama-server HTTP client (TODO)
        ├── client.go    OpenAI-compatible API client
        └── health.go    Health check + readiness probe
```

### How It Works

1. `LoadModel()` spawns `llama-server` (llama.cpp) as a subprocess
2. llama-server loads the GGUF model onto the AMD GPU via HIP/ROCm
3. `Generate()` / `Chat()` make HTTP requests to llama-server's OpenAI-compatible API
4. Token streaming via SSE (Server-Sent Events) from llama-server
5. `Close()` sends SIGTERM to llama-server, waits for clean exit

This is the subprocess approach (not CGO). It's simpler, more maintainable, and llama.cpp's server mode is battle-tested.
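Step 5 (SIGTERM, then wait for a clean exit) could look roughly like the sketch below. It is an illustration under assumptions, not the package's implementation: `stopGracefully`, `demo`, and the five-second grace period are hypothetical, and `sleep` stands in for `llama-server` so the example runs without a GPU.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// stopGracefully sends SIGTERM to a started command and waits for it to exit.
// If the process is still running after the grace period, it is killed.
// Hypothetical helper for illustration; not part of go-rocm.
func stopGracefully(cmd *exec.Cmd, grace time.Duration) error {
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return fmt.Errorf("send SIGTERM: %w", err)
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case <-done:
		// The process exited after SIGTERM. Wait reports death by signal as
		// an error, which is expected here, so it is not propagated.
		return nil
	case <-time.After(grace):
		// Grace period expired: force-kill and reap the process.
		_ = cmd.Process.Kill()
		<-done
		return fmt.Errorf("process did not exit within %s, killed", grace)
	}
}

// demo starts a stand-in process and shuts it down again.
func demo() error {
	cmd := exec.Command("sleep", "30") // stand-in for llama-server
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start: %w", err)
	}
	return stopGracefully(cmd, 5*time.Second)
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("server stopped cleanly")
}
```

The kill fallback is a design choice of this sketch: it keeps `Close()` from blocking forever if the server hangs during shutdown.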
### Dependencies

- `forge.lthn.ai/core/go-inference` — shared TextModel/Backend interfaces
- llama-server binary (external, not Go dependency) built with `-DGGML_HIP=ON`

## Building llama-server with ROCm

```bash
# On the Linux homelab:
sudo apt install rocm-dev rocm-libs   # ROCm 6.x

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel -t llama-server

# Binary at build/bin/llama-server
# Copy to /usr/local/bin/ or set ROCM_LLAMA_SERVER_PATH
```
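On the Go side, locating the binary built above might be sketched as follows. The lookup order (environment variable first, then `PATH`) follows the comment in the build script; `resolveServerPath` and its injected `lookPath` parameter are hypothetical names chosen so the logic can be tested without a real binary.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// resolveServerPath picks the llama-server binary.
// override is the value of ROCM_LLAMA_SERVER_PATH (may be empty);
// lookPath is normally exec.LookPath and is injected for testability.
// Hypothetical helper for illustration; not part of go-rocm.
func resolveServerPath(override string, lookPath func(string) (string, error)) (string, error) {
	if override != "" {
		return override, nil
	}
	path, err := lookPath("llama-server")
	if err != nil {
		return "", fmt.Errorf("llama-server not found (set ROCM_LLAMA_SERVER_PATH or add it to PATH): %w", err)
	}
	return path, nil
}

func main() {
	path, err := resolveServerPath(os.Getenv("ROCM_LLAMA_SERVER_PATH"), exec.LookPath)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using", path)
}
```

The sketch trusts the override as given; a fuller version would probably also check that the file exists and is executable.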
### Performance Tip

The RX 7800 XT is gfx1101 but the ROCm compiler generates identical code for gfx1100. Setting:
```bash
export HSA_OVERRIDE_GFX_VERSION=11.0.0
```
...gives better performance on some ROCm versions. Benchmark both.
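When the server is spawned from Go, the override can be scoped to the subprocess instead of being exported globally. The sketch below shows one way to build such an environment; `withEnv` is a hypothetical helper, and the values are only the ones mentioned in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// withEnv returns a copy of env with key set to value, replacing any
// existing entry for key. env uses the "KEY=value" form of os.Environ().
// Hypothetical helper for illustration; not part of go-rocm.
func withEnv(env []string, key, value string) []string {
	out := make([]string, 0, len(env)+1)
	prefix := key + "="
	for _, kv := range env {
		if !strings.HasPrefix(kv, prefix) {
			out = append(out, kv)
		}
	}
	return append(out, prefix+value)
}

func main() {
	base := []string{"PATH=/usr/bin", "HSA_OVERRIDE_GFX_VERSION=11.0.1"}
	env := withEnv(base, "HSA_OVERRIDE_GFX_VERSION", "11.0.0")
	fmt.Println(env)
	// In a real launcher this slice would start from os.Environ() and be
	// assigned to exec.Cmd.Env, so only the subprocess sees the override.
}
```

Keeping the override per-process makes it easy to benchmark both settings from the same test binary.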
## Coding Standards

- UK English
- Tests: testify assert/require
- Conventional commits
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
- Licence: EUPL-1.2

## Coordination

- **Virgil** (core/go) is the orchestrator — writes tasks here
- **go-mlx Claude** is the sibling — Metal backend on macOS, same interface contract
- **go-inference** defines the shared TextModel/Backend interfaces both backends implement
- **go-ml** wraps both backends into the scoring engine

## Task Queue

See `TODO.md` for prioritised work.
See `FINDINGS.md` for research notes.
94
FINDINGS.md
Normal file
# FINDINGS.md — go-rocm Research & Discovery

---

## 2026-02-19: Package Creation (Virgil)

### Hardware

- **GPU**: AMD Radeon RX 7800 XT
- **Architecture**: RDNA 3, gfx1101
- **VRAM**: 16GB GDDR6
- **Compute Units**: 60
- **OS**: Linux (Ubuntu, homelab machine)

### ROCm Support Status

- gfx1100/gfx1101 officially supported in ROCm 6.x+
- Supported on Ubuntu 24.04.3 and 22.04.5
- Kernel 6.10+ recommended for RDNA 3 stability
- `/dev/kfd` device node required (amdgpu kernel driver)

Sources:
- [ROCm system requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html)
- [ROCm compatibility matrix](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html)
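The `/dev/kfd` requirement gives a cheap first check for GPU detection. A minimal sketch, assuming that the presence of the device node is a good enough signal that the amdgpu compute driver is loaded (`nodeExists` is a hypothetical helper; it does not confirm that a supported GPU is attached):

```go
package main

import (
	"fmt"
	"os"
)

// nodeExists reports whether a filesystem node exists at path.
// Hypothetical helper for illustration; not part of go-rocm.
func nodeExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	if nodeExists("/dev/kfd") {
		fmt.Println("/dev/kfd present: amdgpu compute driver appears to be loaded")
	} else {
		fmt.Println("/dev/kfd missing: ROCm backend unavailable")
	}
}
```

A real `Available()` would likely combine this with the `rocm-smi` and llama-server checks listed in TODO.md.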
### llama.cpp + ROCm

llama.cpp has mature ROCm/HIP support. Build flags:

```bash
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release
```

Key findings:
- RX 7800 XT is gfx1101, but ROCm compiler generates identical code for gfx1100
- `HSA_OVERRIDE_GFX_VERSION=11.0.0` may give better performance (benchmark needed)
- rocWMMA flash attention (`-DGGML_HIP_ROCWMMA_FATTN=ON`) available for RDNA 3+
- Docker images may not support hipBLASLt for gfx1100, falling back to hipBLAS
- llama-server provides OpenAI-compatible API with SSE streaming

Sources:
- [llama.cpp ROCm build docs](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
- [llama.cpp ROCm compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html)
- [llama.cpp ROCm install guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html)
- [RX 7800 XT build discussion](https://github.com/ggml-org/llama.cpp/discussions/11572)
### Design Decision: Subprocess vs CGO

**Chose subprocess** (llama-server) over direct HIP CGO bindings because:

1. **Maturity**: llama-server is battle-tested with millions of users. Direct HIP CGO would take months to reach comparable stability.
2. **Model support**: llama.cpp supports 50+ model architectures via GGUF. CGO would start with zero.
3. **Maintenance**: llama.cpp team handles ROCm compatibility. We just build the binary.
4. **Isolation**: GPU crashes in the subprocess don't take down the Go process.
5. **Portability**: Same approach works for NVIDIA (CUDA build), Intel (SYCL build) with minimal code changes.

Trade-offs:
- Subprocess adds ~50ms latency for first token (process startup + model load)
- Inter-process communication overhead (HTTP vs in-process)
- Can't share GPU memory between Go process and llama-server

The go-mlx package uses direct CGO because MLX is a C library designed for embedding. llama.cpp's primary API is its server mode.
### VRAM Budget (16GB)

| Model | Quant | VRAM (model) | Context (4K) | Total | Fits? |
|-------|-------|-------------|-------------|-------|-------|
| Qwen3-8B | Q4_K_M | ~5GB | ~0.5GB | ~5.5GB | Yes |
| Gemma3-4B | Q4_K_M | ~3GB | ~0.3GB | ~3.3GB | Yes |
| Llama3-8B | Q4_K_M | ~5GB | ~0.5GB | ~5.5GB | Yes |
| Qwen3-8B | Q8_0 | ~9GB | ~0.5GB | ~9.5GB | Yes |
| Llama3-70B | Q4_K_M | ~40GB | ~2GB | ~42GB | No (partial offload) |

16GB VRAM comfortably runs any 8B model in Q4 or Q8 quantisation. 13B models fit in Q4. Larger models need partial GPU offload (GPULayers option).
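The arithmetic behind the "Fits?" column can be sketched as a simple sum check. The figures below are the rough estimates from the table, not measurements, and the 1GB headroom for runtime and compute buffers is an assumption of this sketch; `estimate` and `fits` are hypothetical names.

```go
package main

import "fmt"

// estimate mirrors one row of the table: approximate weight and context
// (KV cache) sizes in GB. Values are rough estimates, not measurements.
type estimate struct {
	name      string
	modelGB   float64
	contextGB float64
}

// fits reports whether weights plus context fit in vramGB while leaving
// headroomGB free. The headroom parameter is an assumption of this sketch.
func fits(e estimate, vramGB, headroomGB float64) bool {
	return e.modelGB+e.contextGB+headroomGB <= vramGB
}

func main() {
	rows := []estimate{
		{"Qwen3-8B Q4_K_M", 5, 0.5},
		{"Qwen3-8B Q8_0", 9, 0.5},
		{"Llama3-70B Q4_K_M", 40, 2},
	}
	for _, r := range rows {
		fmt.Printf("%-20s total ~%.1fGB fits=%v\n",
			r.name, r.modelGB+r.contextGB, fits(r, 16, 1))
	}
}
```

A check like this could decide when to fall back to partial offload instead of failing at load time.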
---

## 2026-02-19: Sibling Architecture (go-mlx comparison)

| Aspect | go-mlx (macOS) | go-rocm (Linux) |
|--------|---------------|-----------------|
| GPU | Apple Metal (M-series) | AMD ROCm (RDNA 3) |
| Build tag | `darwin && arm64` | `linux && amd64` |
| Approach | Direct CGO (mlx-c) | Subprocess (llama-server) |
| Model format | Safetensors | GGUF |
| Shared interface | `go-inference.TextModel` | `go-inference.TextModel` |
| Memory control | `SetCacheLimit`, `GetActiveMemory` | `rocm-smi` / HIP API |
| Chat templates | Built into model code | llama-server `--chat-template` |

Both register as `inference.Backend` via build-tagged `init()`. go-ml wraps both transparently.
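The registration pattern both backends share can be illustrated with a self-contained toy registry. Everything here is a stand-in: the real `Backend` interface and `Register` function live in go-inference and may differ, and the "first available backend wins" selection is an assumption of this sketch.

```go
package main

import "fmt"

// Backend is a reduced, hypothetical version of the shared interface;
// the real one is defined in go-inference and has more methods.
type Backend interface {
	Name() string
	Available() bool
}

var backends []Backend

// Register adds a backend to the package-level list. Platform packages
// would call it from init() in a build-tagged file.
func Register(b Backend) { backends = append(backends, b) }

// firstAvailable returns the first registered backend that reports
// itself usable on this machine.
func firstAvailable() (Backend, bool) {
	for _, b := range backends {
		if b.Available() {
			return b, true
		}
	}
	return nil, false
}

// fakeBackend stands in for the real rocm and mlx backends.
type fakeBackend struct {
	name string
	ok   bool
}

func (f fakeBackend) Name() string    { return f.name }
func (f fakeBackend) Available() bool { return f.ok }

func init() {
	// In go-rocm the equivalent call sits behind //go:build linux && amd64.
	Register(fakeBackend{name: "rocm", ok: false})
	Register(fakeBackend{name: "mlx", ok: true})
}

func main() {
	if b, ok := firstAvailable(); ok {
		fmt.Println("selected backend:", b.Name())
	}
}
```

Because registration happens in `init()`, a blank import of the backend package is enough to make it selectable.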
75
TODO.md
Normal file
# TODO.md — go-rocm Task Queue

Dispatched from core/go orchestration. Pick up tasks in order.

---

## Phase 0: Environment Setup (on Linux homelab)

- [ ] **Install ROCm 6.x** — Follow [ROCm install guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/). Ubuntu 24.04 recommended. Verify with `rocm-smi` showing RX 7800 XT.
- [ ] **Build llama-server with HIP** — Clone llama.cpp, build with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON`. Verify binary runs: `llama-server --help`.
- [ ] **Test manual inference** — Download a GGUF model (e.g. Qwen3-8B-Q4_K_M). Run `llama-server --model /path/to/model.gguf -ngl 99`. Test with curl against the OpenAI-compatible API. Record tokens/sec.
- [ ] **HSA_OVERRIDE_GFX_VERSION benchmark** — Test with `11.0.0` vs `11.0.1` vs unset. The RX 7800 XT is gfx1101 but gfx1100 codegen may be faster. Record results in FINDINGS.md.
## Phase 1: Core Implementation

- [ ] **GPU detection** — Implement `Available()` in backend.go. Check: `/dev/kfd` exists (ROCm kernel driver), `rocm-smi` detects GPU, llama-server binary is findable (PATH or `ROCM_LLAMA_SERVER_PATH` env).
- [ ] **Server lifecycle** — Create `server.go`: spawn llama-server with `--model`, `--port` (random free port), `--n-gpu-layers` (from LoadConfig.GPULayers), `--ctx-size` (from LoadConfig.ContextLen). Wait for `/health` endpoint. Handle SIGTERM on Close().
- [ ] **HTTP client** — Create `internal/llamacpp/client.go`: POST `/v1/chat/completions` with streaming (SSE). Parse `data: {"choices":[{"delta":{"content":"..."}}]}` into inference.Token stream.
- [ ] **TextModel implementation** — Create `model.go`: implement inference.TextModel wrapping the HTTP client. Generate() sends single-turn prompt, Chat() sends multi-turn messages. Both stream via iter.Seq[Token]. Err() returns last error.
- [ ] **Integration test** — Test end-to-end: LoadModel → Generate → tokens received → Close. Requires GGUF model on disk. Use `t.Skip()` when model/GPU unavailable.
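The SSE parsing described in the HTTP client task could be sketched as below. This is a sketch under assumptions: it handles only `data: ` lines with the JSON shape quoted above, assumes the OpenAI-style `data: [DONE]` terminator, and emits plain strings where the real client would build `inference.Token` values. `readTokens` and `collect` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// chunk matches the subset of a streaming chat completion event used here.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// readTokens scans an SSE body and calls emit for each non-empty content
// delta. It stops at the "[DONE]" sentinel or at end of input.
func readTokens(r io.Reader, emit func(string)) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		data, ok := strings.CutPrefix(sc.Text(), "data: ")
		if !ok {
			continue // blank separators, comments, other SSE fields
		}
		if data == "[DONE]" {
			return nil
		}
		var c chunk
		if err := json.Unmarshal([]byte(data), &c); err != nil {
			return fmt.Errorf("decode event: %w", err)
		}
		for _, ch := range c.Choices {
			if ch.Delta.Content != "" {
				emit(ch.Delta.Content)
			}
		}
	}
	return sc.Err()
}

// collect is a convenience wrapper that parses a whole string.
func collect(body string) ([]string, error) {
	var toks []string
	err := readTokens(strings.NewReader(body), func(s string) { toks = append(toks, s) })
	return toks, err
}

func main() {
	body := "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}\n\n" +
		"data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}\n\n" +
		"data: [DONE]\n\n"
	toks, err := collect(body)
	fmt.Println(toks, err)
}
```

In the real client the reader would be the HTTP response body, and `emit` would yield into the `iter.Seq[Token]` returned by `Generate()`.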
## Phase 2: Robustness

- [ ] **Server crash recovery** — If llama-server dies mid-generation, detect via process exit, return error via Err(), allow re-load.
- [ ] **Port conflict handling** — If the random port is taken, retry with a different port.
- [ ] **Graceful shutdown** — On context cancellation, stop the current request cleanly (close SSE stream), don't kill the server. Only Close() kills the server.
- [ ] **Memory monitoring** — Use `rocm-smi --showmeminfo vram` or HIP API to report VRAM usage. Expose via package-level functions (like go-mlx's GetActiveMemory).
- [ ] **Concurrent requests** — llama-server supports concurrent slots. Test with multiple goroutines calling Generate() simultaneously. Document max concurrency.
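For the port conflict task, one possible shape is to ask the kernel for a free loopback port and retry when the launch fails. In this sketch `freePort` and `startWithRetry` are hypothetical helpers, and the `start` callback stands in for launching llama-server with `--port`.

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the kernel for an unused TCP port on the loopback interface.
// The port is released before returning, so another process can still take
// it before the server binds; callers should be ready to retry.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// startWithRetry picks a fresh port for each attempt and calls start until
// it succeeds or the attempts run out.
func startWithRetry(attempts int, start func(port int) error) (int, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		port, err := freePort()
		if err != nil {
			return 0, err
		}
		if lastErr = start(port); lastErr == nil {
			return port, nil
		}
	}
	return 0, fmt.Errorf("no usable port after %d attempts: %v", attempts, lastErr)
}

func main() {
	calls := 0
	port, err := startWithRetry(3, func(port int) error {
		calls++
		if calls == 1 {
			return fmt.Errorf("port %d already in use", port) // simulated conflict
		}
		return nil
	})
	fmt.Println(port > 0, calls, err)
}
```

The gap between releasing the probe listener and the server binding is why the retry loop is needed at all.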
## Phase 3: Model Support

- [ ] **GGUF model discovery** — Implement model path scanning: find .gguf files, parse metadata (model name, params, quant level, size). Return structured inventory.
- [ ] **Chat templates** — llama-server handles chat templates natively via `--chat-template`. Verify Gemma3, Qwen3, Llama3 templates work. If not, add template formatting in model.go.
- [ ] **Context window sizing** — Auto-detect optimal context window from model metadata. Default to 4096 if unknown.
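Model discovery would start by recognising a GGUF file from its header. The sketch below decodes only the fixed-size prefix (magic, version, tensor count, metadata key/value count), assuming the little-endian layout used by GGUF version 2 and later. Reading the model name, parameter count, or quant level means walking the metadata key/value section, which is not shown. `ggufHeader` and `parseGGUFHeader` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// ggufHeader holds the fixed-size fields at the start of a GGUF file
// (assumed layout: GGUF version 2 and later, little-endian).
type ggufHeader struct {
	Version     uint32
	TensorCount uint64
	KVCount     uint64
}

// parseGGUFHeader decodes the first 24 bytes of a GGUF file.
func parseGGUFHeader(b []byte) (ggufHeader, error) {
	if len(b) < 24 {
		return ggufHeader{}, errors.New("short header")
	}
	if string(b[:4]) != "GGUF" {
		return ggufHeader{}, errors.New("not a GGUF file")
	}
	h := ggufHeader{
		Version:     binary.LittleEndian.Uint32(b[4:8]),
		TensorCount: binary.LittleEndian.Uint64(b[8:16]),
		KVCount:     binary.LittleEndian.Uint64(b[16:24]),
	}
	if h.Version < 2 {
		return ggufHeader{}, fmt.Errorf("unsupported GGUF version %d", h.Version)
	}
	return h, nil
}

func main() {
	// Synthetic header for demonstration: version 3, 291 tensors, 24 KV pairs.
	b := []byte("GGUF")
	b = binary.LittleEndian.AppendUint32(b, 3)
	b = binary.LittleEndian.AppendUint64(b, 291)
	b = binary.LittleEndian.AppendUint64(b, 24)
	fmt.Println(parseGGUFHeader(b))
}
```

A directory scan could read just these 24 bytes per `.gguf` file to filter out truncated or mislabelled downloads cheaply.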
## Phase 4: Performance

- [ ] **Benchmark suite** — Measure: tokens/sec (prefill + decode), time-to-first-token, VRAM usage, for Qwen3-8B-Q4, Gemma3-4B, Llama3-8B on the RX 7800 XT. Compare with mlx on M3 Ultra.
- [ ] **Flash attention** — Verify `-DGGML_HIP_ROCWMMA_FATTN=ON` gives real speedup on gfx1100. Benchmark with and without.
- [ ] **Batch inference** — llama-server supports multiple slots for concurrent inference. Test parallel prompts for go-i18n's batch classification use case.

## Phase 5: Alternative Backends

- [ ] **Direct HIP/CGO** — Evaluate whether direct HIP CGO bindings (like go-mlx does for Metal) would be worth the effort. Only if llama-server subprocess becomes a bottleneck.
- [ ] **vLLM backend** — vLLM supports ROCm and has better batching. Could be an alternative subprocess backend for high-throughput scenarios.

---

## Model Inventory (on Linux homelab)

Download to `/data/models/` (or wherever the homelab stores data):
- [ ] Qwen3-8B-Q4_K_M.gguf (~5GB, fits 16GB VRAM with room for context)
- [ ] Gemma3-4B-Q4_K_M.gguf (~3GB)
- [ ] Llama-3.1-8B-Q4_K_M.gguf (~5GB)

## Environment Variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `ROCM_LLAMA_SERVER_PATH` | `llama-server` (PATH lookup) | Path to llama-server binary |
| `HSA_OVERRIDE_GFX_VERSION` | unset | Override the GPU architecture (gfx version) seen by the ROCm runtime |
| `ROCM_MODEL_DIR` | none | Default directory for model discovery |
## Upstream Dependencies

- **go-inference** defines the TextModel/Backend interfaces this package implements
- **go-ml** will wrap this backend (Virgil creates backend_rocm.go when the API is ready)
- **go-i18n** may use this for batch classification on Linux (Phase 4)

## Workflow

1. Virgil in core/go writes tasks here after research
2. This repo's session (on Linux homelab) picks up tasks in phase order
3. Mark `[x]` when done, note commit hash
4. New discoveries → add tasks, flag in FINDINGS.md
28
backend.go
Normal file
//go:build linux && amd64

package rocm

import (
	"errors"

	"forge.lthn.ai/core/go-inference"
)

// rocmBackend implements inference.Backend for AMD ROCm GPUs.
// Uses llama-server (llama.cpp built with HIP) as the inference engine.
type rocmBackend struct{}

func (b *rocmBackend) Name() string { return "rocm" }

func (b *rocmBackend) Available() bool {
	// TODO: Check for ROCm runtime + GPU presence
	// - /dev/kfd exists (ROCm kernel driver)
	// - rocm-smi detects a GPU
	// - llama-server binary is findable
	return false // Stub until Phase 1 implementation
}

func (b *rocmBackend) LoadModel(path string, opts ...inference.LoadOption) (inference.TextModel, error) {
	// TODO: Phase 1 implementation
	// 1. Find llama-server binary (PATH or configured location)
	// 2. Spawn llama-server with --model path --port <free> --n-gpu-layers cfg.GPULayers
	// 3. Wait for health endpoint to respond
	// 4. Return rocmModel wrapping the HTTP client

	// Stub: return an error rather than a nil model with a nil error,
	// so callers that check err never dereference a nil TextModel.
	return nil, errors.New("rocm: LoadModel not implemented")
}
7
go.mod
Normal file
module forge.lthn.ai/core/go-rocm

go 1.25.5

require forge.lthn.ai/core/go-inference v0.0.0

replace forge.lthn.ai/core/go-inference => ../go-inference
12
register_rocm.go
Normal file
//go:build linux && amd64

package rocm

import "forge.lthn.ai/core/go-inference"

func init() {
	inference.Register(&rocmBackend{})
}

// ROCmAvailable reports whether this build includes the ROCm backend
// (linux/amd64). It does not probe for a GPU; the backend's Available()
// method is responsible for that.
func ROCmAvailable() bool { return true }
25
rocm.go
Normal file
// Package rocm provides AMD ROCm GPU inference for Linux.
//
// This package implements the inference.Backend and inference.TextModel interfaces
// using llama.cpp compiled with HIP/ROCm for AMD GPUs (RDNA 3+).
//
// # Quick Start
//
//	import (
//		"forge.lthn.ai/core/go-inference"
//		_ "forge.lthn.ai/core/go-rocm" // auto-registers ROCm backend
//	)
//
//	m, err := inference.LoadModel("/path/to/model.gguf")
//	if err != nil {
//		log.Fatal(err)
//	}
//	defer m.Close()
//	for tok := range m.Generate(ctx, "Hello", inference.WithMaxTokens(128)) {
//		fmt.Print(tok.Text)
//	}
//
// # Requirements
//
//   - Linux (amd64)
//   - AMD GPU with ROCm support (RDNA 3+; target hardware is RX 7800 XT / gfx1101)
//   - ROCm 6.x+ installed
//   - llama-server binary (from llama.cpp built with -DGGML_HIP=ON)
package rocm
7
rocm_stub.go
Normal file
//go:build !linux || !amd64

package rocm

// ROCmAvailable reports whether ROCm GPU inference is available.
// Returns false on non-Linux or non-amd64 platforms.
func ROCmAvailable() bool { return false }