go-rocm/FINDINGS.md
Snider aa42cff417 feat: scaffold go-rocm AMD GPU inference package
Implements inference.Backend via llama-server subprocess (llama.cpp + HIP/ROCm).
Targets RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM).

Includes:
- Backend registration with build tags (linux/amd64)
- Stub backend.go with llama-server lifecycle outline
- CLAUDE.md with build instructions for llama.cpp + ROCm
- TODO.md with 5-phase task queue
- FINDINGS.md with hardware specs, VRAM budget, design rationale

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 19:39:40 +00:00

FINDINGS.md — go-rocm Research & Discovery


2026-02-19: Package Creation (Virgil)

Hardware

  • GPU: AMD Radeon RX 7800 XT
  • Architecture: RDNA 3, gfx1101
  • VRAM: 16GB GDDR6
  • Compute Units: 60
  • OS: Linux (Ubuntu, homelab machine)

ROCm Support Status

  • gfx1100/gfx1101 officially supported in ROCm 6.x+
  • Supported on Ubuntu 24.04.3 and 22.04.5
  • Kernel 6.10+ recommended for RDNA 3 stability
  • /dev/kfd device node required (amdgpu kernel driver)
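
The /dev/kfd requirement above can be checked before attempting to start a backend. The following is a minimal sketch, not the package's actual probe: the helper name checkDeviceNode is hypothetical, and it only confirms that the path exists and is a device node. It does not verify that the current user has permission to open it.

```go
package main

import (
	"fmt"
	"os"
)

// checkDeviceNode reports an error unless path exists and is a device node.
// Hypothetical helper: the real backend may probe availability differently.
func checkDeviceNode(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat %s: %w", path, err)
	}
	if info.Mode()&os.ModeDevice == 0 {
		return fmt.Errorf("%s exists but is not a device node", path)
	}
	return nil
}

func main() {
	// /dev/kfd is created by the amdgpu kernel driver.
	if err := checkDeviceNode("/dev/kfd"); err != nil {
		fmt.Println("ROCm unavailable:", err)
		return
	}
	fmt.Println("/dev/kfd present")
}
```

A backend's availability check could call this once at registration time and report itself as unavailable rather than failing later at model load.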

llama.cpp + ROCm

llama.cpp has mature ROCm/HIP support. Build flags:

cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release

Key findings:

  • RX 7800 XT is gfx1101, but ROCm compiler generates identical code for gfx1100
  • HSA_OVERRIDE_GFX_VERSION=11.0.0 may give better performance (benchmark needed)
  • rocWMMA flash attention (-DGGML_HIP_ROCWMMA_FATTN=ON) available for RDNA 3+
  • Docker images may not support hipBLASLt for gfx1100, falling back to hipBLAS
  • llama-server provides OpenAI-compatible API with SSE streaming
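
Since llama-server streams completions as server-sent events in the OpenAI chat format, the Go side mostly has to decode "data:" lines. The sketch below assumes the usual OpenAI streaming chunk shape (choices[0].delta.content, terminated by "data: [DONE]"). The function name parseSSELine and the chunk type are hypothetical, and a real client would read from the HTTP response body rather than a string.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// chunk mirrors only the fields of a streaming chat completion that we need.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// parseSSELine extracts the text delta from one SSE line.
// Lines that are not "data:" fields (blank separators, comments) yield "".
func parseSSELine(line string) (delta string, done bool, err error) {
	if !strings.HasPrefix(line, "data:") {
		return "", false, nil
	}
	payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
	if payload == "[DONE]" {
		return "", true, nil
	}
	var c chunk
	if err := json.Unmarshal([]byte(payload), &c); err != nil {
		return "", false, fmt.Errorf("decode chunk: %w", err)
	}
	if len(c.Choices) == 0 {
		return "", false, nil
	}
	return c.Choices[0].Delta.Content, false, nil
}

func main() {
	// Sample stream standing in for an HTTP response body.
	stream := "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}\n\n" +
		"data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}\n\n" +
		"data: [DONE]\n"

	var out strings.Builder
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		delta, done, err := parseSSELine(sc.Text())
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		if done {
			break
		}
		out.WriteString(delta)
	}
	fmt.Println(out.String()) // prints: Hello
}
```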

Sources:

Design Decision: Subprocess vs CGO

Chose subprocess (llama-server) over direct HIP CGO bindings because:

  1. Maturity: llama-server is battle-tested with millions of users. Direct HIP CGO would take months to reach comparable stability.
  2. Model support: llama.cpp supports 50+ model architectures via GGUF. CGO would start with zero.
  3. Maintenance: llama.cpp team handles ROCm compatibility. We just build the binary.
  4. Isolation: GPU crashes in the subprocess don't take down the Go process.
  5. Portability: Same approach works for NVIDIA (CUDA build), Intel (SYCL build) with minimal code changes.
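
To make the subprocess approach concrete, here is a rough sketch of how the command line and environment for llama-server might be assembled. ServerConfig, serverArgs and serverEnv are hypothetical names, not the package's API. The flags shown (--model, --host, --port, --n-gpu-layers, --ctx-size) are standard llama-server options, and HSA_OVERRIDE_GFX_VERSION is only set when a value is supplied. The command is built but deliberately not started.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// ServerConfig is a hypothetical bundle of launch settings.
type ServerConfig struct {
	BinaryPath  string // path to the llama-server binary
	ModelPath   string // GGUF model file
	Host        string
	Port        int
	GPULayers   int    // layers to offload to the GPU
	ContextSize int    // context window in tokens
	GFXOverride string // optional HSA_OVERRIDE_GFX_VERSION value
}

// serverArgs builds the llama-server argument list from the config.
func serverArgs(c ServerConfig) []string {
	return []string{
		"--model", c.ModelPath,
		"--host", c.Host,
		"--port", strconv.Itoa(c.Port),
		"--n-gpu-layers", strconv.Itoa(c.GPULayers),
		"--ctx-size", strconv.Itoa(c.ContextSize),
	}
}

// serverEnv copies base and appends the GFX override when one is set.
func serverEnv(base []string, gfxOverride string) []string {
	env := append([]string{}, base...)
	if gfxOverride != "" {
		env = append(env, "HSA_OVERRIDE_GFX_VERSION="+gfxOverride)
	}
	return env
}

func main() {
	cfg := ServerConfig{
		BinaryPath:  "llama-server",
		ModelPath:   "model.gguf",
		Host:        "127.0.0.1",
		Port:        8080,
		GPULayers:   99,
		ContextSize: 4096,
		GFXOverride: "11.0.0",
	}
	cmd := exec.Command(cfg.BinaryPath, serverArgs(cfg)...)
	cmd.Env = serverEnv(os.Environ(), cfg.GFXOverride)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	// Not started here; a real backend would call cmd.Start() and later cmd.Wait().
	fmt.Println(cmd.String())
}
```

Swapping in a CUDA or SYCL build would, under this sketch, only change BinaryPath and the environment, which is the portability argument in point 5.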

Trade-offs:

  • Subprocess adds ~50ms latency for first token (process startup + model load)
  • Inter-process communication overhead (HTTP vs in-process)
  • Can't share GPU memory between Go process and llama-server
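
Because startup and model load happen in a separate process, the Go side needs some way to know when the server is ready. One possible approach, sketched below, polls llama-server's /health endpoint until it returns 200 OK. The function name waitHealthy, the URL and the timings are assumptions for illustration only.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitHealthy polls url until it answers 200 OK or the timeout expires.
// interval must be greater than zero.
func waitHealthy(url string, timeout, interval time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		// Not ready yet (connection refused or non-200): wait and retry.
		select {
		case <-ctx.Done():
			return fmt.Errorf("server not healthy after %s: %w", timeout, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	start := time.Now()
	// Assumed address; matches a llama-server started on 127.0.0.1:8080.
	err := waitHealthy("http://127.0.0.1:8080/health", 2*time.Second, 100*time.Millisecond)
	fmt.Printf("ready=%v after %s\n", err == nil, time.Since(start).Round(time.Millisecond))
}
```

Timing this call is also a simple way to measure the startup cost listed in the first trade-off.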

The go-mlx package uses direct CGO because MLX is a C library designed for embedding. llama.cpp's primary API is its server mode.

VRAM Budget (16GB)

| Model | Quant | VRAM (model) | Context (4K) | Total | Fits? |
|---|---|---|---|---|---|
| Qwen3-8B | Q4_K_M | ~5GB | ~0.5GB | ~5.5GB | Yes |
| Gemma3-4B | Q4_K_M | ~3GB | ~0.3GB | ~3.3GB | Yes |
| Llama3-8B | Q4_K_M | ~5GB | ~0.5GB | ~5.5GB | Yes |
| Qwen3-8B | Q8_0 | ~9GB | ~0.5GB | ~9.5GB | Yes |
| Llama3-70B | Q4_K_M | ~40GB | ~2GB | ~42GB | No (partial offload) |

16GB VRAM comfortably runs any 8B model in Q4 or Q8 quantisation. 13B models fit in Q4. Larger models need partial GPU offload (GPULayers option).
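
The budget arithmetic above is simple enough to sketch in code. The following is a rough estimate only: the type Estimate and the function gpuLayersFor are hypothetical, the layer calculation assumes weights are spread evenly across layers, and no headroom is reserved for the desktop or driver. The 80-layer figure for Llama3-70B is the model's transformer block count.

```go
package main

import "fmt"

// Estimate holds approximate VRAM needs in GB, as in the table above.
type Estimate struct {
	ModelGB   float64 // quantised weights
	ContextGB float64 // KV cache at the chosen context length
}

// TotalGB is the combined weight and context footprint.
func (e Estimate) TotalGB() float64 { return e.ModelGB + e.ContextGB }

// Fits reports whether the whole model and context fit in budgetGB.
func (e Estimate) Fits(budgetGB float64) bool { return e.TotalGB() <= budgetGB }

// gpuLayersFor estimates how many of totalLayers can be offloaded,
// assuming each layer takes an equal share of ModelGB (a simplification).
func gpuLayersFor(e Estimate, totalLayers int, budgetGB float64) int {
	usable := budgetGB - e.ContextGB
	if usable <= 0 || e.ModelGB <= 0 {
		return 0
	}
	n := int(usable * float64(totalLayers) / e.ModelGB)
	if n > totalLayers {
		n = totalLayers
	}
	return n
}

func main() {
	const budgetGB = 16
	qwen := Estimate{ModelGB: 5, ContextGB: 0.5}
	llama70 := Estimate{ModelGB: 40, ContextGB: 2}

	fmt.Println(qwen.TotalGB(), qwen.Fits(budgetGB))       // prints: 5.5 true
	fmt.Println(llama70.TotalGB(), llama70.Fits(budgetGB)) // prints: 42 false
	fmt.Println(gpuLayersFor(llama70, 80, budgetGB))       // prints: 28
}
```

A value computed this way would be a starting point for the GPULayers option, to be tuned against real memory readings.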


2026-02-19: Sibling Architecture (go-mlx comparison)

| Aspect | go-mlx (macOS) | go-rocm (Linux) |
|---|---|---|
| GPU | Apple Metal (M-series) | AMD ROCm (RDNA 3) |
| Build tag | darwin && arm64 | linux && amd64 |
| Approach | Direct CGO (mlx-c) | Subprocess (llama-server) |
| Model format | Safetensors | GGUF |
| Shared interface | go-inference.TextModel | go-inference.TextModel |
| Memory control | SetCacheLimit, GetActiveMemory | rocm-smi / HIP API |
| Chat templates | Built into model code | llama-server --chat-template |

Both register as inference.Backend via build-tagged init(). go-ml wraps both transparently.
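
The registration pattern can be illustrated with a small self-contained sketch. Everything here is hypothetical: the Backend interface stands in for inference.Backend, whose real method set is not shown in these notes, and Register and Names are illustrative helpers. In the real package the file would carry a linux && amd64 build constraint so that init() only runs on supported platforms. The constraint is omitted here so the example runs anywhere.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Backend is a hypothetical stand-in for inference.Backend.
type Backend interface {
	Name() string
	Available() bool
}

var (
	mu       sync.RWMutex
	backends = map[string]Backend{}
)

// Register adds a backend to the global registry, keyed by name.
func Register(b Backend) {
	mu.Lock()
	defer mu.Unlock()
	backends[b.Name()] = b
}

// Names returns the registered backend names in sorted order.
func Names() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(backends))
	for n := range backends {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// rocmBackend is a placeholder; a real one would manage llama-server.
type rocmBackend struct{}

func (rocmBackend) Name() string { return "rocm" }

// Available always returns false in this sketch; a real check might probe /dev/kfd.
func (rocmBackend) Available() bool { return false }

// init runs at program start; with a build constraint on the file,
// the backend is registered only on matching platforms.
func init() { Register(rocmBackend{}) }

func main() {
	fmt.Println(Names()) // prints: [rocm]
}
```

With one such file per platform, a wrapper like go-ml can look up whichever backend was compiled in without platform-specific code of its own.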