docs: flag Token.ID and StopTokens interface questions for Virgil

QUESTION: Token.ID always 0 — llama-server SSE doesn't include token IDs
QUESTION: StopTokens []int32 vs llama-server stop []string mismatch

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-19 20:41:53 +00:00


FINDINGS.md — go-rocm Research & Discovery

2026-02-19: Package Creation (Virgil)

Hardware

GPU: AMD Radeon RX 7800 XT
Architecture: RDNA 3, gfx1101
VRAM: 16GB GDDR6
Compute Units: 60
OS: Linux (Ubuntu, homelab machine)

ROCm Support Status

gfx1100/gfx1101 officially supported in ROCm 6.x+
Supported on Ubuntu 24.04.3 and 22.04.5
Kernel 6.10+ recommended for RDNA 3 stability
/dev/kfd device node required (amdgpu kernel driver)

llama.cpp + ROCm

llama.cpp has mature ROCm/HIP support. Build flags:

cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release

Key findings:

RX 7800 XT is gfx1101, but ROCm compiler generates identical code for gfx1100
HSA_OVERRIDE_GFX_VERSION=11.0.0 may give better performance (benchmark needed)
rocWMMA flash attention (-DGGML_HIP_ROCWMMA_FATTN=ON) available for RDNA 3+
Docker images may not support hipBLASLt for gfx1100, falling back to hipBLAS
llama-server provides OpenAI-compatible API with SSE streaming

Design Decision: Subprocess vs CGO

Chose subprocess (llama-server) over direct HIP CGO bindings because:

Maturity: llama-server is battle-tested with millions of users. Direct HIP CGO would take months to reach comparable stability.
Model support: llama.cpp supports 50+ model architectures via GGUF. CGO would start with zero.
Maintenance: llama.cpp team handles ROCm compatibility. We just build the binary.
Isolation: GPU crashes in the subprocess don't take down the Go process.
Portability: The same approach works for NVIDIA (CUDA build) and Intel (SYCL build) with minimal code changes.

Trade-offs:

Subprocess adds ~50ms latency for first token (process startup + model load)
Inter-process communication overhead (HTTP vs in-process)
Can't share GPU memory between Go process and llama-server

The go-mlx package uses direct CGO because MLX is a C library designed for embedding. llama.cpp's primary API is its server mode.

VRAM Budget (16GB)

Model	Quant	VRAM (model)	Context (4K)	Total	Fits?
Qwen3-8B	Q4_K_M	~5GB	~0.5GB	~5.5GB	Yes
Gemma3-4B	Q4_K_M	~3GB	~0.3GB	~3.3GB	Yes
Llama3-8B	Q4_K_M	~5GB	~0.5GB	~5.5GB	Yes
Qwen3-8B	Q8_0	~9GB	~0.5GB	~9.5GB	Yes
Llama3-70B	Q4_K_M	~40GB	~2GB	~42GB	No (partial offload)

16GB VRAM comfortably runs any 8B model in Q4 or Q8 quantisation. 13B models fit in Q4. Larger models need partial GPU offload (GPULayers option).
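The arithmetic behind the "Fits?" column can be sketched as below. This is a rough check only: the `budget` type and the `headroomGB` margin are hypothetical names for illustration, and the figures are the approximate values from the table, not measurements.

```go
package main

// budget describes one row of the VRAM table: model weights plus the
// KV cache for the chosen context length, both in GB.
type budget struct {
	Name      string
	ModelGB   float64
	ContextGB float64
}

// total returns the estimated VRAM footprint in GB.
func (b budget) total() float64 { return b.ModelGB + b.ContextGB }

// fits reports whether the footprint plus a safety margin is within vramGB.
// headroomGB is a hypothetical allowance for compute buffers, not a
// measured value.
func (b budget) fits(vramGB, headroomGB float64) bool {
	return b.total()+headroomGB <= vramGB
}
```

With a 1GB margin, Qwen3-8B Q8_0 (9GB + 0.5GB) passes on a 16GB card, while Llama3-70B Q4_K_M (40GB + 2GB) does not and would need partial offload.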

2026-02-19: Sibling Architecture (go-mlx comparison)

Aspect	go-mlx (macOS)	go-rocm (Linux)
GPU	Apple Metal (M-series)	AMD ROCm (RDNA 3)
Build tag	`darwin && arm64`	`linux && amd64`
Approach	Direct CGO (mlx-c)	Subprocess (llama-server)
Model format	Safetensors	GGUF
Shared interface	`go-inference.TextModel`	`go-inference.TextModel`
Memory control	`SetCacheLimit`, `GetActiveMemory`	`rocm-smi` / HIP API
Chat templates	Built into model code	llama-server `--chat-template`

Both register as inference.Backend via build-tagged init(). go-ml wraps both transparently.
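A minimal sketch of the registration pattern, assuming a simple name-keyed registry. `Backend`, `Register` and `backends` here are local stand-ins; the real inference.Backend interface and its registration call live in go-inference and are not shown in these notes.

```go
package main

// Backend is a stand-in for inference.Backend; the real method set
// is defined in go-inference.
type Backend interface {
	Name() string
}

// backends is a hypothetical registry keyed by backend name.
var backends = map[string]Backend{}

// Register adds a backend to the registry (hypothetical helper).
func Register(b Backend) { backends[b.Name()] = b }

// rocmBackend is a placeholder for the go-rocm implementation.
type rocmBackend struct{}

func (rocmBackend) Name() string { return "rocm" }

// In go-rocm this file would carry the build constraint "linux && amd64",
// so the init below only compiles on that platform. go-mlx would do the
// same under "darwin && arm64".
func init() { Register(rocmBackend{}) }
```

Because init() runs before main, a wrapper such as go-ml could look the backend up by name without importing platform-specific code directly.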

2026-02-19: Phase 0 Environment Validation (Charon)

Actual Hardware (corrected from Virgil's notes)

GPU arch: gfx1100 (NOT gfx1101 — rocminfo confirms)
ROCm version: 7.2.0 (newer than the 6.x minimum)
Kernel: 6.17.0-14-generic
/dev/kfd: Present, working
HSA_OVERRIDE_GFX_VERSION: Not needed — native gfx1100

llama-server Build

Source: llama.cpp commit 11c325c (cloned 19 Feb 2026)
Local build path: /home/claude/llama.cpp/build/bin/llama-server
Installed to: /usr/local/bin/llama-server

Build command:

cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server

Critical: iGPU Crash

The Ryzen 9 9950X has an integrated GPU that ROCm detects as a second device:

Device 0: AMD Radeon RX 7800 XT (gfx1100) — 16GB VRAM (real)
Device 1: AMD Radeon Graphics (gfx1100) — reports 100GB free (system RAM, misleading)

llama-server's auto-fit logic tries to split the model across both devices. Loading tensors onto Device 1 (the iGPU) fails with ROCm error: unspecified launch failure, and the server crashes with a core dump.

Fix: Set HIP_VISIBLE_DEVICES=0 to mask the iGPU. The go-rocm package MUST set this env var before spawning llama-server.

Baseline Benchmarks — Gemma3-4B-Q4_K_M

Metric	Value
Model	LEK-Gemma3-4B-Q4_K_M (2.66 GiB)
VRAM used	~3.4 GiB of 16 GiB
Prefill (prompt)	396 tok/s (2.5ms/tok)
Decode (generation)	109 tok/s (9.2ms/tok)
Time to first token	~40ms (16 token prompt)
Startup time	~6s (load + warmup)
Context window	4096 (model supports 131072)
Flash attention	Auto-enabled
Slots	4 concurrent

GGUF Models Available

All at /data/lem/gguf/ (SMB mount from M3):

Model	Size	Fits 16GB?
LEK-Gemma3-1B-layered-v2-Q5_K_M	~0.9G	Yes
LEK-Gemma3-1B-layered-v2-Q8_0	~1.4G	Yes
LEK-Gemma3-4B-Q4_K_M	2.7G	Yes
LEK-Gemma3-12B-Q4_K_M	~7.5G	Yes
LEK-Gemma3-27B-Q4_K_M	~16G	Tight
LEK-Llama-3.1-8B-Q4_K_M	~5G	Yes
LEK-Mistral-7B-v0.3-Q4_K_M	~4G	Yes
LEK-Qwen-2.5-7B-Q4_K_M	~4G	Yes

Environment Variables for go-rocm

The server.go implementation MUST set these when spawning:

cmd.Env = append(os.Environ(),
    "HIP_VISIBLE_DEVICES=0",  // Critical: mask iGPU to prevent crash
)
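Expanding the snippet above into a compilable form, a sketch might look like this. `serverCommand` and its parameters are hypothetical names, not the actual server.go code; `--model` and `--port` are existing llama-server flags. The command is built but not started.

```go
package main

import (
	"os"
	"os/exec"
	"strconv"
)

// serverCommand builds (but does not start) a llama-server command.
// binPath, modelPath and port are supplied by the caller; nothing here
// is taken from the real server.go.
func serverCommand(binPath, modelPath string, port int) *exec.Cmd {
	cmd := exec.Command(binPath,
		"--model", modelPath,
		"--port", strconv.Itoa(port),
	)
	// Critical: mask the iGPU. The entry is appended last because os/exec
	// uses the last value when Env contains duplicate keys, so an inherited
	// HIP_VISIBLE_DEVICES cannot re-expose device 1.
	cmd.Env = append(os.Environ(), "HIP_VISIBLE_DEVICES=0")
	return cmd
}
```

Keeping construction separate from cmd.Start() lets a unit test inspect Args and Env without a GPU or a llama-server binary present.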

Model Path Note

Models are on SMB mount (/data = //10.69.69.108/Data). For CI/testing, copy a small model locally or use t.Skip() when the mount is unavailable.
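One way to apply the t.Skip() suggestion is a small helper that tests call before loading a model. The names `modelAvailable` and `requireModel` are hypothetical, and this sketch only checks that the file exists; it does not guard against a mount that is present but unresponsive.

```go
package main

import (
	"os"
	"testing"
)

// modelAvailable reports whether path exists and is a regular file.
func modelAvailable(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode().IsRegular()
}

// requireModel skips the calling test when the model file is missing,
// for example when the SMB mount is absent on a CI runner.
func requireModel(t testing.TB, path string) {
	t.Helper()
	if !modelAvailable(path) {
		t.Skipf("model %s not available; skipping", path)
	}
}
```

A test would start with requireModel(t, "/data/lem/gguf/...") using whichever model file it needs, so the suite reports a skip instead of a failure when the mount is down.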

2026-02-19: Phase 1 Plan Review — Interface Questions

QUESTION: Token.ID not populated by llama-server SSE

llama-server's OpenAI-compatible streaming API (/v1/chat/completions, /v1/completions) does not include token IDs in the default SSE response. The inference.Token struct has ID int32 and Text string — go-rocm will set Text but leave ID as 0 for all tokens.

Token IDs are available via logprobs: true in the request, but this adds overhead and requires parsing the logprobs.tokens field.

Decision needed from Virgil: Does any consumer (go-ml, go-i18n, go-ai) rely on Token.ID? If only Token.Text is used downstream, ID=0 is acceptable for Phase 1. If ID is needed, we'll add logprobs parsing.
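To make the ID=0 behaviour concrete, here is a sketch of parsing one SSE line from /v1/chat/completions into a token. `Token` is a local mirror of the two inference.Token fields described above, `parseSSELine` is a hypothetical helper, and the `chunk` struct covers only the OpenAI-style fields this sketch reads.

```go
package main

import (
	"encoding/json"
	"strings"
)

// Token mirrors the two fields of inference.Token described above.
type Token struct {
	ID   int32
	Text string
}

// chunk covers only the fields read here from an OpenAI-style
// streaming chunk.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// parseSSELine converts one "data: ..." line into a Token. The bool is
// false for non-data lines, the [DONE] sentinel, and chunks without text.
func parseSSELine(line string) (Token, bool, error) {
	if !strings.HasPrefix(line, "data:") {
		return Token{}, false, nil
	}
	payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
	if payload == "" || payload == "[DONE]" {
		return Token{}, false, nil
	}
	var c chunk
	if err := json.Unmarshal([]byte(payload), &c); err != nil {
		return Token{}, false, err
	}
	if len(c.Choices) == 0 || c.Choices[0].Delta.Content == "" {
		return Token{}, false, nil
	}
	// ID is left at its zero value: the default stream carries no token ID.
	return Token{Text: c.Choices[0].Delta.Content}, true, nil
}
```

If Token.ID turns out to be needed, the change would be confined to this parsing step plus the request option that enables logprobs.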

QUESTION: StopTokens type mismatch

GenerateConfig.StopTokens is []int32 (token IDs), but llama-server's OpenAI-compatible API expects "stop" as []string (text sequences). These are fundamentally different — token IDs cannot be mapped to stop strings without a tokeniser.

Options:

1. Ignore StopTokens in go-rocm Phase 1 (no consumer uses it yet)
2. Use llama-server's native /completion endpoint, if it accepts stop tokens by ID (unverified: id_slot selects a server slot, and the documented stop parameter takes strings)
3. Add StopStrings []string to GenerateConfig in go-inference alongside the existing StopTokens []int32, and let each backend use whichever it supports

Decision needed from Virgil: Which approach? Option 3 would be a go-inference interface change. Option 1 is simplest for now — go-rocm silently ignores StopTokens if set.
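For discussion, a sketch of how option 3 could look from the go-rocm side. `StopStrings` is the proposed field and does not exist in go-inference today; `completionRequest` and `buildRequest` are hypothetical names, and the request body shows only a few OpenAI-style fields.

```go
package main

import "encoding/json"

// GenerateConfig shows only the two stop fields. StopStrings is the
// addition proposed in option 3, not an existing go-inference field.
type GenerateConfig struct {
	StopTokens  []int32
	StopStrings []string
}

// completionRequest is a partial OpenAI-style /v1/completions body.
type completionRequest struct {
	Prompt string   `json:"prompt"`
	Stream bool     `json:"stream"`
	Stop   []string `json:"stop,omitempty"`
}

// buildRequest maps the config onto a request body. StopTokens is ignored
// because token IDs cannot be turned into text without a tokeniser.
func buildRequest(prompt string, cfg GenerateConfig) ([]byte, error) {
	return json.Marshal(completionRequest{
		Prompt: prompt,
		Stream: true,
		Stop:   cfg.StopStrings,
	})
}
```

With this shape, option 1 falls out as the case where StopStrings is empty: the "stop" key is omitted and any StopTokens are silently dropped.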
