2026-02-19 19:39:40 +00:00

# CLAUDE.md

## What This Is

AMD ROCm GPU inference for Linux. Module: `forge.lthn.ai/core/go-rocm`

Implements `inference.Backend` and `inference.TextModel` (from `core/go-inference`) using llama.cpp compiled with HIP/ROCm. Targets AMD RDNA 3+ GPUs.

## Target Hardware

- **GPU**: AMD Radeon RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM) — NOTE: gfx1101, not gfx1100
- **OS**: Ubuntu 24.04 LTS (linux/amd64)
- **ROCm**: 6.x+ (gfx1100/gfx1101 officially supported)
- **Kernel**: 6.10+ recommended for RDNA 3 stability

## Commands

```bash
go test ./...                    # Run all tests (stubs on non-Linux)
go test -tags rocm ./...         # Run with ROCm integration tests

# On the Linux homelab:
go test -v -run TestROCm ./...   # Full GPU tests
```

## Architecture

```
go-rocm (this package)
├── rocm.go               Package doc
├── register_rocm.go      //go:build linux && amd64 — auto-registers via init()
├── rocm_stub.go          //go:build !linux || !amd64 — ROCmAvailable() false
├── backend.go            inference.Backend implementation
├── model.go              inference.TextModel implementation (TODO)
├── server.go             llama-server lifecycle management (TODO)
└── internal/
    └── llamacpp/         llama-server HTTP client (TODO)
        ├── client.go     OpenAI-compatible API client
        └── health.go     Health check + readiness probe
```

### How It Works

1. `LoadModel()` spawns `llama-server` (llama.cpp) as a subprocess
2. llama-server loads the GGUF model onto the AMD GPU via HIP/ROCm
3. `Generate()` / `Chat()` make HTTP requests to llama-server's OpenAI-compatible API
4. Token streaming via SSE (Server-Sent Events) from llama-server
5. `Close()` sends SIGTERM to llama-server, waits for clean exit

This is the subprocess approach (not CGO). It's simpler, more maintainable, and llama.cpp's server mode is battle-tested.
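
The spawn/stop half of that lifecycle could look like this in Go. A minimal sketch assuming standard llama-server flags (`-m`, `--port`, `-ngl`); the helper names and the all-layers `-ngl 99` value are illustrative, not the package's actual API:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// buildServerArgs assembles the llama-server command line.
// `-ngl 99` offloads all layers to the GPU.
func buildServerArgs(modelPath string, port int) []string {
	return []string{
		"-m", modelPath,
		"--port", fmt.Sprintf("%d", port),
		"-ngl", "99",
	}
}

// startServer spawns llama-server as a subprocess (step 1).
func startServer(binary, modelPath string, port int) (*exec.Cmd, error) {
	cmd := exec.Command(binary, buildServerArgs(modelPath, port)...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// stopServer asks for a clean shutdown via SIGTERM (step 5) and
// force-kills only if the server does not exit within the timeout.
func stopServer(cmd *exec.Cmd, timeout time.Duration) error {
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return cmd.Process.Kill()
	}
}

func main() {
	fmt.Println(buildServerArgs("/models/llama3.gguf", 8080))
}
```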

### Dependencies

- `forge.lthn.ai/core/go-inference` — shared TextModel/Backend interfaces
- llama-server binary (external, not a Go dependency) built with `-DGGML_HIP=ON`

## Building llama-server with ROCm

```bash
# On the Linux homelab:
sudo apt install rocm-dev rocm-libs   # ROCm 6.x

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS="gfx1100;gfx1101" \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel -t llama-server

# Binary at build/bin/llama-server
# Copy to /usr/local/bin/ or set ROCM_LLAMA_SERVER_PATH
```

### Performance Tip

The RX 7800 XT is gfx1101, but it can run gfx1100 code objects. Setting:

```bash
export HSA_OVERRIDE_GFX_VERSION=11.0.0
```

...makes the ROCm runtime report the GPU as gfx1100, which gives better performance on some ROCm versions. Benchmark both.
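
If the override wins the benchmark, the server wrapper can inject it into the subprocess environment instead of relying on the shell. A sketch, with the toggle shape invented for illustration:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// withGfxOverride returns the environment for the llama-server
// subprocess, appending HSA_OVERRIDE_GFX_VERSION only when requested,
// so the override can be toggled per benchmark run.
func withGfxOverride(base []string, override bool) []string {
	env := append([]string(nil), base...)
	if override {
		env = append(env, "HSA_OVERRIDE_GFX_VERSION=11.0.0")
	}
	return env
}

func main() {
	cmd := exec.Command("llama-server", "-m", "/models/llama3.gguf")
	cmd.Env = withGfxOverride(os.Environ(), true)
	fmt.Println(cmd.Env[len(cmd.Env)-1]) // the override is appended last
}
```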

## Coding Standards

- UK English
- Tests: testify assert/require
- Conventional commits
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
- Licence: EUPL-1.2

## Coordination

- **Virgil** (core/go) is the orchestrator — writes tasks here
- **go-mlx Claude** is the sibling — Metal backend on macOS, same interface contract
- **go-inference** defines the shared TextModel/Backend interfaces both backends implement
- **go-ml** wraps both backends into the scoring engine

## Task Queue

See `TODO.md` for prioritised work.
See `FINDINGS.md` for research notes.