docs: Phase 0 complete — environment validated, llama-server built

- ROCm 7.2, gfx1100 (corrected from gfx1101), kernel 6.17
- llama-server built with HIP from llama.cpp 11c325c
- Gemma3-4B baseline: 109 tok/s decode, 396 tok/s prefill
- Critical: iGPU crash requires HIP_VISIBLE_DEVICES=0
- All Phase 0 tasks marked done

Co-Authored-By: Virgil <virgil@lethean.io>
Claude, 2026-02-19 19:57:14 +00:00
commit 68bc7300aa (parent aa42cff417)
GPG key ID: AF404715446AEB41
3 changed files with 91 additions and 6 deletions


@@ -8,7 +8,7 @@ Implements `inference.Backend` and `inference.TextModel` (from `core/go-inferenc
## Target Hardware
- **GPU**: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16GB VRAM) — NOTE: gfx1100 not gfx1101
- **OS**: Ubuntu 24.04 LTS (linux/amd64)
- **ROCm**: 6.x+ (gfx1100/gfx1101 officially supported)
- **Kernel**: 6.10+ recommended for RDNA 3 stability


@@ -92,3 +92,83 @@ The go-mlx package uses direct CGO because MLX is a C library designed for embed
| Chat templates | Built into model code | llama-server `--chat-template` |
Both register as `inference.Backend` via build-tagged `init()`. go-ml wraps both transparently.
---
## 2026-02-19: Phase 0 Environment Validation (Charon)
### Actual Hardware (corrected from Virgil's notes)
- **GPU arch**: gfx1100 (NOT gfx1101 — `rocminfo` confirms)
- **ROCm version**: 7.2.0 (newer than the 6.x minimum)
- **Kernel**: 6.17.0-14-generic
- **`/dev/kfd`**: Present, working
- **HSA_OVERRIDE_GFX_VERSION**: Not needed — native gfx1100
### llama-server Build
- **Source**: llama.cpp commit `11c325c` (cloned 19 Feb 2026)
- **Local build path**: `/home/claude/llama.cpp/build/bin/llama-server`
- **Installed to**: `/usr/local/bin/llama-server`
- **Build command**:
```bash
cmake -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1100 \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
```
### Critical: iGPU Crash
**The Ryzen 9 9950X has an integrated GPU** that ROCm detects as a second device:
- Device 0: AMD Radeon RX 7800 XT (gfx1100) — 16GB VRAM (real)
- Device 1: AMD Radeon Graphics (gfx1100) — reports 100GB free (system RAM, misleading)
llama-server's auto-fit logic tries to split the model across both devices. Loading tensors to Device 1 (iGPU) causes **`ROCm error: unspecified launch failure`** and crashes with a core dump.
**Fix**: Set `HIP_VISIBLE_DEVICES=0` to mask the iGPU. The go-rocm package MUST set this env var before spawning llama-server.
### Baseline Benchmarks — Gemma3-4B-Q4_K_M
| Metric | Value |
|--------|-------|
| Model | LEK-Gemma3-4B-Q4_K_M (2.66 GiB) |
| VRAM used | ~3.4 GiB of 16 GiB |
| Prefill (prompt) | 396 tok/s (2.5 ms/tok) |
| Decode (generation) | 109 tok/s (9.2 ms/tok) |
| Time to first token | ~40 ms (16-token prompt) |
| Startup time | ~6 s (load + warmup) |
| Context window | 4096 (model supports 131072) |
| Flash attention | Auto-enabled |
| Slots | 4 concurrent |
### GGUF Models Available
All at `/data/lem/gguf/` (SMB mount from M3):
| Model | Size | Fits 16GB? |
|-------|------|-----------|
| LEK-Gemma3-1B-layered-v2-Q5_K_M | ~0.9G | Yes |
| LEK-Gemma3-1B-layered-v2-Q8_0 | ~1.4G | Yes |
| LEK-Gemma3-4B-Q4_K_M | 2.7G | Yes |
| LEK-Gemma3-12B-Q4_K_M | ~7.5G | Yes |
| LEK-Gemma3-27B-Q4_K_M | ~16G | Tight |
| LEK-Llama-3.1-8B-Q4_K_M | ~5G | Yes |
| LEK-Mistral-7B-v0.3-Q4_K_M | ~4G | Yes |
| LEK-Qwen-2.5-7B-Q4_K_M | ~4G | Yes |
### Environment Variables for go-rocm
The server.go implementation MUST set these when spawning:
```go
cmd.Env = append(os.Environ(),
"HIP_VISIBLE_DEVICES=0", // Critical: mask iGPU to prevent crash
)
```
### Model Path Note
Models are on SMB mount (`/data` = `//10.69.69.108/Data`). For CI/testing, copy a small model locally or use `t.Skip()` when the mount is unavailable.

TODO.md

@@ -6,10 +6,14 @@ Dispatched from core/go orchestration. Pick up tasks in order.
## Phase 0: Environment Setup (on Linux homelab)
- [x] **Install ROCm 6.x** — ROCm 7.2.0 already installed. `rocm-smi` shows RX 7800 XT (gfx1100). Kernel 6.17.0. (Charon, 19 Feb 2026)
- [x] **Build llama-server with HIP** — Built from llama.cpp `11c325c`. Installed to `/usr/local/bin/llama-server`. (Charon, 19 Feb 2026)
- [x] **Test manual inference** — Gemma3-4B-Q4_K_M: 109 tok/s decode, 396 tok/s prefill. See FINDINGS.md for full results. (Charon, 19 Feb 2026)
- [x] **HSA_OVERRIDE_GFX_VERSION benchmark** — N/A: GPU is actually gfx1100 (not gfx1101 as Virgil noted). No override needed. (Charon, 19 Feb 2026)
### Critical Discovery: iGPU Crash
The Ryzen 9 9950X iGPU shows up as ROCm Device 1, reports 100GB free (system RAM), and crashes llama-server when it tries to split tensors across devices. **`HIP_VISIBLE_DEVICES=0` is REQUIRED** when spawning llama-server. See FINDINGS.md for details.
## Phase 1: Core Implementation
@@ -58,7 +62,8 @@ Download to `/data/models/` (or wherever the homelab stores data):
| Variable | Default | Purpose |
|----------|---------|---------|
| `ROCM_LLAMA_SERVER_PATH` | `llama-server` (PATH lookup) | Path to llama-server binary |
| `HIP_VISIBLE_DEVICES` | `0` (MUST set) | Mask iGPU — Ryzen 9 iGPU crashes llama-server |
| `HSA_OVERRIDE_GFX_VERSION` | unset | Not needed (GPU is native gfx1100) |
| `ROCM_MODEL_DIR` | none | Default directory for model discovery |
## Upstream Dependencies