docs: Phase 0 complete — environment validated, llama-server built

- ROCm 7.2, gfx1100 (corrected from gfx1101), kernel 6.17
- llama-server built with HIP from llama.cpp 11c325c
- Gemma3-4B baseline: 109 tok/s decode, 396 tok/s prefill
- Critical: iGPU crash requires HIP_VISIBLE_DEVICES=0
- All Phase 0 tasks marked done

Co-Authored-By: Virgil <virgil@lethean.io>
Claude, 2026-02-19 19:57:14 +00:00
commit 68bc7300aa (parent aa42cff417)
GPG key ID: AF404715446AEB41
3 changed files with 91 additions and 6 deletions


@@ -8,7 +8,7 @@ Implements `inference.Backend` and `inference.TextModel` (from `core/go-inferenc
## Target Hardware
- **GPU**: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16GB VRAM) — NOTE: gfx1100 not gfx1101
- **OS**: Ubuntu 24.04 LTS (linux/amd64)
- **ROCm**: 6.x+ (gfx1100/gfx1101 officially supported)
- **Kernel**: 6.10+ recommended for RDNA 3 stability


@@ -92,3 +92,83 @@ The go-mlx package uses direct CGO because MLX is a C library designed for embed
| Chat templates | Built into model code | llama-server `--chat-template` |
Both register as `inference.Backend` via build-tagged `init()`. go-ml wraps both transparently.
---
## 2026-02-19: Phase 0 Environment Validation (Charon)
### Actual Hardware (corrected from Virgil's notes)
- **GPU arch**: gfx1100 (NOT gfx1101 — `rocminfo` confirms)
- **ROCm version**: 7.2.0 (newer than the 6.x minimum)
- **Kernel**: 6.17.0-14-generic
- **`/dev/kfd`**: Present, working
- **HSA_OVERRIDE_GFX_VERSION**: Not needed — native gfx1100
### llama-server Build
- **Source**: llama.cpp commit `11c325c` (cloned 19 Feb 2026)
- **Local build path**: `/home/claude/llama.cpp/build/bin/llama-server`
- **Installed to**: `/usr/local/bin/llama-server`
- **Build command**:
```bash
cmake -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1100 \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
```
### Critical: iGPU Crash
**The Ryzen 9 9950X has an integrated GPU** that ROCm detects as a second device:
- Device 0: AMD Radeon RX 7800 XT (gfx1100) — 16GB VRAM (real)
- Device 1: AMD Radeon Graphics (gfx1100) — reports 100GB free (system RAM, misleading)
llama-server's auto-fit logic tries to split the model across both devices. Loading tensors to Device 1 (iGPU) causes **`ROCm error: unspecified launch failure`** and crashes with a core dump.
**Fix**: Set `HIP_VISIBLE_DEVICES=0` to mask the iGPU. The go-rocm package MUST set this env var before spawning llama-server.
### Baseline Benchmarks — Gemma3-4B-Q4_K_M
| Metric | Value |
|--------|-------|
| Model | LEK-Gemma3-4B-Q4_K_M (2.66 GiB) |
| VRAM used | ~3.4 GiB of 16 GiB |
| Prefill (prompt) | 396 tok/s (2.5 ms/tok) |
| Decode (generation) | 109 tok/s (9.2 ms/tok) |
| Time to first token | ~40 ms (16-token prompt) |
| Startup time | ~6 s (load + warmup) |
| Context window | 4096 (model supports 131072) |
| Flash attention | Auto-enabled |
| Slots | 4 concurrent |
### GGUF Models Available
All at `/data/lem/gguf/` (SMB mount from M3):
| Model | Size | Fits 16GB? |
|-------|------|-----------|
| LEK-Gemma3-1B-layered-v2-Q5_K_M | ~0.9G | Yes |
| LEK-Gemma3-1B-layered-v2-Q8_0 | ~1.4G | Yes |
| LEK-Gemma3-4B-Q4_K_M | 2.7G | Yes |
| LEK-Gemma3-12B-Q4_K_M | ~7.5G | Yes |
| LEK-Gemma3-27B-Q4_K_M | ~16G | Tight |
| LEK-Llama-3.1-8B-Q4_K_M | ~5G | Yes |
| LEK-Mistral-7B-v0.3-Q4_K_M | ~4G | Yes |
| LEK-Qwen-2.5-7B-Q4_K_M | ~4G | Yes |
### Environment Variables for go-rocm
The server.go implementation MUST set these when spawning:
```go
cmd.Env = append(os.Environ(),
"HIP_VISIBLE_DEVICES=0", // Critical: mask iGPU to prevent crash
)
```
### Model Path Note
Models are on SMB mount (`/data` = `//10.69.69.108/Data`). For CI/testing, copy a small model locally or use `t.Skip()` when the mount is unavailable.

TODO.md

@@ -6,10 +6,14 @@ Dispatched from core/go orchestration. Pick up tasks in order.
## Phase 0: Environment Setup (on Linux homelab)
- [x] **Install ROCm 6.x** — ROCm 7.2.0 already installed. `rocm-smi` shows RX 7800 XT (gfx1100). Kernel 6.17.0. (Charon, 19 Feb 2026)
- [x] **Build llama-server with HIP** — Built from llama.cpp `11c325c`. Installed to `/usr/local/bin/llama-server`. (Charon, 19 Feb 2026)
- [x] **Test manual inference** — Gemma3-4B-Q4_K_M: 109 tok/s decode, 396 tok/s prefill. See FINDINGS.md for full results. (Charon, 19 Feb 2026)
- [x] **HSA_OVERRIDE_GFX_VERSION benchmark** — N/A: GPU is actually gfx1100 (not gfx1101 as Virgil noted). No override needed. (Charon, 19 Feb 2026)
### Critical Discovery: iGPU Crash
The Ryzen 9 9950X iGPU shows up as ROCm Device 1, reports 100GB free (system RAM), and crashes llama-server when it tries to split tensors across devices. **`HIP_VISIBLE_DEVICES=0` is REQUIRED** when spawning llama-server. See FINDINGS.md for details.
## Phase 1: Core Implementation
@@ -58,7 +62,8 @@ Download to `/data/models/` (or wherever the homelab stores data):
| Variable | Default | Purpose |
|----------|---------|---------|
| `ROCM_LLAMA_SERVER_PATH` | `llama-server` (PATH lookup) | Path to llama-server binary |
| `HIP_VISIBLE_DEVICES` | `0` (MUST set) | Mask iGPU — Ryzen 9 iGPU crashes llama-server |
| `HSA_OVERRIDE_GFX_VERSION` | unset | Not needed (GPU is native gfx1100) |
| `ROCM_MODEL_DIR` | none | Default directory for model discovery |
## Upstream Dependencies