From 68bc7300aad31bf7890956bcb3edc5afabec74d5 Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 19 Feb 2026 19:57:14 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20Phase=200=20complete=20=E2=80=94=20envi?=
 =?UTF-8?q?ronment=20validated,=20llama-server=20built?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- ROCm 7.2, gfx1100 (corrected from gfx1101), kernel 6.17
- llama-server built with HIP from llama.cpp 11c325c
- Gemma3-4B baseline: 109 tok/s decode, 396 tok/s prefill
- Critical: iGPU crash requires HIP_VISIBLE_DEVICES=0
- All Phase 0 tasks marked done

Co-Authored-By: Virgil
---
 CLAUDE.md   |  2 +-
 FINDINGS.md | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 TODO.md     | 15 ++++----
 3 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 7beebdc..c4e1efa 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -8,7 +8,7 @@ Implements `inference.Backend` and `inference.TextModel` (from `core/go-inferenc
 
 ## Target Hardware
 
-- **GPU**: AMD Radeon RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM)
+- **GPU**: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16GB VRAM) — NOTE: gfx1100 not gfx1101
 - **OS**: Ubuntu 24.04 LTS (linux/amd64)
 - **ROCm**: 6.x+ (gfx1100/gfx1101 officially supported)
 - **Kernel**: 6.10+ recommended for RDNA 3 stability
diff --git a/FINDINGS.md b/FINDINGS.md
index 39ee8e1..bf7c94f 100644
--- a/FINDINGS.md
+++ b/FINDINGS.md
@@ -92,3 +92,83 @@ The go-mlx package uses direct CGO because MLX is a C library designed for embed
 | Chat templates | Built into model code | llama-server `--chat-template` |
 
 Both register as `inference.Backend` via build-tagged `init()`. go-ml wraps both transparently.
+
+---
+
+## 2026-02-19: Phase 0 Environment Validation (Charon)
+
+### Actual Hardware (corrected from Virgil's notes)
+
+- **GPU arch**: gfx1100 (NOT gfx1101 — `rocminfo` confirms)
+- **ROCm version**: 7.2.0 (newer than the 6.x minimum)
+- **Kernel**: 6.17.0-14-generic
+- **`/dev/kfd`**: Present, working
+- **HSA_OVERRIDE_GFX_VERSION**: Not needed — native gfx1100
+
+### llama-server Build
+
+- **Source**: llama.cpp commit `11c325c` (cloned 19 Feb 2026)
+- **Local build path**: `/home/claude/llama.cpp/build/bin/llama-server`
+- **Installed to**: `/usr/local/bin/llama-server`
+- **Build command**:
+  ```bash
+  cmake -B build \
+    -DGGML_HIP=ON \
+    -DAMDGPU_TARGETS=gfx1100 \
+    -DGGML_HIP_ROCWMMA_FATTN=ON \
+    -DCMAKE_BUILD_TYPE=Release
+  cmake --build build --parallel $(nproc) -t llama-server
+  ```
+
+### Critical: iGPU Crash
+
+**The Ryzen 9 9950X has an integrated GPU** that ROCm detects as a second device:
+- Device 0: AMD Radeon RX 7800 XT (gfx1100) — 16GB VRAM (real)
+- Device 1: AMD Radeon Graphics (gfx1100) — reports 100GB free (system RAM, misleading)
+
+llama-server's auto-fit logic tries to split the model across both devices. Loading tensors to Device 1 (iGPU) causes **`ROCm error: unspecified launch failure`** and crashes with a core dump.
+
+**Fix**: Set `HIP_VISIBLE_DEVICES=0` to mask the iGPU. The go-rocm package MUST set this env var before spawning llama-server.
+
+### Baseline Benchmarks — Gemma3-4B-Q4_K_M
+
+| Metric | Value |
+|--------|-------|
+| Model | LEK-Gemma3-4B-Q4_K_M (2.66 GiB) |
+| VRAM used | ~3.4 GiB of 16 GiB |
+| Prefill (prompt) | 396 tok/s (2.5ms/tok) |
+| Decode (generation) | 109 tok/s (9.2ms/tok) |
+| Time to first token | ~40ms (16 token prompt) |
+| Startup time | ~6s (load + warmup) |
+| Context window | 4096 (model supports 131072) |
+| Flash attention | Auto-enabled |
+| Slots | 4 concurrent |
+
+### GGUF Models Available
+
+All at `/data/lem/gguf/` (SMB mount from M3):
+
+| Model | Size | Fits 16GB? |
+|-------|------|-----------|
+| LEK-Gemma3-1B-layered-v2-Q5_K_M | ~0.9G | Yes |
+| LEK-Gemma3-1B-layered-v2-Q8_0 | ~1.4G | Yes |
+| LEK-Gemma3-4B-Q4_K_M | 2.7G | Yes |
+| LEK-Gemma3-12B-Q4_K_M | ~7.5G | Yes |
+| LEK-Gemma3-27B-Q4_K_M | ~16G | Tight |
+| LEK-Llama-3.1-8B-Q4_K_M | ~5G | Yes |
+| LEK-Mistral-7B-v0.3-Q4_K_M | ~4G | Yes |
+| LEK-Qwen-2.5-7B-Q4_K_M | ~4G | Yes |
+
+### Environment Variables for go-rocm
+
+The server.go implementation MUST set these when spawning:
+
+```go
+cmd.Env = append(os.Environ(),
+    "HIP_VISIBLE_DEVICES=0", // Critical: mask iGPU to prevent crash
+)
+```
+
+### Model Path Note
+
+Models are on SMB mount (`/data` = `//10.69.69.108/Data`). For CI/testing, copy a small model locally or use `t.Skip()` when the mount is unavailable.
diff --git a/TODO.md b/TODO.md
index 4d24e4e..90f2d70 100644
--- a/TODO.md
+++ b/TODO.md
@@ -6,10 +6,14 @@ Dispatched from core/go orchestration. Pick up tasks in order.
 
 ## Phase 0: Environment Setup (on Linux homelab)
 
-- [ ] **Install ROCm 6.x** — Follow [ROCm install guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/). Ubuntu 24.04 recommended. Verify with `rocm-smi` showing RX 7800 XT.
-- [ ] **Build llama-server with HIP** — Clone llama.cpp, build with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON`. Verify binary runs: `llama-server --help`.
-- [ ] **Test manual inference** — Download a GGUF model (e.g. Qwen3-8B-Q4_K_M). Run `llama-server --model /path/to/model.gguf -ngl 99`. Test with curl against the OpenAI-compatible API. Record tokens/sec.
-- [ ] **HSA_OVERRIDE_GFX_VERSION benchmark** — Test with `11.0.0` vs `11.0.1` vs unset. The RX 7800 XT is gfx1101 but gfx1100 codegen may be faster. Record results in FINDINGS.md.
+- [x] **Install ROCm 6.x** — ROCm 7.2.0 already installed. `rocm-smi` shows RX 7800 XT (gfx1100). Kernel 6.17.0. (Charon, 19 Feb 2026)
+- [x] **Build llama-server with HIP** — Built from llama.cpp `11c325c`. Installed to `/usr/local/bin/llama-server`. (Charon, 19 Feb 2026)
+- [x] **Test manual inference** — Gemma3-4B-Q4_K_M: 109 tok/s decode, 396 tok/s prefill. See FINDINGS.md for full results. (Charon, 19 Feb 2026)
+- [x] **HSA_OVERRIDE_GFX_VERSION benchmark** — N/A: GPU is actually gfx1100 (not gfx1101 as Virgil noted). No override needed. (Charon, 19 Feb 2026)
+
+### Critical Discovery: iGPU Crash
+
+The Ryzen 9 9950X iGPU shows up as ROCm Device 1, reports 100GB free (system RAM), and crashes llama-server when it tries to split tensors across devices. **`HIP_VISIBLE_DEVICES=0` is REQUIRED** when spawning llama-server. See FINDINGS.md for details.
 
 ## Phase 1: Core Implementation
 
@@ -58,7 +62,8 @@ Download to `/data/models/` (or wherever the homelab stores data):
 | Variable | Default | Purpose |
 |----------|---------|---------|
 | `ROCM_LLAMA_SERVER_PATH` | `llama-server` (PATH lookup) | Path to llama-server binary |
-| `HSA_OVERRIDE_GFX_VERSION` | unset | Override GPU arch for ROCm compiler |
+| `HIP_VISIBLE_DEVICES` | `0` (MUST set) | Mask iGPU — Ryzen 9 iGPU crashes llama-server |
+| `HSA_OVERRIDE_GFX_VERSION` | unset | Not needed (GPU is native gfx1100) |
 | `ROCM_MODEL_DIR` | none | Default directory for model discovery |
 
 ## Upstream Dependencies