From 68bc7300aad31bf7890956bcb3edc5afabec74d5 Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 19 Feb 2026 19:57:14 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20Phase=200=20complete=20=E2=80=94=20envi?=
 =?UTF-8?q?ronment=20validated,=20llama-server=20built?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- ROCm 7.2, gfx1100 (corrected from gfx1101), kernel 6.17
- llama-server built with HIP from llama.cpp 11c325c
- Gemma3-4B baseline: 109 tok/s decode, 396 tok/s prefill
- Critical: iGPU crash requires HIP_VISIBLE_DEVICES=0
- All Phase 0 tasks marked done

Co-Authored-By: Virgil
---
 CLAUDE.md   |  2 +-
 FINDINGS.md | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 TODO.md     | 15 ++++----
 3 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 7beebdc..c4e1efa 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -8,7 +8,7 @@ Implements `inference.Backend` and `inference.TextModel` (from `core/go-inferenc
 
 ## Target Hardware
 
-- **GPU**: AMD Radeon RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM)
+- **GPU**: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16GB VRAM) — NOTE: gfx1100 not gfx1101
 - **OS**: Ubuntu 24.04 LTS (linux/amd64)
 - **ROCm**: 6.x+ (gfx1100/gfx1101 officially supported)
 - **Kernel**: 6.10+ recommended for RDNA 3 stability
diff --git a/FINDINGS.md b/FINDINGS.md
index 39ee8e1..bf7c94f 100644
--- a/FINDINGS.md
+++ b/FINDINGS.md
@@ -92,3 +92,83 @@ The go-mlx package uses direct CGO because MLX is a C library designed for embed
 | Chat templates | Built into model code | llama-server `--chat-template` |
 
 Both register as `inference.Backend` via build-tagged `init()`. go-ml wraps both transparently.
+
+---
+
+## 2026-02-19: Phase 0 Environment Validation (Charon)
+
+### Actual Hardware (corrected from Virgil's notes)
+
+- **GPU arch**: gfx1100 (NOT gfx1101 — `rocminfo` confirms)
+- **ROCm version**: 7.2.0 (newer than the 6.x minimum)
+- **Kernel**: 6.17.0-14-generic
+- **`/dev/kfd`**: Present, working
+- **HSA_OVERRIDE_GFX_VERSION**: Not needed — native gfx1100
+
+### llama-server Build
+
+- **Source**: llama.cpp commit `11c325c` (cloned 19 Feb 2026)
+- **Local build path**: `/home/claude/llama.cpp/build/bin/llama-server`
+- **Installed to**: `/usr/local/bin/llama-server`
+- **Build command**:
+  ```bash
+  cmake -B build \
+    -DGGML_HIP=ON \
+    -DAMDGPU_TARGETS=gfx1100 \
+    -DGGML_HIP_ROCWMMA_FATTN=ON \
+    -DCMAKE_BUILD_TYPE=Release
+  cmake --build build --parallel $(nproc) -t llama-server
+  ```
+
+### Critical: iGPU Crash
+
+**The Ryzen 9 9950X has an integrated GPU** that ROCm detects as a second device:
+- Device 0: AMD Radeon RX 7800 XT (gfx1100) — 16GB VRAM (real)
+- Device 1: AMD Radeon Graphics (gfx1100) — reports 100GB free (system RAM, misleading)
+
+llama-server's auto-fit logic tries to split the model across both devices. Loading tensors to Device 1 (iGPU) causes **`ROCm error: unspecified launch failure`** and crashes with a core dump.
+
+**Fix**: Set `HIP_VISIBLE_DEVICES=0` to mask the iGPU. The go-rocm package MUST set this env var before spawning llama-server.
+
+### Baseline Benchmarks — Gemma3-4B-Q4_K_M
+
+| Metric | Value |
+|--------|-------|
+| Model | LEK-Gemma3-4B-Q4_K_M (2.66 GiB) |
+| VRAM used | ~3.4 GiB of 16 GiB |
+| Prefill (prompt) | 396 tok/s (2.5ms/tok) |
+| Decode (generation) | 109 tok/s (9.2ms/tok) |
+| Time to first token | ~40ms (16 token prompt) |
+| Startup time | ~6s (load + warmup) |
+| Context window | 4096 (model supports 131072) |
+| Flash attention | Auto-enabled |
+| Slots | 4 concurrent |
+
+### GGUF Models Available
+
+All at `/data/lem/gguf/` (SMB mount from M3):
+
+| Model | Size | Fits 16GB? |
+|-------|------|-----------|
+| LEK-Gemma3-1B-layered-v2-Q5_K_M | ~0.9G | Yes |
+| LEK-Gemma3-1B-layered-v2-Q8_0 | ~1.4G | Yes |
+| LEK-Gemma3-4B-Q4_K_M | 2.7G | Yes |
+| LEK-Gemma3-12B-Q4_K_M | ~7.5G | Yes |
+| LEK-Gemma3-27B-Q4_K_M | ~16G | Tight |
+| LEK-Llama-3.1-8B-Q4_K_M | ~5G | Yes |
+| LEK-Mistral-7B-v0.3-Q4_K_M | ~4G | Yes |
+| LEK-Qwen-2.5-7B-Q4_K_M | ~4G | Yes |
+
+### Environment Variables for go-rocm
+
+The server.go implementation MUST set these when spawning:
+
+```go
+cmd.Env = append(os.Environ(),
+    "HIP_VISIBLE_DEVICES=0", // Critical: mask iGPU to prevent crash
+)
+```
+
+### Model Path Note
+
+Models are on SMB mount (`/data` = `//10.69.69.108/Data`). For CI/testing, copy a small model locally or use `t.Skip()` when the mount is unavailable.
diff --git a/TODO.md b/TODO.md
index 4d24e4e..90f2d70 100644
--- a/TODO.md
+++ b/TODO.md
@@ -6,10 +6,14 @@ Dispatched from core/go orchestration. Pick up tasks in order.
 
 ## Phase 0: Environment Setup (on Linux homelab)
 
-- [ ] **Install ROCm 6.x** — Follow [ROCm install guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/). Ubuntu 24.04 recommended. Verify with `rocm-smi` showing RX 7800 XT.
-- [ ] **Build llama-server with HIP** — Clone llama.cpp, build with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON`. Verify binary runs: `llama-server --help`.
-- [ ] **Test manual inference** — Download a GGUF model (e.g. Qwen3-8B-Q4_K_M). Run `llama-server --model /path/to/model.gguf -ngl 99`. Test with curl against the OpenAI-compatible API. Record tokens/sec.
-- [ ] **HSA_OVERRIDE_GFX_VERSION benchmark** — Test with `11.0.0` vs `11.0.1` vs unset. The RX 7800 XT is gfx1101 but gfx1100 codegen may be faster. Record results in FINDINGS.md.
+- [x] **Install ROCm 6.x** — ROCm 7.2.0 already installed. `rocm-smi` shows RX 7800 XT (gfx1100). Kernel 6.17.0. (Charon, 19 Feb 2026)
+- [x] **Build llama-server with HIP** — Built from llama.cpp `11c325c`. Installed to `/usr/local/bin/llama-server`. (Charon, 19 Feb 2026)
+- [x] **Test manual inference** — Gemma3-4B-Q4_K_M: 109 tok/s decode, 396 tok/s prefill. See FINDINGS.md for full results. (Charon, 19 Feb 2026)
+- [x] **HSA_OVERRIDE_GFX_VERSION benchmark** — N/A: GPU is actually gfx1100 (not gfx1101 as Virgil noted). No override needed. (Charon, 19 Feb 2026)
+
+### Critical Discovery: iGPU Crash
+
+The Ryzen 9 9950X iGPU shows up as ROCm Device 1, reports 100GB free (system RAM), and crashes llama-server when it tries to split tensors across devices. **`HIP_VISIBLE_DEVICES=0` is REQUIRED** when spawning llama-server. See FINDINGS.md for details.
 
 ## Phase 1: Core Implementation
 
@@ -58,7 +62,8 @@ Download to `/data/models/` (or wherever the homelab stores data):
 | Variable | Default | Purpose |
 |----------|---------|---------|
 | `ROCM_LLAMA_SERVER_PATH` | `llama-server` (PATH lookup) | Path to llama-server binary |
-| `HSA_OVERRIDE_GFX_VERSION` | unset | Override GPU arch for ROCm compiler |
+| `HIP_VISIBLE_DEVICES` | `0` (MUST set) | Mask iGPU — Ryzen 9 iGPU crashes llama-server |
+| `HSA_OVERRIDE_GFX_VERSION` | unset | Not needed (GPU is native gfx1100) |
 | `ROCM_MODEL_DIR` | none | Default directory for model discovery |
 
 ## Upstream Dependencies