docs: Phase 0 complete — environment validated, llama-server built

- ROCm 7.2, gfx1100 (corrected from gfx1101), kernel 6.17
- llama-server built with HIP from llama.cpp 11c325c
- Gemma3-4B baseline: 109 tok/s decode, 396 tok/s prefill
- Critical: iGPU crash requires HIP_VISIBLE_DEVICES=0
- All Phase 0 tasks marked done

Co-Authored-By: Virgil <virgil@lethean.io>
This commit is contained in:
parent
aa42cff417
commit
68bc7300aa
3 changed files with 91 additions and 6 deletions
@@ -8,7 +8,7 @@ Implements `inference.Backend` and `inference.TextModel` (from `core/go-inferenc

## Target Hardware

- **GPU**: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16GB VRAM) — NOTE: gfx1100 not gfx1101
- **OS**: Ubuntu 24.04 LTS (linux/amd64)
- **ROCm**: 6.x+ (gfx1100/gfx1101 officially supported)
- **Kernel**: 6.10+ recommended for RDNA 3 stability
80
FINDINGS.md
@@ -92,3 +92,83 @@ The go-mlx package uses direct CGO because MLX is a C library designed for embed

| Chat templates | Built into model code | llama-server `--chat-template` |

Both register as `inference.Backend` via build-tagged `init()`. go-ml wraps both transparently.
---

## 2026-02-19: Phase 0 Environment Validation (Charon)

### Actual Hardware (corrected from Virgil's notes)

- **GPU arch**: gfx1100 (NOT gfx1101 — `rocminfo` confirms)
- **ROCm version**: 7.2.0 (newer than the 6.x minimum)
- **Kernel**: 6.17.0-14-generic
- **`/dev/kfd`**: Present, working
- **HSA_OVERRIDE_GFX_VERSION**: Not needed — native gfx1100

### llama-server Build

- **Source**: llama.cpp commit `11c325c` (cloned 19 Feb 2026)
- **Local build path**: `/home/claude/llama.cpp/build/bin/llama-server`
- **Installed to**: `/usr/local/bin/llama-server`
- **Build command**:

```bash
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1100 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc) -t llama-server
```

### Critical: iGPU Crash

**The Ryzen 9 9950X has an integrated GPU** that ROCm detects as a second device:

- Device 0: AMD Radeon RX 7800 XT (gfx1100) — 16GB VRAM (real)
- Device 1: AMD Radeon Graphics (gfx1100) — reports 100GB free (system RAM, misleading)

llama-server's auto-fit logic tries to split the model across both devices. Loading tensors to Device 1 (iGPU) causes **`ROCm error: unspecified launch failure`** and crashes with a core dump.

**Fix**: Set `HIP_VISIBLE_DEVICES=0` to mask the iGPU. The go-rocm package MUST set this env var before spawning llama-server.

### Baseline Benchmarks — Gemma3-4B-Q4_K_M

| Metric | Value |
|--------|-------|
| Model | LEK-Gemma3-4B-Q4_K_M (2.66 GiB) |
| VRAM used | ~3.4 GiB of 16 GiB |
| Prefill (prompt) | 396 tok/s (2.5 ms/tok) |
| Decode (generation) | 109 tok/s (9.2 ms/tok) |
| Time to first token | ~40 ms (16-token prompt) |
| Startup time | ~6 s (load + warmup) |
| Context window | 4096 (model supports 131072) |
| Flash attention | Auto-enabled |
| Slots | 4 concurrent |
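As a sanity check, the per-token latencies in the table are just the reciprocals of the throughputs. A minimal Go sketch (the numbers are the table's; the function name is ours):

```go
package main

import "fmt"

// msPerTok converts throughput in tokens/sec to per-token latency in ms.
func msPerTok(tokPerSec float64) float64 {
	return 1000.0 / tokPerSec
}

func main() {
	fmt.Printf("prefill: %.1f ms/tok\n", msPerTok(396)) // ~2.5 ms/tok
	fmt.Printf("decode:  %.1f ms/tok\n", msPerTok(109)) // ~9.2 ms/tok
}
```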

### GGUF Models Available

All at `/data/lem/gguf/` (SMB mount from M3):

| Model | Size | Fits 16GB? |
|-------|------|-----------|
| LEK-Gemma3-1B-layered-v2-Q5_K_M | ~0.9G | Yes |
| LEK-Gemma3-1B-layered-v2-Q8_0 | ~1.4G | Yes |
| LEK-Gemma3-4B-Q4_K_M | 2.7G | Yes |
| LEK-Gemma3-12B-Q4_K_M | ~7.5G | Yes |
| LEK-Gemma3-27B-Q4_K_M | ~16G | Tight |
| LEK-Llama-3.1-8B-Q4_K_M | ~5G | Yes |
| LEK-Mistral-7B-v0.3-Q4_K_M | ~4G | Yes |
| LEK-Qwen-2.5-7B-Q4_K_M | ~4G | Yes |
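The "Fits 16GB?" column can be roughly derived from the baseline above: the 4B model (2.66 GiB on disk) used ~3.4 GiB of VRAM, i.e. about 0.75 GiB of KV-cache/runtime overhead at 4096 context. A rough Go sketch of that heuristic (`fitsVRAM` and the fixed-overhead assumption are ours; real overhead grows with context length and layer count, so treat this as a first filter only):

```go
package main

import "fmt"

// Observed with Gemma3-4B at 4096 context: ~3.4 GiB VRAM for a 2.66 GiB model,
// so roughly 0.75 GiB of overhead. Assumed constant here for simplicity.
const overheadGiB = 0.75
const vramGiB = 16.0

// fitsVRAM reports whether a model of the given on-disk size should fit
// fully offloaded, under the fixed-overhead assumption above.
func fitsVRAM(modelGiB float64) bool {
	return modelGiB+overheadGiB < vramGiB
}

func main() {
	fmt.Println(fitsVRAM(2.7))  // 4B Q4_K_M fits comfortably
	fmt.Println(fitsVRAM(16.0)) // 27B Q4_K_M: "Tight" — may need fewer offloaded layers
}
```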

### Environment Variables for go-rocm

The server.go implementation MUST set these when spawning:

```go
cmd.Env = append(os.Environ(),
    "HIP_VISIBLE_DEVICES=0", // Critical: mask iGPU to prevent crash
)
```
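A fuller sketch of the spawn path, under the constraint above (the binary path, model path, and flags here are illustrative; only `HIP_VISIBLE_DEVICES=0` is the hard requirement):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// llamaServerCmd builds the llama-server command with the iGPU masked.
// serverPath and modelPath are caller-supplied; the flags are illustrative.
func llamaServerCmd(serverPath, modelPath string) *exec.Cmd {
	cmd := exec.Command(serverPath, "--model", modelPath, "-ngl", "99")
	cmd.Env = append(os.Environ(),
		"HIP_VISIBLE_DEVICES=0", // Critical: mask iGPU to prevent crash
	)
	return cmd
}

func main() {
	cmd := llamaServerCmd("/usr/local/bin/llama-server", "model.gguf")
	fmt.Println(cmd.Env[len(cmd.Env)-1]) // the masking entry is always appended last
}
```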

### Model Path Note

Models are on SMB mount (`/data` = `//10.69.69.108/Data`). For CI/testing, copy a small model locally or use `t.Skip()` when the mount is unavailable.
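A minimal sketch of that guard (helper name and path are illustrative; in a real `_test.go` file you would call `t.Skip()` when the check fails):

```go
package main

import (
	"fmt"
	"os"
)

// modelAvailable reports whether the GGUF file can be reached,
// i.e. whether the SMB mount is up or a local copy exists.
func modelAvailable(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Illustrative path; a test would t.Skip() when this is false.
	fmt.Println(modelAvailable("/data/lem/gguf/LEK-Gemma3-4B-Q4_K_M.gguf"))
}
```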

15
TODO.md
@@ -6,10 +6,14 @@ Dispatched from core/go orchestration. Pick up tasks in order.

## Phase 0: Environment Setup (on Linux homelab)

- [x] **Install ROCm 6.x** — ROCm 7.2.0 already installed. `rocm-smi` shows RX 7800 XT (gfx1100). Kernel 6.17.0. (Charon, 19 Feb 2026)
- [x] **Build llama-server with HIP** — Built from llama.cpp `11c325c`. Installed to `/usr/local/bin/llama-server`. (Charon, 19 Feb 2026)
- [x] **Test manual inference** — Gemma3-4B-Q4_K_M: 109 tok/s decode, 396 tok/s prefill. See FINDINGS.md for full results. (Charon, 19 Feb 2026)
- [x] **HSA_OVERRIDE_GFX_VERSION benchmark** — N/A: GPU is actually gfx1100 (not gfx1101 as Virgil noted). No override needed. (Charon, 19 Feb 2026)

### Critical Discovery: iGPU Crash

The Ryzen 9 9950X iGPU shows up as ROCm Device 1, reports 100GB free (system RAM), and crashes llama-server when it tries to split tensors across devices. **`HIP_VISIBLE_DEVICES=0` is REQUIRED** when spawning llama-server. See FINDINGS.md for details.
## Phase 1: Core Implementation
@@ -58,7 +62,8 @@ Download to `/data/models/` (or wherever the homelab stores data):

| Variable | Default | Purpose |
|----------|---------|---------|
| `ROCM_LLAMA_SERVER_PATH` | `llama-server` (PATH lookup) | Path to llama-server binary |
| `HIP_VISIBLE_DEVICES` | `0` (MUST set) | Mask iGPU — Ryzen 9 iGPU crashes llama-server |
| `HSA_OVERRIDE_GFX_VERSION` | unset | Not needed (GPU is native gfx1100) |
| `ROCM_MODEL_DIR` | none | Default directory for model discovery |
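The defaults in the table can be resolved in one place; a sketch under the table's semantics (the helper name is ours, not part of go-rocm):

```go
package main

import (
	"fmt"
	"os"
)

// envOr returns the environment value for key, or def when unset/empty,
// mirroring the defaults column in the table above.
func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func main() {
	server := envOr("ROCM_LLAMA_SERVER_PATH", "llama-server") // PATH lookup by default
	devices := envOr("HIP_VISIBLE_DEVICES", "0")              // MUST remain "0" on this box
	fmt.Println(server, devices)
}
```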

## Upstream Dependencies