diff --git a/FINDINGS.md b/FINDINGS.md
index 602ceb1..002b673 100644
--- a/FINDINGS.md
+++ b/FINDINGS.md
@@ -228,3 +228,25 @@ Design and implementation plan reviewed. The layered architecture (internal/llam
 
 ```
 3. **`//go:build rocm` for integration tests** — Good call. Keeps `go test ./...` fast on machines without GPU.
+
+---
+
+## 2026-02-19: Phase 2 Robustness (Charon)
+
+### Concurrent Requests
+
+Tested 3 goroutines calling Generate() simultaneously on the same model (Gemma3-1B, llama-server with default settings). All 3 received output (~0.9s total). llama-server handles concurrency via its slot system — default is 1 slot, so requests are serialised server-side.
+
+For true parallel inference, use the `--parallel N` flag in llama-server (not yet configurable via go-rocm). VRAM cost scales with the number of slots and the context size.
+
+### VRAM Monitoring
+
+Reading sysfs directly (`/sys/class/drm/cardN/device/mem_info_vram_*`) instead of spawning `rocm-smi`. Auto-detects the dGPU by selecting the card with the largest VRAM total:
+- card0 = iGPU (2GB) — Ryzen 9 9950X integrated
+- card1 = dGPU (16GB) — RX 7800 XT
+
+Note: sysfs reads are non-atomic. Total and Used are read separately, so transient inconsistencies are possible under heavy allocation churn. Free is clamped to prevent uint64 underflow.
+
+### lastErr Design Limitation
+
+`rocmModel.lastErr` is a single mutex-protected field shared across all callers. With concurrent Generate/Chat calls, errors can be clobbered (last writer wins). `Err()` is only reliable in single-caller scenarios. This matches the go-inference interface contract (single `Err() error` method), so it's a known limitation, not a bug. Per-call error returns would require an interface change in go-inference.
diff --git a/TODO.md b/TODO.md
index 5275df0..5dd8e1b 100644
--- a/TODO.md
+++ b/TODO.md
@@ -25,11 +25,11 @@ The Ryzen 9 9950X iGPU shows up as ROCm Device 1, reports 100GB free (system RAM
 
 ## Phase 2: Robustness
 
-- [ ] **Server crash recovery** — If llama-server dies mid-generation, detect via process exit, return error via Err(), allow re-load.
-- [ ] **Port conflict handling** — If the random port is taken, retry with a different port.
-- [ ] **Graceful shutdown** — On context cancellation, stop the current request cleanly (close SSE stream), don't kill the server. Only Close() kills the server.
-- [ ] **Memory monitoring** — Use `rocm-smi --showmeminfo vram` or HIP API to report VRAM usage. Expose via package-level functions (like go-mlx's GetActiveMemory).
-- [ ] **Concurrent requests** — llama-server supports concurrent slots. Test with multiple goroutines calling Generate() simultaneously. Document max concurrency.
+- [x] **Server crash recovery** — `server.alive()` detects process exit; Generate/Chat return error immediately if dead. Commits `2c4966e`, `c07f37a`. (Charon, 19 Feb 2026)
+- [x] **Port conflict handling** — `startServer()` retries up to 3 times with a new port on process exit. Only retries on exit, not timeout. Commits `c50a8e9`, `b7342ec`. (Charon, 19 Feb 2026)
+- [x] **Graceful shutdown** — Already worked in Phase 1. Integration test confirms the server survives context cancellation and generates again. Commit `a6e647c`. (Charon, 19 Feb 2026)
+- [x] **Memory monitoring** — `GetVRAMInfo()` reads sysfs, auto-detects the dGPU by largest VRAM. Uint64 underflow guard on Free. Commits `501de83`, `954c570`. (Charon, 19 Feb 2026)
+- [x] **Concurrent requests** — 3 goroutines calling Generate() simultaneously all get output. llama-server serialises via 1 slot (default). Commit `a6e647c`. (Charon, 19 Feb 2026)
 
 ## Phase 3: Model Support