From 31bf0e88507872e6d46cc522dd567903706e5e97 Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 19 Feb 2026 23:09:56 +0000
Subject: [PATCH] docs: Phase 4 performance design

Benchmark suite (testing.B), parallel slots via go-inference, flash attention manual comparison.

Co-Authored-By: Virgil
---
 .../2026-02-19-phase4-performance-design.md   | 29 +++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 docs/plans/2026-02-19-phase4-performance-design.md

diff --git a/docs/plans/2026-02-19-phase4-performance-design.md b/docs/plans/2026-02-19-phase4-performance-design.md
new file mode 100644
index 0000000..83f2a0c
--- /dev/null
+++ b/docs/plans/2026-02-19-phase4-performance-design.md
@@ -0,0 +1,29 @@
+# Phase 4: Performance Design
+
+Approved 19 Feb 2026.
+
+## 1. Benchmark Suite (`rocm_benchmark_test.go`)
+
+Go benchmark tests (`testing.B`) with `b.ReportMetric()`, build-tagged `//go:build rocm`.
+
+Three benchmarks per model:
+
+- **BenchmarkDecode_{Model}**: Generate 128 tokens, report tok/s
+- **BenchmarkTTFT_{Model}**: Time-to-first-token latency (ns)
+- **BenchmarkConcurrent_{Model}**: N goroutines generating simultaneously, report aggregate tok/s
+
+Models: Gemma3-4B-Q4_K_M, Llama-3.1-8B-Q4_K_M, Qwen-2.5-7B-Q4_K_M. Each loads once per benchmark function (load time excluded via `b.StopTimer`/`b.StartTimer`). VRAM logged after load via `GetVRAMInfo()`.
+
+Run: `go test -tags rocm -bench . -benchtime 3x -timeout 600s`
+
+## 2. Flash Attention (Manual)
+
+No code changes. Build a second llama-server without `-DGGML_HIP_ROCWMMA_FATTN=ON`, run the benchmark suite twice with `ROCM_LLAMA_SERVER_PATH` pointing at each binary, record results in FINDINGS.md.
+
+## 3. Parallel Slots
+
+Add `ParallelSlots int` to go-inference's `LoadConfig` and `WithParallelSlots(n int) LoadOption`. Pass `--parallel N` to llama-server in `startServer` when > 0. Benchmark concurrent requests with 1 slot vs 4 slots.
+
+## 4. Testing
+
+Benchmarks are the tests: they produce the numbers. Existing integration tests remain as correctness checks. All results documented in FINDINGS.md.
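Review note on section 3 (Parallel Slots): the option plumbing could look roughly like the sketch below. It is a minimal illustration, not go-inference's actual code. It assumes `LoadOption` is a functional option over `*LoadConfig`; the `ModelPath` field and the `newLoadConfig` and `serverArgs` helpers are hypothetical stand-ins for whatever `startServer` really does. Only `ParallelSlots`, `WithParallelSlots`, and the `--parallel N` flag come from the design itself.

```go
package main

import (
	"fmt"
	"strconv"
)

// LoadConfig is a stand-in for go-inference's LoadConfig.
// ParallelSlots is the field proposed in section 3; ModelPath is hypothetical.
type LoadConfig struct {
	ModelPath     string
	ParallelSlots int
}

// LoadOption is assumed here to be a functional option over *LoadConfig.
type LoadOption func(*LoadConfig)

// WithParallelSlots requests n llama-server slots.
func WithParallelSlots(n int) LoadOption {
	return func(c *LoadConfig) { c.ParallelSlots = n }
}

// newLoadConfig is a hypothetical helper that applies options in order.
func newLoadConfig(modelPath string, opts ...LoadOption) LoadConfig {
	cfg := LoadConfig{ModelPath: modelPath}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

// serverArgs is a hypothetical helper standing in for the argument-building
// part of startServer. It appends --parallel only when ParallelSlots > 0,
// so the server default applies when the option is unset.
func serverArgs(cfg LoadConfig) []string {
	args := []string{"--model", cfg.ModelPath}
	if cfg.ParallelSlots > 0 {
		args = append(args, "--parallel", strconv.Itoa(cfg.ParallelSlots))
	}
	return args
}

func main() {
	fmt.Println(serverArgs(newLoadConfig("model.gguf")))
	// prints: [--model model.gguf]
	fmt.Println(serverArgs(newLoadConfig("model.gguf", WithParallelSlots(4))))
	// prints: [--model model.gguf --parallel 4]
}
```

Leaving the flag off when the value is zero keeps existing callers unchanged, which also gives the 1-slot vs 4-slot comparison a clean baseline.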