go-rocm/docs/plans/completed/2026-02-19-phase4-performance-design.md
Claude c0b7485129
docs: archive completed phase 1-4 plans
All four phases fully implemented and tested on RX 7800 XT.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-24 19:42:07 +00:00


Phase 4: Performance Design

Approved 19 Feb 2026.

1. Benchmark Suite (rocm_benchmark_test.go)

Go benchmark tests (testing.B) reporting custom metrics via b.ReportMetric(), build-tagged //go:build rocm so they compile only when ROCm support is enabled.

Three benchmarks per model:

  • BenchmarkDecode_{Model} — Generate 128 tokens, report tok/s
  • BenchmarkTTFT_{Model} — Time-to-first-token latency (ns)
  • BenchmarkConcurrent_{Model} — N goroutines generating simultaneously, report aggregate tok/s

Models: Gemma3-4B-Q4_K_M, Llama-3.1-8B-Q4_K_M, Qwen-2.5-7B-Q4_K_M. Each loads once per benchmark function (load time excluded via b.StopTimer/b.StartTimer). VRAM logged after load via GetVRAMInfo().

Run: go test -tags rocm -bench . -benchtime 3x -timeout 600s

2. Flash Attention (Manual)

No code changes. Build a second llama-server without -DGGML_HIP_ROCWMMA_FATTN=ON, run the benchmark suite twice with ROCM_LLAMA_SERVER_PATH pointing at each binary, and record the results in FINDINGS.md.
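The two builds and runs might look like this. A sketch only: the build directory names and binary path are assumptions about the local llama.cpp checkout, and flags beyond GGML_HIP and GGML_HIP_ROCWMMA_FATTN (GPU targets, build type) are omitted.

```shell
# Build A: flash attention via rocWMMA
cmake -B build-fattn -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON
cmake --build build-fattn --target llama-server -j

# Build B: baseline, identical except for the rocWMMA flash-attention flag
cmake -B build-plain -DGGML_HIP=ON
cmake --build build-plain --target llama-server -j

# Run the suite once per binary; record both outputs in FINDINGS.md
ROCM_LLAMA_SERVER_PATH=$PWD/build-fattn/bin/llama-server \
  go test -tags rocm -bench . -benchtime 3x -timeout 600s
ROCM_LLAMA_SERVER_PATH=$PWD/build-plain/bin/llama-server \
  go test -tags rocm -bench . -benchtime 3x -timeout 600s
```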

3. Parallel Slots

Add ParallelSlots int to go-inference's LoadConfig and a WithParallelSlots(n int) LoadOption. Pass --parallel N to llama-server in startServer when > 0. Benchmark concurrent requests with 1 slot vs 4 slots.

4. Testing

The benchmarks are the tests: they produce the numbers. Existing integration tests remain as correctness checks. All results are documented in FINDINGS.md.