From 1fc4c640ffc339c27294945b477a3f94974ec36e Mon Sep 17 00:00:00 2001
From: Snider <snider@host.uk.com>
Date: Fri, 20 Feb 2026 03:49:54 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20mark=20Phase=204=20test=20coverage=20co?=
 =?UTF-8?q?mplete=20=E2=80=94=20all=20phases=20done?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

25 benchmarks baselined on M3 Ultra, 34 backend tests, 6 race-condition
tests. go test -race clean.

Co-Authored-By: Virgil <virgil@lethean.io>
---
 TODO.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
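
Reviewer note (text here, between the `---` line and the diff, is ignored by `git am`): the SemaphoreBoundary entry in the diff below says the test runs with concurrency=1 and verifies that the maximum number of concurrent calls is 1. A minimal sketch of how such a check can be written is shown here. It is an illustration under assumptions, not the code in score_race_test.go: `runBounded` is a hypothetical stand-in for the scorer's fan-out, and the buffered-channel semaphore is an assumed implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded is a hypothetical helper (not part of the patched repo).
// It starts `jobs` goroutines, lets at most `limit` of them execute
// `work` at the same time, and returns the highest number of
// simultaneous executions it observed.
func runBounded(jobs, limit int, work func()) int64 {
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int64

	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release the slot on exit

			n := inFlight.Add(1)
			// Record the high-water mark with a compare-and-swap loop.
			for {
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}
			work()
			inFlight.Add(-1) // leave the critical region before releasing the slot
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	// Mirrors the boundary case named in the TODO entry: limit 1, so the
	// observed peak must be exactly 1.
	peak := runBounded(20, 1, func() { time.Sleep(time.Millisecond) })
	fmt.Println("peak concurrency:", peak)
}
```

Running a check of this shape under `go test -race` exercises both properties at once: the counter verifies the bound, and the race detector flags any unsynchronised access inside `work`.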

diff --git a/TODO.md b/TODO.md
index 761929b..e16dcdf 100644
--- a/TODO.md
+++ b/TODO.md
@@ -180,12 +180,14 @@ All other consumers (service.go, judge.go, agent.go, expand.go, go-ai tools_ml.g
 
 ---
 
-## Phase 4: Test Coverage
+## Phase 4: Test Coverage — COMPLETE
 
-- [ ] **backend_llama_test.go** — Mock llama-server subprocess. Test: model loading, health checks, process lifecycle.
-- [ ] **backend_mlx_test.go** — After Phase 1 rewrite, test with mock go-inference TextModel.
-- [ ] **score.go race tests** — `go test -race ./...`. Concurrent scoring, semaphore boundaries, context cancellation.
-- [ ] **Benchmark suite** — `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes.
+All 4 test files created and verified with `go test -race ./...`. Commit `09bf403`.
+
+- [x] **backend_llama_test.go** — 20 tests via httptest mock: Name, Available (4 variants), Generate (6 variants incl. context cancellation, empty choices, opts forwarding), Chat (3 variants), Stop, constructor (4 variants), interface compliance.
+- [x] **backend_mlx_test.go** — 8 tests via mock TextModel (no build tag needed): Generate, Chat, Stream, ModelError, Close, ModelAccess, InterfaceCompliance, ConvertOpts.
+- [x] **score_race_test.go** — 6 race-condition tests: ConcurrentSemantic (20 responses, concurrency=4), ConcurrentMixedSuites (semantic+standard+content fan-out), SemaphoreBoundary (concurrency=1, verifies max concurrent==1), ContextCancellation (400 error→nil semantic), HeuristicOnlyNoRace (50 responses), MultiModelConcurrent (4 models×5 concurrent map writes).
+- [x] **benchmark_test.go** — 25 benchmarks: HeuristicScore (5 sizes: 25µs–8.8ms), ExactMatch (4 patterns: 171ns–2.1µs), JudgeExtractJSON (6 variants: 2.5–3.4µs), Judge round-trip (2 suites: ~52µs), ScoreAll (2 modes: 25µs–4.5ms), sub-components (5 heuristic stages: 244ns–88µs). Baselines on M3 Ultra.
 
 ---