From 197326bd2297b1c291cdd321bb7b72f5f094b27e Mon Sep 17 00:00:00 2001
From: Snider
Date: Thu, 19 Feb 2026 23:28:40 +0000
Subject: [PATCH] docs: mark batch inference API complete in TODO.md

Co-Authored-By: Claude Opus 4.6
---
 TODO.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/TODO.md b/TODO.md
index d56fb60..ac9a404 100644
--- a/TODO.md
+++ b/TODO.md
@@ -45,7 +45,7 @@ Implementation plan: `docs/plans/2026-02-19-backend-abstraction-plan.md`
 
 ## Phase 5: Ecosystem Integration (Virgil wishlist)
 
-- [ ] **Batch inference API** — go-i18n Phase 2a wants ~5K sentences/sec through Gemma3-1B. Single-prompt `Generate(..., WithMaxTokens(1))` works functionally for classification but won't hit 5K/sec. True batch inference (multiple prompts through one forward pass) is needed.
+- [x] **Batch inference API** — ✅ `Classify` (prefill-only, fast path) and `BatchGenerate` (autoregressive) implemented. Added `ForwardMasked` to InternalModel interface, threaded attention masks through Gemma3 and Qwen3 decoders. Mask: `[N, 1, L, L]` combining causal + padding (0=attend, -inf=ignore). Right-padded, sorted by length. Gemma3-1B 4-bit: **152 prompts/s** classify (4 prompts), BatchGenerate produces coherent per-prompt output. Types (`ClassifyResult`, `BatchResult`, `WithLogits`) in go-inference. 6 new tests (3 mask unit, 3 model). Design doc: `docs/plans/2026-02-19-batch-inference-design.md`.
 - [ ] **Inference metrics** — Expose tokens/sec, peak memory, GPU utilisation as structured data. LEM Lab dashboard and go-ai scoring engine both want this.
 - [ ] **Model quantisation awareness** — MLX supports 4-bit and 8-bit quantised models. The loader already handles quantised safetensors (GroupSize, Bits in config).
 - [ ] **Embed-friendly model loading** — Add `Discover(baseDir)` that scans for available models and returns metadata.