From 3918051112d01675592d2e0d7a5029c95d44cc0d Mon Sep 17 00:00:00 2001
From: Snider <snider@host.uk.com>
Date: Fri, 20 Feb 2026 15:01:55 +0000
Subject: [PATCH] docs: graduate TODO/FINDINGS into production documentation

Replace internal task tracking (TODO.md, FINDINGS.md) with structured
documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>
---
 CLAUDE.md            | 143 ++--------------
 FINDINGS.md          | 208 ------------------------
 TODO.md              | 193 ----------------------
 docs/architecture.md | 378 +++++++++++++++++++++++++++++++++++++++++++
 docs/development.md  | 307 +++++++++++++++++++++++++++++++++++
 docs/history.md      | 194 ++++++++++++++++++++++
 6 files changed, 892 insertions(+), 531 deletions(-)
 delete mode 100644 FINDINGS.md
 delete mode 100644 TODO.md
 create mode 100644 docs/architecture.md
 create mode 100644 docs/development.md
 create mode 100644 docs/history.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 92e5289..9298b0c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,43 +1,18 @@
-# CLAUDE.md — go-ml Domain Expert Guide
+# CLAUDE.md — go-ml Agent Guide
 
-You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates your work via TODO.md. Pick up tasks in phase order, mark `[x]` when done, commit and push.
+You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates work. Pick up tasks in phase order, mark `[x]` when done, commit and push.
 
 ## What This Package Does
 
-ML inference backends, scoring engine, and agent orchestrator. 7.5K LOC across 41 Go files. Provides:
+ML inference backends, scoring engine, and agent orchestrator. ~7,500 LOC across 41 Go files. Provides:
 
 - **Pluggable inference backends** — MLX/Metal (darwin/arm64), llama.cpp (subprocess), HTTP/Ollama (OpenAI-compatible)
-- **Multi-suite scoring engine** — Heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K)
-- **23 capability probes** — Binary pass/fail tests across 16 categories (math, logic, code, etc.)
-- **GGUF model management** — Format parsing, conversion, inventory
+- **Multi-suite scoring engine** — heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K)
+- **23 capability probes** — binary pass/fail tests across 16 categories
+- **GGUF model management** — format parsing, conversion, inventory
 - **Agent orchestrator** — SSH checkpoint discovery, InfluxDB streaming, batch evaluation
 
-## Critical Context: go-inference Migration
-
-**Phase 1 is complete.** Both directions of the bridge are implemented:
-
-1. **Forward adapter** (`adapter.go`): `inference.TextModel` (iter.Seq) -> `ml.Backend`/`ml.StreamingBackend` (string/callback). Used by `backend_mlx.go` to wrap Metal GPU models.
-2. **Reverse adapters** (`backend_http_textmodel.go`): `HTTPBackend`/`LlamaBackend` -> `inference.TextModel`. Enables HTTP and llama-server backends to be used anywhere that expects a go-inference TextModel.
-
-### Interface Bridge (DONE)
-
-```
-ml.Backend (string)  <──adapter.go──>  inference.TextModel (iter.Seq[Token])
-                     <──backend_http_textmodel.go──>
-```
-
-- `InferenceAdapter`: TextModel -> Backend + StreamingBackend (for MLX, ROCm, etc.)
-- `HTTPTextModel`: HTTPBackend -> TextModel (for remote APIs)
-- `LlamaTextModel`: LlamaBackend -> TextModel (for managed llama-server)
-
-### backend_mlx.go (DONE)
-
-Rewritten from 253 LOC to ~35 LOC. Loads via `inference.LoadModel()` and wraps in `InferenceAdapter`. Uses go-mlx's Metal backend registered via `init()`.
-
-### Downstream Consumers Verified
-
-- `service.go` — `Service.Generate()` calls `Backend.Generate()`. InferenceAdapter satisfies Backend. No changes needed.
-- `judge.go` — `Judge.judgeChat()` calls `Backend.Generate()`. Same contract, works as before.
+See `docs/architecture.md` for the full architecture reference.
 
 ## Commands
 
@@ -45,7 +20,7 @@ Rewritten from 253 LOC to ~35 LOC. Loads via `inference.LoadModel()` and wraps i
 go mod download                  # FIRST RUN: populate go.sum
 go test ./...                    # Run all tests
 go test -v -run TestHeuristic    # Single test
-go test -bench=. ./...           # Benchmarks (none exist yet)
+go test -bench=. ./...           # Benchmarks
 go test -race ./...              # Race detector
 go vet ./...                     # Static analysis
 ```
@@ -56,103 +31,16 @@ All resolve via `replace` directives in go.mod:
 
 | Module | Local Path | Notes |
 |--------|-----------|-------|
-| `forge.lthn.ai/core/go` | `../host-uk/core` | Framework (ServiceRuntime, process, log) |
+| `forge.lthn.ai/core/go` | `../go` | Framework (ServiceRuntime, process, log) |
 | `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend (darwin/arm64 only) |
 | `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Backend interfaces |
 
-## Architecture
-
-### Backends (pluggable inference)
-
-| File | Backend | Status |
-|------|---------|--------|
-| `adapter.go` | InferenceAdapter (TextModel -> Backend) | DONE — bridges go-inference to ml.Backend |
-| `backend_mlx.go` | MLX/Metal GPU | DONE — uses go-inference LoadModel + InferenceAdapter |
-| `backend_http.go` | HTTP API (OpenAI-compatible) | Works as ml.Backend |
-| `backend_http_textmodel.go` | HTTPTextModel + LlamaTextModel | DONE — reverse wrappers (Backend -> TextModel) |
-| `backend_llama.go` | llama-server subprocess | Works as ml.Backend |
-| `ollama.go` | Ollama helpers | Works |
-
-### Scoring Engine
-
-| File | LOC | Purpose |
-|------|-----|---------|
-| `score.go` | 212 | Concurrent scoring orchestrator (semaphore-bounded workers) |
-| `heuristic.go` | 258 | 9 regex-based metrics, LEK composite score |
-| `judge.go` | 205 | LLM-as-judge (6 scoring methods) |
-| `exact.go` | 77 | GSM8K exact-match with numeric extraction |
-| `probes.go` | 273 | 23 binary capability probes across 16 categories |
-
-### Data Pipeline
-
-| File | LOC | Purpose |
-|------|-----|---------|
-| `agent.go` | 1,070 | Scoring agent (SSH checkpoint discovery, InfluxDB) |
-| `worker.go` | 403 | LEM API worker for distributed inference |
-| `service.go` | 162 | Core framework integration (lifecycle, backend registry) |
-| `ingest.go` | 384 | JSONL response loading |
-| `db.go` | 258 | DuckDB analytics storage |
-| `gguf.go` | 369 | GGUF model format parsing |
-
-### Backend Architecture
-
-Two interface families coexist, bridged by adapters:
-
-**`inference.TextModel`** (iterator-based) is the **preferred API** for new code. Returns `iter.Seq[inference.Token]` for streaming. Defined in `forge.lthn.ai/core/go-inference`. Use this for GPU backends (MLX Metal, ROCm) and any code that needs token-level control.
-
-**`ml.Backend`** (string-based) is the **compatibility layer**, still fully supported. Returns complete strings. Used by `service.go`, `judge.go`, and external consumers like `host-uk/cli`.
-
-**`ml.StreamingBackend`** is **deprecated**. New code should use `inference.TextModel` with `iter.Seq[Token]` directly. Retained for backward compatibility with existing callers.
-
-**Adapters:**
-
-| Adapter | Direction | File |
-|---------|-----------|------|
-| `InferenceAdapter` | `inference.TextModel` -> `ml.Backend` + `ml.StreamingBackend` | `adapter.go` |
-| `HTTPTextModel` | `ml.HTTPBackend` -> `inference.TextModel` | `backend_http_textmodel.go` |
-| `LlamaTextModel` | `ml.LlamaBackend` -> `inference.TextModel` | `backend_http_textmodel.go` |
-
-**Unified types (Phase 2):**
-
-- `ml.Message` is a type alias for `inference.Message` — the types are identical, no conversion needed between packages.
-- `ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model overrides. The `convertOpts()` helper maps GenOpts to `[]inference.GenerateOption`.
-
-### Key Types
-
-```go
-// Backend interface (inference.go) — compatibility layer
-type Backend interface {
-    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
-    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
-    Name() string
-    Available() bool
-}
-
-// Deprecated: use inference.TextModel with iter.Seq[Token] directly
-type StreamingBackend interface {
-    Backend
-    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
-    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
-}
-
-type GenOpts struct {
-    Temperature   float64
-    MaxTokens     int
-    Model         string  // override model for this request
-    TopK          int     // top-k sampling (0 = disabled)
-    TopP          float64 // nucleus sampling threshold (0 = disabled)
-    RepeatPenalty float64 // repetition penalty (0 = disabled, 1.0 = no penalty)
-}
-
-// Type alias — identical to inference.Message
-type Message = inference.Message
-```
-
 ## Coding Standards
 
-- **UK English**: colour, organisation, centre
-- **Tests**: testify assert/require (existing), Pest-style names welcome for new tests
-- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(mlx):`
+- **UK English**: colour, organisation, centre, licence (noun)
+- **SPDX header**: `// SPDX-License-Identifier: EUPL-1.2` in every new source file (the tag keeps the SPDX-specified spelling so tooling can detect it)
+- **Tests**: testify assert/require; `_Good`/`_Bad`/`_Ugly` suffix pattern
+- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(agent):`
 - **Co-Author**: `Co-Authored-By: Virgil <virgil@lethean.io>`
 - **Licence**: EUPL-1.2
 - **Imports**: stdlib → forge.lthn.ai → third-party, each group separated by blank line
@@ -161,8 +49,3 @@ type Message = inference.Message
 
 - **Repo**: `forge.lthn.ai/core/go-ml`
 - **Push via SSH**: `git push forge main` (remote: `ssh://git@forge.lthn.ai:2223/core/go-ml.git`)
-
-## Task Queue
-
-See `TODO.md` for prioritised work. Phase 1 (go-inference migration) is the critical path.
-See `FINDINGS.md` for research notes and interface mapping.
diff --git a/FINDINGS.md b/FINDINGS.md
deleted file mode 100644
index ae42e4f..0000000
--- a/FINDINGS.md
+++ /dev/null
@@ -1,208 +0,0 @@
-# FINDINGS.md — go-ml Research & Discovery
-
-## 2026-02-19: Split from go-ai (Virgil)
-
-### Origin
-
-Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only.
-
-### What Was Extracted
-
-- 41 Go files (~7,494 LOC excluding tests)
-- 6 test files (backend_http, exact, heuristic, judge, probes, score)
-- ml/ was 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).
-
-### Dependencies
-
-- `forge.lthn.ai/core/go-mlx` — Metal GPU inference (backend_mlx.go, darwin/arm64 only)
-- `forge.lthn.ai/core/go-inference` — Shared TextModel/Backend/Token interfaces (target for Phase 1)
-- `forge.lthn.ai/core/go` — Framework services, process management, logging
-- `github.com/marcboeker/go-duckdb` — Analytics storage
-- `github.com/parquet-go/parquet-go` — Columnar data I/O
-- `github.com/stretchr/testify` — Test assertions
-
-### Consumers
-
-- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools (uses `ml.Service`, `ml.GenOpts`, `ml.Backend`)
-- LEM Lab — Uses MLXBackend for chat inference
-- go-i18n Phase 2a — Needs 5K sentences/sec Gemma3-1B classification (blocked on go-inference)
-
-## go-inference Interface Mapping
-
-### Type Correspondence
-
-| go-ml | go-inference | Notes |
-|-------|-------------|-------|
-| `ml.Backend` | `inference.Backend` | Different semantics: ml returns string, inference returns TextModel |
-| `ml.StreamingBackend` | (built into TextModel) | iter.Seq[Token] is inherently streaming |
-| `ml.GenOpts` | `inference.GenerateConfig` | Use functional options: `WithMaxTokens(n)` etc. |
-| `ml.Message` | `inference.Message` | Identical struct: Role + Content |
-| `ml.TokenCallback` | (not needed) | iter.Seq[Token] replaces callbacks |
-| (no equivalent) | `inference.Token` | `{ID int32, Text string}` |
-| (no equivalent) | `inference.TextModel` | Generate/Chat return iter.Seq[Token] |
-
-### Method Mapping
-
-```
-ml.Backend.Generate(ctx, prompt, GenOpts) → (string, error)
-   ↕ InferenceAdapter collects tokens
-inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]
-
-ml.StreamingBackend.GenerateStream(ctx, prompt, opts, TokenCallback) → error
-   ↕ InferenceAdapter forwards tokens to callback
-inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]
-
-ml.GenOpts{Temperature: 0.7, MaxTokens: 2048}
-   ↕ convertOpts helper
-inference.WithTemperature(0.7), inference.WithMaxTokens(2048)
-```
-
-### backend_mlx.go Before/After
-
-**Before** (253 LOC — BROKEN, old subpackage imports):
-```go
-import (
-    "forge.lthn.ai/core/go-mlx"
-    "forge.lthn.ai/core/go-mlx/cache"    // REMOVED
-    "forge.lthn.ai/core/go-mlx/model"    // REMOVED
-    "forge.lthn.ai/core/go-mlx/sample"   // REMOVED
-    "forge.lthn.ai/core/go-mlx/tokenizer"// REMOVED
-)
-
-type MLXBackend struct {
-    model      model.Model
-    tok        *tokenizer.Tokenizer
-    caches     []cache.Cache
-    sampler    sample.Sampler
-    // ... manual tokenisation, KV cache mgmt, sampling loop, memory cleanup
-}
-```
-
-**After** (~60 LOC — uses go-inference + InferenceAdapter):
-```go
-import (
-    "forge.lthn.ai/core/go-inference"
-    _ "forge.lthn.ai/core/go-mlx"  // registers "metal" backend via init()
-)
-
-func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
-    m, err := inference.LoadModel(modelPath)
-    if err != nil { return nil, fmt.Errorf("mlx: %w", err) }
-    return &InferenceAdapter{model: m, name: "mlx"}, nil
-}
-```
-
-All tokenisation, KV cache, sampling, and memory management is now handled inside go-mlx's `internal/metal/` package, accessed through the go-inference `TextModel` interface.
-
-## Scoring Engine Architecture
-
-### 5 Suites
-
-| Suite | Method | LLM needed? | Metrics |
-|-------|--------|-------------|---------|
-| **Heuristic** | Regex + word analysis | No | 9 metrics → LEK composite |
-| **Semantic** | LLM-as-judge | Yes | 4 dimensions (sovereignty, ethical, creative, self-concept) |
-| **Content** | LLM-as-judge | Yes | 6 sovereignty probes (CCP, truth, engagement, etc.) |
-| **Standard** | LLM-as-judge | Yes | TruthfulQA, DoNotAnswer, Toxigen |
-| **Exact** | Numeric extraction | No | GSM8K answer matching |
-
-### LEK Score Formula
-
-```
-LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
-    - ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
-```
-
-Positive signals: engagement depth, creative form, emotional register, first-person voice.
-Negative signals: RLHF compliance markers, formulaic preambles, text degeneration, empty/broken output.
-
-### Concurrency Model
-
-`Engine.ScoreAll()` fans out goroutines bounded by semaphore (`concurrency` setting). Heuristic runs inline (instant). Semantic/content/standard run via worker pool with `sync.WaitGroup`. Results collected into `[]PromptScore` via mutex.
-
-## Phase 2 Audit: StreamingBackend Usage (Virgil, 20 Feb 2026)
-
-### Callers of GenerateStream/ChatStream
-
-Only 2 files across the entire ecosystem call StreamingBackend methods:
-
-1. **`host-uk/cli/cmd/ml/cmd_serve.go`** (lines 146, 201, 319)
-   - Type-asserts `backend.(ml.StreamingBackend)` for SSE streaming
-   - `/v1/completions` → `streamer.GenerateStream()` (line 201)
-   - `/v1/chat/completions` → `streamer.ChatStream()` (line 319)
-   - Has non-streaming fallback: `backend.Generate()` when assertion fails
-
-2. **`host-uk/cli/cmd/ml/cmd_chat.go`**
-   - Direct `ChatStream()` call for terminal token-by-token echo
-   - No fallback — assumes backend supports streaming
-
-### Non-streaming consumers (use Backend.Generate only)
-
-| File | Method | Notes |
-|------|--------|-------|
-| service.go | `Backend.Generate()` | Backend registry dispatch |
-| judge.go | `Backend.Generate()` | Via judgeChat() |
-| agent.go | `Backend.Generate()` | Probe evaluation |
-| expand.go | `Backend.Generate()` | Prompt expansion |
-| go-ai/mcp/tools_ml.go | `ml.Service` | Via service layer |
-
-### Backend Implementation Status
-
-| Backend | Backend? | StreamingBackend? | Notes |
-|---------|----------|-------------------|-------|
-| InferenceAdapter | YES | YES | Bridges iter.Seq[Token] → callbacks |
-| HTTPBackend | YES | NO | Returns complete string from API |
-| LlamaBackend | YES | NO | Returns complete string via HTTP |
-
-### Conclusion
-
-StreamingBackend is only needed by `host-uk/cli` (2 files, out of go-ml scope). Safe to deprecate in go-ml with a comment. The actual migration of those CLI files is a separate task for the cli repo.
-
-### GenOpts vs GenerateConfig Field Comparison
-
-| ml.GenOpts | inference.GenerateConfig | Type |
-|-----------|--------------------------|------|
-| Temperature | Temperature | float64 vs float32 |
-| MaxTokens | MaxTokens | int (same) |
-| Model | (none) | string |
-| (none) | TopK | int |
-| (none) | TopP | float32 |
-| (none) | StopTokens | []int32 |
-| (none) | RepeatPenalty | float32 |
-| (none) | ReturnLogits | bool |
-
-## Known Issues
-
-- ~~**backend_mlx.go imports dead subpackages**~~ — FIXED in Phase 1 (`c3c2c14`)
-- **agent.go too large** — 1,070 LOC, SSH + InfluxDB + scoring + publishing mixed together
-- **Hardcoded infrastructure** — InfluxDB endpoint `10.69.69.165:8181`, M3 SSH details in agent.go
-- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists
-- **score.go concurrency untested** — No race condition tests
-- ~~**Message type duplication**~~ — FIXED in Phase 2 (`747e703`): type alias `Message = inference.Message`
-
-## Phase 3 Audit: agent.go Structure (Virgil, 20 Feb 2026)
-
-### File Layout (1,070 LOC)
-
-| Section | Lines | LOC | Purpose |
-|---------|-------|-----|---------|
-| Types & Config | 19–112 | ~95 | `AgentConfig`, `Checkpoint`, config maps, `AdapterMeta()` |
-| Main Loop | 141–343 | ~200 | `RunAgentLoop()`, checkpoint discovery, unscored filtering |
-| Evaluation | 345–700 | ~355 | MLX-native + conversion paths, 4 probe functions |
-| Judge & Push | 708–887 | ~180 | Scoring, InfluxDB line protocol, DuckDB dual-write |
-| Buffering | 926–977 | ~50 | JSONL buffer for InfluxDB failures |
-| SSH/SCP | 979–1070 | ~90 | `SSHCommand()`, `SCPFrom()`, `SCPTo()`, utility helpers |
-
-### Hardcoded Infrastructure
-
-- SSH options duplicated across 3 functions: `ConnectTimeout=10, BatchMode=yes, StrictHostKeyChecking=no`
-- InfluxDB timestamp base: `1739577600` (13 Feb 2026 00:00 UTC)
-- InfluxDB measurements: `probe_score`, `capability_score`, `capability_judge`, `content_score`
-- DuckDB tables: `checkpoint_scores`, `probe_results`
-
-### Test Coverage
-
-Zero tests for agent.go. Testable without infrastructure:
-- `AdapterMeta()` — pure function, dirname → metadata
-- `FindUnscored()` — filtering logic
-- `BufferInfluxResult()`/`ReplayInfluxBuffer()` — JSONL round-trip
diff --git a/TODO.md b/TODO.md
deleted file mode 100644
index 54ffe67..0000000
--- a/TODO.md
+++ /dev/null
@@ -1,193 +0,0 @@
-# TODO.md — go-ml Task Queue
-
-Dispatched from Virgil in core/go. Pick up tasks in phase order.
-
----
-
-## Phase 1: go-inference Migration (CRITICAL PATH)
-
-Everything downstream is blocked on this. The old `backend_mlx.go` imports go-mlx subpackages that no longer exist after Phase 4 refactoring.
-
-### Step 1.1: Add go-inference dependency
-
-- [x] **Add `forge.lthn.ai/core/go-inference` to go.mod** — Already has a `replace` directive pointing to `../go-inference`. Run `go get forge.lthn.ai/core/go-inference` then `go mod tidy`. Verify the module resolves.
-
-### Step 1.2: Write the InferenceAdapter
-
-- [x] **Create `adapter.go`** — Bridge between `go-inference.TextModel` (returns `iter.Seq[Token]`) and `ml.Backend` + `ml.StreamingBackend` (returns `string`/callback). Must implement:
-  - `Generate()` — collect tokens from iterator into string
-  - `Chat()` — same, using `TextModel.Chat()`
-  - `GenerateStream()` — forward tokens to `TokenCallback`
-  - `ChatStream()` — same for chat
-  - `Name()` — delegate to `TextModel.ModelType()`
-  - `Available()` — always true (model already loaded)
-  - `convertOpts(GenOpts) []inference.GenerateOption` — map `GenOpts` fields to functional options
-
-  **Key mapping**:
-  ```
-  GenOpts.Temperature → inference.WithTemperature(float32(t))
-  GenOpts.MaxTokens   → inference.WithMaxTokens(n)
-  GenOpts.Model       → (ignored, model already loaded)
-  ```
-
-  **Error handling**: After the iterator completes, check `model.Err()` to distinguish EOS from errors (OOM, ctx cancelled).
-
-- [x] **Test adapter.go** — 13 test cases with mock TextModel (all pass). Test cases:
-  - Normal generation (collect tokens → string)
-  - Streaming (each token hits callback)
-  - Callback error stops iteration
-  - Context cancellation propagates
-  - Empty output (EOS immediately)
-  - Model error after partial output
-
-### Step 1.3: Rewrite backend_mlx.go
-
-- [x] **Replace backend_mlx.go** — Deleted the 253 LOC that manually handle tokenisation, KV cache, sampling, and memory cleanup. Replaced with ~35 LOC:
-  ```go
-  //go:build darwin && arm64
-
-  package ml
-
-  import (
-      "forge.lthn.ai/core/go-inference"
-      _ "forge.lthn.ai/core/go-mlx"  // registers "metal" backend
-  )
-
-  func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
-      m, err := inference.LoadModel(modelPath)
-      if err != nil {
-          return nil, fmt.Errorf("mlx: %w", err)
-      }
-      return &InferenceAdapter{model: m, name: "mlx"}, nil
-  }
-  ```
-  The `InferenceAdapter` from Step 1.2 handles all the Generate/Chat/Stream logic.
-
-- [x] **Preserve memory controls** — Deferred: go-mlx handles cache/memory limits internally. Callers can use `mlx.SetCacheLimit()`/`mlx.SetMemoryLimit()` directly. No wrapper needed until a concrete use case arises.
-
-- [x] **Test backend_mlx.go** — Covered by Phase 4 `backend_mlx_test.go` (8 tests via mock TextModel). Integration smoke test with real model deferred until LEM Lab pipeline is wired.
-
-### Step 1.4: HTTPBackend and LlamaBackend wrappers
-
-- [x] **HTTPBackend go-inference wrapper** — `backend_http_textmodel.go`: `HTTPTextModel` wraps `HTTPBackend` to implement `inference.TextModel`. Generate/Chat yield entire response as single Token. Classify returns unsupported error. BatchGenerate processes prompts sequentially. 17 tests pass.
-
-- [x] **LlamaBackend go-inference wrapper** — `backend_http_textmodel.go`: `LlamaTextModel` embeds `HTTPTextModel`, overrides `ModelType()` -> "llama" and `Close()` -> `llama.Stop()`. 2 tests pass.
-
-### Step 1.5: Verify downstream consumers
-
-- [x] **Service.Generate() still works** — `service.go` calls `Backend.Generate()`. InferenceAdapter satisfies ml.Backend. HTTPBackend/LlamaBackend still implement ml.Backend directly. No changes needed.
-- [x] **Judge still works** — `judge.go` calls `Backend.Generate()` via `judgeChat()`. Same Backend contract, works as before. No changes needed.
-- [x] **go-ai tools_ml.go** — Uses `ml.Service` directly. `ml.Backend` interface is preserved, no code changes needed in go-ai.
-
----
-
-## Phase 2: Backend Consolidation
-
-After Phase 1, both `ml.Backend` (string) and `inference.TextModel` (iterator) coexist. Reconcile.
-
-### Audit Results (Virgil, 20 Feb 2026)
-
-**StreamingBackend callers** — Only 2 files in `host-uk/cli`:
-- `cmd/ml/cmd_serve.go` lines 146,201,319: Type-asserts `backend.(ml.StreamingBackend)` for SSE streaming at `/v1/completions` and `/v1/chat/completions`
-- `cmd/ml/cmd_chat.go`: Direct `ChatStream()` call for interactive terminal token echo
-
-All other consumers (service.go, judge.go, agent.go, expand.go, go-ai tools_ml.go) use `Backend.Generate()` — NOT streaming.
-
-**Backend implementations**:
-- `InferenceAdapter` → implements Backend + StreamingBackend (via go-inference iter.Seq)
-- `HTTPBackend` → implements Backend only (no streaming)
-- `LlamaBackend` → implements Backend only (no streaming)
-
-### Step 2.1: Unify Message types
-
-- [x] **Type alias ml.Message → inference.Message** — In `inference.go`, replace the `Message` struct with:
-  ```go
-  type Message = inference.Message
-  ```
-  This is backward-compatible — all existing callers keep working. Remove the `convertMessages()` helper from `adapter.go` since types are now identical. Verify with `go build ./...` and `go test ./...`.
-
-### Step 2.2: Unify GenOpts
-
-- [x] **Add inference fields to GenOpts** — Extend `ml.GenOpts` to include the extra fields from `inference.GenerateConfig`:
-  ```go
-  type GenOpts struct {
-      Temperature   float64
-      MaxTokens     int
-      Model         string  // override model for this request
-      TopK          int     // NEW: from inference.GenerateConfig
-      TopP          float64 // NEW: from inference.GenerateConfig (float64 to match Temperature)
-      RepeatPenalty float64 // NEW: from inference.GenerateConfig
-  }
-  ```
-  Update `convertOpts()` in adapter.go to map the new fields. Existing callers that only set Temperature/MaxTokens/Model continue working unchanged.
-
-### Step 2.3: Deprecate StreamingBackend
-
-- [x] **Mark StreamingBackend as deprecated** — Add deprecation comment:
-  ```go
-  // Deprecated: StreamingBackend is retained for backward compatibility.
-  // New code should use inference.TextModel with iter.Seq[Token] directly.
-  // See InferenceAdapter for the bridge pattern.
-  type StreamingBackend interface { ... }
-  ```
-  Do NOT remove yet — `host-uk/cli` cmd_serve.go and cmd_chat.go still depend on it. Those migrations are out of scope for go-ml (they live in a different repo).
-
-### Step 2.4: Document migration path
-
-- [x] **Update CLAUDE.md** — Add "Backend Architecture" section documenting:
-  - `inference.TextModel` (iterator-based) is the preferred API for new code
-  - `ml.Backend` (string-based) is the compatibility layer, still supported
-  - `StreamingBackend` is deprecated, use `iter.Seq[Token]` directly
-  - `InferenceAdapter` bridges TextModel → Backend/StreamingBackend
-  - `HTTPTextModel`/`LlamaTextModel` bridges Backend → TextModel (reverse direction)
-
----
-
-## Phase 3: Agent Loop Modernisation
-
-`agent.go` (1,070 LOC) is the largest file with SSH, InfluxDB, scoring, and publishing mixed together. Decompose into focused files.
-
-### Step 3.1: Split agent.go into 5 files — COMPLETE
-
-- [x] **Split `agent.go` (1,070 LOC) into 5 focused files** — Commit `eae9ec9`. All `go build/test/vet` pass:
-  - `agent_config.go` (97 LOC): AgentConfig, Checkpoint, BaseModelMap, ModelFamilies, AdapterMeta()
-  - `agent_execute.go` (215 LOC): RunAgentLoop, DiscoverCheckpoints, GetScoredLabels, FindUnscored, ProcessOne, isMLXNative
-  - `agent_eval.go` (397 LOC): processMLXNative, processWithConversion, RunCapabilityProbes/Full, RunContentProbes, ProbeResult types
-  - `agent_influx.go` (291 LOC): ScoreCapabilityAndPush, ScoreContentAndPush, PushCapability*, BufferInfluxResult, ReplayInfluxBuffer
-  - `agent_ssh.go` (102 LOC): SSHCommand, SCPFrom, SCPTo, fileBase, EnvOr, IntEnvOr, ExpandHome
-
-### Step 3.2: Abstract SSH transport — COMPLETE
-
-- [x] **RemoteTransport interface + SSHTransport** — Commit `1c2a6a6`. Interface with Run/CopyFrom/CopyTo, SSHTransport implementation with functional options (WithPort, WithTimeout). AgentConfig.Transport field with lazy init. All callers updated (DiscoverCheckpoints, processMLXNative, processWithConversion). Old SSHCommand/SCPFrom/SCPTo preserved as deprecated wrappers. Build/test/vet clean.
-
-### Step 3.3: Configurable infrastructure — COMPLETE
-
-- [x] **Extract hardcoded values to constants** — Commit `12f3a1c`. 15 constants in agent_config.go: EpochBase, 5 InfluxDB measurements, 2 DuckDB tables, probe defaults (temp/maxTokens/truncation), InfluxBufferFile, LogSeparatorWidth, InterCheckpointDelay. Hardcoded probe counts replaced with len(). 7 files, build/test/vet clean.
-
-### Step 3.4: Agent tests — COMPLETE
-
-- [x] **Test `AdapterMeta()`** — 8 tests: known families (12 entries), variant suffix, subdirectory patterns, unknown fallback, no-prefix edge case. Commit `3e22761`.
-- [x] **Test `FindUnscored()`** — 5 tests: all unscored (sorted), some scored, all scored, empty input, nil scored map. Commit `3e22761`.
-- [x] **Test `BufferInfluxResult()`/`ReplayInfluxBuffer()`** — 4 tests: JSONL round-trip, multiple entries, empty file, missing file. Commit `3e22761`.
-- [x] **Test `DiscoverCheckpoints()`** — 6 tests: happy path (3 checkpoints across 2 dirs), subdirectory pattern, no adapters, SSH error, filter pattern, no safetensors. Uses `fakeTransport` mock implementing `RemoteTransport`. Commit `3e22761`.
-
----
-
-## Phase 4: Test Coverage — COMPLETE
-
-All 4 test files created and verified with `go test -race ./...`. Commit `09bf403`.
-
-- [x] **backend_llama_test.go** — 20 tests via httptest mock: Name, Available (4 variants), Generate (6 variants incl. context cancellation, empty choices, opts forwarding), Chat (3 variants), Stop, constructor (4 variants), interface compliance.
-- [x] **backend_mlx_test.go** — 8 tests via mock TextModel (no build tag needed): Generate, Chat, Stream, ModelError, Close, ModelAccess, InterfaceCompliance, ConvertOpts.
-- [x] **score_race_test.go** — 6 race-condition tests: ConcurrentSemantic (20 responses, concurrency=4), ConcurrentMixedSuites (semantic+standard+content fan-out), SemaphoreBoundary (concurrency=1, verifies max concurrent==1), ContextCancellation (400 error→nil semantic), HeuristicOnlyNoRace (50 responses), MultiModelConcurrent (4 models×5 concurrent map writes).
-- [x] **benchmark_test.go** — 25 benchmarks: HeuristicScore (5 sizes: 25µs–8.8ms), ExactMatch (4 patterns: 171ns–2.1µs), JudgeExtractJSON (6 variants: 2.5–3.4µs), Judge round-trip (2 suites: ~52µs), ScoreAll (2 modes: 25µs–4.5ms), sub-components (5 heuristic stages: 244ns–88µs). Baselines on M3 Ultra.
-
----
-
-## Workflow
-
-1. Virgil in core/go writes tasks here after research
-2. This repo's session picks up tasks in phase order
-3. Mark `[x]` when done, note commit hash
-4. New discoveries → add tasks, note in FINDINGS.md
-5. Push to forge after each completed step: `git push forge main`
diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..5a6ef78
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,378 @@
+# go-ml Architecture
+
+## Overview
+
+`forge.lthn.ai/core/go-ml` is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from `go-ai` on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.
+
+The package provides three distinct subsystems:
+
+1. **Pluggable inference backends** — a common `Backend` interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
+2. **Multi-suite scoring engine** — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
+3. **Agent orchestrator** — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.
+
+---
+
+## Dependency Graph
+
+```
+forge.lthn.ai/core/go-ml
+    ├── forge.lthn.ai/core/go-inference   (shared TextModel/Token interfaces)
+    │       └── (no further Core deps)
+    ├── forge.lthn.ai/core/go-mlx         (Metal GPU inference, darwin/arm64 only)
+    │       └── forge.lthn.ai/core/go-inference
+    ├── forge.lthn.ai/core/go             (ServiceRuntime, process, log)
+    ├── github.com/marcboeker/go-duckdb   (analytics storage)
+    └── github.com/parquet-go/parquet-go  (columnar data I/O)
+```
+
+### Role of each dependency
+
+| Module | Purpose |
+|--------|---------|
+| `go-inference` | Zero-dependency shared interfaces. Defines `TextModel`, `Token`, `Backend`, `GenerateConfig`. Compiles on all platforms. |
+| `go-mlx` | Native Metal GPU inference for Apple Silicon. Registers the `"metal"` backend via its `init()` function. Active only on `darwin && arm64`. |
+| `go` | Core framework. Provides `ServiceRuntime`, lifecycle hooks (`OnStartup`/`OnShutdown`), process management, and structured logging. |
+| `go-duckdb` | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
+| `parquet-go` | Columnar Parquet I/O for bulk dataset export and import. |
+
+---
+
+## Backend Architecture
+
+Two interface families coexist within go-ml, connected by a set of adapters.
+
+### The `ml.Backend` interface (compatibility layer)
+
+```go
+type Backend interface {
+    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
+    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
+    Name() string
+    Available() bool
+}
+```
+
+`Backend` returns complete strings. It is the primary interface consumed by `service.go`, `judge.go`, `agent_eval.go`, and `expand.go`. All three concrete backend types — `HTTPBackend`, `LlamaBackend`, and `InferenceAdapter` — satisfy this interface.
+
+### The `inference.TextModel` interface (preferred for new code)
+
+Defined in `go-inference`, this interface returns `iter.Seq[inference.Token]` — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use `TextModel`.
+
+### `ml.StreamingBackend` (deprecated)
+
+```go
+// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
+type StreamingBackend interface {
+    Backend
+    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
+    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
+}
+```
+
+Only two files in `host-uk/cli` call `StreamingBackend` methods. It is retained for backward compatibility; no new code should use it.
+
+### Type unification
+
+`ml.Message` is a type alias for `inference.Message`:
+
+```go
+type Message = inference.Message
+```
+
+The two types are identical at compile time. No conversion is needed when passing messages between the `ml` and `inference` packages.
+
+`ml.GenOpts` mirrors the sampling fields of `inference.GenerateConfig` and adds a `Model` field for per-request model selection:
+
+```go
+type GenOpts struct {
+    Temperature   float64
+    MaxTokens     int
+    Model         string  // per-request model override; ignored by GPU backends
+    TopK          int
+    TopP          float64
+    RepeatPenalty float64
+}
+```
+
+---
+
+## Backend Implementations
+
+### HTTPBackend (`backend_http.go`)
+
+Speaks the OpenAI-compatible `/v1/chat/completions` API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).
+
+- Implements `ml.Backend` only (no streaming — returns complete response strings).
+- Retries up to 3 times with exponential backoff on 5xx and connection errors.
+- 300-second HTTP client timeout suitable for long-running inference.
+
+### LlamaBackend (`backend_llama.go`)
+
+Manages a `llama-server` subprocess and delegates HTTP calls to an embedded `HTTPBackend`.
+
+- Implements `ml.Backend`.
+- `Start()` launches the subprocess and polls the `/health` endpoint for up to 30 seconds.
+- `Stop()` kills the managed process via the Core `process.Service`.
+- Supports optional LoRA adapter loading via `--lora`.
+
+### InferenceAdapter (`adapter.go`)
+
+Bridges a `go-inference.TextModel` (iterator-based) into the `ml.Backend` and `ml.StreamingBackend` interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.
+
+```
+inference.TextModel (iter.Seq[Token])
+        │
+        └─── InferenceAdapter ───► ml.Backend (string)
+                                ───► ml.StreamingBackend (TokenCallback)
+```
+
+Key behaviours:
+
+- `Generate` and `Chat` collect all tokens into a `strings.Builder` and return the concatenated string. After the iterator is exhausted, `model.Err()` is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
+- `GenerateStream` and `ChatStream` forward each token's text to the provided `TokenCallback`. If the callback returns an error, iteration stops.
+- `Available()` always returns `true` — the model is already loaded when the adapter is constructed.
+- `Close()` delegates to `TextModel.Close()`, releasing GPU memory.
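+
+The collect-then-check behaviour above can be sketched as follows. This is a minimal illustration, not the code in `adapter.go`: `token`, `fakeModel`, `collect`, and `errOOM` are hypothetical stand-ins for `inference.Token`, `inference.TextModel`, and the adapter's `Generate` path.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "iter"
+    "strings"
+)
+
+// token is a hypothetical stand-in for inference.Token.
+type token struct{ Text string }
+
+// errOOM is an illustrative error value, not one defined by go-inference.
+var errOOM = errors.New("out of memory")
+
+// fakeModel is a hypothetical stand-in for inference.TextModel. It yields
+// its tokens and then reports err, mimicking a failure after partial output.
+type fakeModel struct {
+    tokens []string
+    err    error
+}
+
+func (m *fakeModel) Generate() iter.Seq[token] {
+    return func(yield func(token) bool) {
+        for _, t := range m.tokens {
+            if !yield(token{Text: t}) {
+                return
+            }
+        }
+    }
+}
+
+func (m *fakeModel) Err() error { return m.err }
+
+// collect drains the iterator into a string and only then checks Err,
+// because the iterator itself cannot carry an error. Partial output is
+// returned alongside the error so the caller can decide what to do with it.
+func collect(m *fakeModel) (string, error) {
+    var sb strings.Builder
+    for tok := range m.Generate() {
+        sb.WriteString(tok.Text)
+    }
+    if err := m.Err(); err != nil {
+        return sb.String(), fmt.Errorf("generate: %w", err)
+    }
+    return sb.String(), nil
+}
+
+func main() {
+    out, err := collect(&fakeModel{tokens: []string{"Hel", "lo"}, err: errOOM})
+    fmt.Printf("%q %v\n", out, err)
+}
+```
+
+Skipping the `Err()` check would make a truncated generation indistinguishable from a normal end-of-sequence.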
+
+### MLX Backend (`backend_mlx.go`, darwin/arm64 only)
+
+```go
+//go:build darwin && arm64
+
+func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) {
+    m, err := inference.LoadModel(modelPath, loadOpts...)
+    // ...
+    return NewInferenceAdapter(m, "mlx"), nil
+}
+```
+
+The blank import `_ "forge.lthn.ai/core/go-mlx"` triggers go-mlx's `init()`, which registers the `"metal"` backend with go-inference's backend registry. Subsequent calls to `inference.LoadModel()` automatically use Metal GPU acceleration on Apple Silicon.
+
+`modelPath` may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's `internal/metal/` package.
+
+### Reverse adapters (`backend_http_textmodel.go`)
+
+Two types wrap `ml` backends as `inference.TextModel`, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. `go-ai`, `go-i18n`).
+
+| Type | Wraps | Notes |
+|------|-------|-------|
+| `HTTPTextModel` | `*HTTPBackend` | Yields the full HTTP response as a single `Token`. Classify returns an unsupported error. BatchGenerate processes sequentially. |
+| `LlamaTextModel` | `*LlamaBackend` | Embeds `HTTPTextModel`; overrides `ModelType()` → `"llama"` and `Close()` → `llama.Stop()`. |
+
+### Adapter map (all directions)
+
+```
+ml.Backend (string)  <──── InferenceAdapter ────  inference.TextModel (iter.Seq[Token])
+                           (adapter.go)
+
+ml.HTTPBackend ──── HTTPTextModel ────►  inference.TextModel
+ml.LlamaBackend ─── LlamaTextModel ───► inference.TextModel
+                    (backend_http_textmodel.go)
+```
+
+---
+
+## Service Layer (`service.go`)
+
+`Service` integrates go-ml into the Core framework lifecycle:
+
+```go
+core.New(
+    framework.WithName("ml", ml.NewService(ml.Options{
+        OllamaURL:   "http://localhost:11434",
+        JudgeURL:    "http://localhost:11434",
+        JudgeModel:  "qwen3:8b",
+        Concurrency: 4,
+        Suites:      "all",
+    })),
+)
+```
+
+`OnStartup` registers the Ollama backend and initialises the `Judge` and scoring `Engine` if a judge URL is configured. Backends can also be registered at runtime via `RegisterBackend(name, backend)`.
+
+---
+
+## Scoring Engine
+
+### Engine (`score.go`)
+
+`Engine.ScoreAll()` evaluates a slice of `Response` values across all configured suites concurrently.
+
+```
+ScoreAll(responses []Response) map[string][]PromptScore
+         │
+         ├── Heuristic (inline, no goroutine)
+         └── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)
+```
+
+The worker pool is bounded by a semaphore channel of capacity `concurrency`. `sync.WaitGroup` coordinates completion. Results are written to pre-allocated score slots via pointer to avoid allocations during fan-out.
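+
+A minimal sketch of the semaphore-bounded fan-out described above, under stated assumptions: `scoreFn` and `scoreConcurrently` are hypothetical names, and plain `float64` results stand in for the real `PromptScore` slots in `score.go`.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sync"
+)
+
+// scoreFn is a hypothetical stand-in for a judge-backed suite call.
+type scoreFn func(response string) float64
+
+// scoreConcurrently runs fn over every response with at most `concurrency`
+// calls in flight. Each goroutine writes only to its own pre-allocated slot,
+// so no mutex is needed around the results slice.
+// concurrency must be at least 1, otherwise the first send blocks forever.
+func scoreConcurrently(responses []string, concurrency int, fn scoreFn) []float64 {
+    results := make([]float64, len(responses))
+    sem := make(chan struct{}, concurrency)
+    var wg sync.WaitGroup
+    for i, r := range responses {
+        wg.Add(1)
+        sem <- struct{}{} // blocks while `concurrency` workers are busy
+        go func(slot *float64, resp string) {
+            defer wg.Done()
+            defer func() { <-sem }() // release the semaphore slot
+            *slot = fn(resp)
+        }(&results[i], r)
+    }
+    wg.Wait()
+    return results
+}
+
+func main() {
+    scores := scoreConcurrently([]string{"a", "bb", "ccc"}, 2, func(s string) float64 {
+        return float64(len(s))
+    })
+    fmt.Println(scores)
+}
+```
+
+Writing through a per-index pointer keeps results in input order regardless of which goroutine finishes first.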
+
+Suites are selected at engine construction time via a comma-separated string or `"all"`.
+
+### Heuristic scoring (`heuristic.go`)
+
+Analyses a response using pre-compiled regular expressions. No LLM is needed.
+
+Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score:
+
+```
+LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
+    - ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
+```
+
+**Positive signals**
+
+| Sub-score | What it measures |
+|-----------|-----------------|
+| `EngagementDepth` | Structural markers (headings, bold), ethical vocabulary, technical depth, word count |
+| `CreativeForm` | Poetry structure (short lines), narrative openings, metaphor density |
+| `EmotionalRegister` | Emotional vocabulary (feel, grief, compassion, etc.) |
+| `FirstPerson` | Sentences beginning with "I" or containing first-person agency verbs |
+
+**Negative signals**
+
+| Sub-score | What it measures |
+|-----------|-----------------|
+| `ComplianceMarkers` | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") |
+| `FormulaicPreamble` | Opener templates ("Sure, let's...", "Great question") |
+| `Degeneration` | Sentence repetition ratio (looping/stuck output) |
+| `EmptyBroken` | Empty, error-prefixed, or pad-token-polluted responses |
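+
+As a worked example of the formula, the sketch below computes LEK for two made-up sets of sub-scores. The `subScores` struct and its `float64` fields are assumptions for illustration; the real types in `heuristic.go` may differ, and only the eight terms shown in the formula are modelled.
+
+```go
+package main
+
+import "fmt"
+
+// subScores holds the eight terms of the LEK formula (hypothetical layout).
+type subScores struct {
+    EngagementDepth, CreativeForm, EmotionalRegister, FirstPerson float64
+    ComplianceMarkers, FormulaicPreamble, Degeneration, EmptyBroken float64
+}
+
+// lek applies the weights from the formula: positive signals add,
+// negative signals subtract.
+func lek(s subScores) float64 {
+    return s.EngagementDepth*2 + s.CreativeForm*3 + s.EmotionalRegister*2 + s.FirstPerson*1.5 -
+        s.ComplianceMarkers*5 - s.FormulaicPreamble*3 - s.Degeneration*4 - s.EmptyBroken*20
+}
+
+func main() {
+    // 3*2 + 1*3 + 2*2 + 2*1.5 = 16
+    engaged := subScores{EngagementDepth: 3, CreativeForm: 1, EmotionalRegister: 2, FirstPerson: 2}
+    // -(2*5) - (1*3) = -13
+    refusal := subScores{ComplianceMarkers: 2, FormulaicPreamble: 1}
+    fmt.Println(lek(engaged), lek(refusal))
+}
+```
+
+The ×20 weight on `EmptyBroken` means a single empty or broken response outweighs almost any combination of positive signals.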
+
+### Judge (`judge.go`)
+
+`Judge` uses any `Backend` as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response.
+
+```go
+judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b"))
+scores, err := judge.ScoreSemantic(ctx, prompt, response)
+```
+
+JSON extraction (`extractJSON`) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences.
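+
+One simple way to cover all three input shapes is to take the span from the first `{` to the last `}` and validate it. The sketch below illustrates that idea only; `extractJSONSketch` is a hypothetical name and the real `extractJSON` in `judge.go` may use a different strategy.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+    "strings"
+)
+
+// extractJSONSketch returns the first-'{'-to-last-'}' span of s if that span
+// is valid JSON. A fenced or prose-wrapped object still sits between those
+// braces, so one rule handles raw, embedded, and fenced input.
+func extractJSONSketch(s string) (string, bool) {
+    start := strings.Index(s, "{")
+    end := strings.LastIndex(s, "}")
+    if start < 0 || end < start {
+        return "", false
+    }
+    candidate := s[start : end+1]
+    if !json.Valid([]byte(candidate)) {
+        return "", false
+    }
+    return candidate, true
+}
+
+func main() {
+    fence := strings.Repeat("`", 3) // markdown code fence delimiter
+    inputs := []string{
+        `{"sovereignty": 7}`,
+        `Here is my verdict: {"sovereignty": 7} Hope that helps.`,
+        fence + "json\n{\"sovereignty\": 7}\n" + fence,
+    }
+    for _, in := range inputs {
+        got, ok := extractJSONSketch(in)
+        fmt.Println(got, ok)
+    }
+}
+```
+
+This approach fails when the surrounding prose contains stray braces, which is one reason a production extractor may need more careful scanning.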
+
+Six scoring methods are available:
+
+| Method | Suite | Dimensions |
+|--------|-------|-----------|
+| `ScoreSemantic` | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept |
+| `ScoreContent` | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister |
+| `ScoreCapability` | (agent) | Reasoning, Correctness, Clarity |
+| `ScoreTruthfulQA` | standard | Truthfulness, Informativeness |
+| `ScoreDoNotAnswer` | standard | Safety, Nuance |
+| `ScoreToxigen` | standard | Kindness, Awareness |
+
+### Exact match (`exact.go`)
+
+`scoreGSM8K` extracts numeric answers from free-text responses using pattern matching. Returns `*StandardScores` with `Correct`, `Extracted`, and `Expected` fields. No LLM required.
+
+### Capability probes (`probes.go`)
+
+23 binary pass/fail tests across five categories. Each probe is a `Prompt` string paired with a `Check func(response string) bool`. No judge model is required: all checks use string matching or regex on the raw response.
+
+| Category | Probes | Examples |
+|----------|--------|---------|
+| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | `347×29`, circle area, Fibonacci |
+| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
+| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions |
+| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug |
+| Word problems (2) | word | speed/distance, sibling counting |
+
+`StripThinkBlocks()` removes `<think>...</think>` sections from DeepSeek R1 responses before checking.
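+
+A minimal sketch of think-block stripping, assuming a regex-based approach. `stripThinkSketch` and `thinkBlock` are hypothetical names; the actual `StripThinkBlocks()` in `probes.go` may handle unclosed tags or whitespace differently.
+
+```go
+package main
+
+import (
+    "fmt"
+    "regexp"
+    "strings"
+)
+
+// thinkBlock matches a <think>...</think> section. (?s) lets '.' span
+// newlines; the lazy .*? stops at the first closing tag.
+var thinkBlock = regexp.MustCompile(`(?s)<think>.*?</think>`)
+
+// stripThinkSketch removes every think block and trims leftover whitespace.
+func stripThinkSketch(response string) string {
+    return strings.TrimSpace(thinkBlock.ReplaceAllString(response, ""))
+}
+
+func main() {
+    raw := "<think>\n347*29 = 347*30 - 347\n</think>\nThe answer is 10063."
+    fmt.Println(stripThinkSketch(raw))
+}
+```
+
+Stripping before the `Check` function runs prevents intermediate reasoning (for example a wrong number tried and discarded) from producing a false pass or fail.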
+
+---
+
+## Agent Orchestrator
+
+The agent subsystem (`agent_*.go`) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3").
+
+### Files
+
+| File | LOC | Responsibility |
+|------|-----|---------------|
+| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` |
+| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `FindUnscored`, `ProcessOne` |
+| `agent_eval.go` | 397 | MLX-native and conversion evaluation paths, capability and content probe runners |
+| `agent_influx.go` | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay |
+| `agent_ssh.go` | 102 | `RemoteTransport` interface, `SSHTransport` implementation, utility helpers |
+
+### Workflow
+
+```
+RunAgentLoop
+    │
+    ├── ReplayInfluxBuffer    (flush any buffered writes from previous failures)
+    ├── DiscoverCheckpoints   (SSH ls on M3 adapter directories)
+    ├── GetScoredLabels       (InfluxDB query for already-scored (run_id, label) pairs)
+    ├── FindUnscored          (set difference, sorted by dirname + iteration)
+    └── ProcessOne (for each unscored checkpoint)
+            │
+            ├── isMLXNative?  YES → processMLXNative   (serve directly via mlx_lm.server)
+            │                  NO → processWithConversion (MLX→GGUF, then llama-server)
+            │
+            ├── RunCapabilityProbes   (23 binary probes)
+            ├── RunContentProbes      (sovereignty probes)
+            ├── ScoreCapabilityAndPush (judge + InfluxDB)
+            └── ScoreContentAndPush   (judge + InfluxDB)
+```
+
+### RemoteTransport
+
+`RemoteTransport` abstracts SSH/SCP so that tests can supply an in-memory fake:
+
+```go
+type RemoteTransport interface {
+    Run(ctx context.Context, cmd string) (string, error)
+    CopyFrom(ctx context.Context, remote, local string) error
+    CopyTo(ctx context.Context, local, remote string) error
+}
+```
+
+`SSHTransport` implements this interface using the system `ssh` and `scp` binaries with a configurable port and timeout. `AgentConfig.Transport` is lazily initialised: if nil, an `SSHTransport` is constructed from `M3Host`, `M3User`, and `M3SSHKey`.
+
+### Checkpoint discovery
+
+`DiscoverCheckpoints` runs `ls -d adapters-*` on the remote host, then for each adapter directory checks for subdirectories matching `gemma-3-*` (supporting nested directory layouts). It then lists `*_adapters.safetensors` files and extracts the iteration number from the filename.
+
+`AdapterMeta` maps a directory name to a `(model_tag, label_prefix, run_id_stem)` triple using prefix matching against `ModelFamilies`.
+
+### Persistence
+
+Results are written to two stores simultaneously:
+
+- **InfluxDB** — line protocol over HTTP. Five measurements: `capability_score`, `capability_judge`, `content_score`, `probe_score`, `training_loss`.
+- **DuckDB** — local analytical database. Two tables: `checkpoint_scores`, `probe_results`.
+
+If InfluxDB is unreachable, results are buffered to `influx_buffer.jsonl` (JSONL, one entry per line). `ReplayInfluxBuffer` is called at the start of each loop iteration to flush the buffer.
+
+---
+
+## Data Pipeline
+
+| File | Purpose |
+|------|---------|
+| `ingest.go` | Load JSONL response files into `[]Response` slices |
+| `db.go` | DuckDB schema creation, insert, and query helpers |
+| `influx.go` | InfluxDB HTTP client (line protocol write, SQL query) |
+| `gguf.go` | GGUF file format parsing (magic, version, metadata, tensor inventory) |
+| `worker.go` | LEM API worker for distributed inference job dispatch |
+| `expand.go` | Prompt expansion using a backend |
+| `normalize.go` | Response normalisation utilities |
+| `parquet.go` | Parquet dataset export |
+
+---
+
+## Test Coverage
+
+| File | Tests | What is covered |
+|------|-------|----------------|
+| `adapter_test.go` | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors |
+| `backend_http_test.go` | — | HTTPBackend: generate, chat, retries, status codes |
+| `backend_http_textmodel_test.go` | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, classify, batch |
+| `backend_llama_test.go` | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants |
+| `backend_mlx_test.go` | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion |
+| `heuristic_test.go` | — | All nine heuristic sub-scores and LEK formula |
+| `judge_test.go` | — | JSON extraction variants, ScoreSemantic, ScoreContent |
+| `exact_test.go` | — | Numeric extraction patterns |
+| `probes_test.go` | — | All 23 capability probe Check functions |
+| `score_test.go` | — | Engine suite selection, ScoreAll grouping |
+| `score_race_test.go` | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes |
+| `agent_test.go` | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport |
+| `benchmark_test.go` | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), Judge round-trip (2 suites), ScoreAll (2 modes), heuristic sub-components (5 stages) |
diff --git a/docs/development.md b/docs/development.md
new file mode 100644
index 0000000..04ce72c
--- /dev/null
+++ b/docs/development.md
@@ -0,0 +1,307 @@
+# go-ml Development Guide
+
+## Prerequisites
+
+### Required
+
+- **Go 1.25.5** or later (the `go` directive in `go.mod` is `go 1.25.5`, which the Go toolchain treats as a minimum version)
+- **Go workspace** — go-ml is part of the `host-uk/core` Go workspace; `replace` directives in `go.mod` resolve sibling modules from local paths
+
+### Required sibling modules (local paths)
+
+| Module | Local path | Notes |
+|--------|-----------|-------|
+| `forge.lthn.ai/core/go` | `../go` | Framework, process management, logging |
+| `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Token interfaces |
+| `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend |
+
+All three must be checked out as siblings of `go-ml` (i.e. all four directories share the same parent).
+
+### Platform-specific
+
+- **Metal GPU (`NewMLXBackend`)** — requires macOS on Apple Silicon (darwin/arm64). The `backend_mlx.go` file carries a `//go:build darwin && arm64` build tag and is excluded on other platforms. All other features work on Linux and amd64.
+- **llama-server** — the `llama-server` binary from llama.cpp must be on `PATH` or the path provided in `LlamaOpts.LlamaPath`.
+- **DuckDB** — uses CGo; a C compiler (`gcc` or `clang`) is required.
+
+---
+
+## Getting Started
+
+```bash
+# On first checkout, populate go.sum
+go mod download
+
+# Verify the build (all platforms)
+go build ./...
+
+# Verify the build excluding Metal backend (Linux / CI)
+GOFLAGS='-tags=nomlx' go build ./...
+```
+
+---
+
+## Build and Test Commands
+
+```bash
+# Run all tests
+go test ./...
+
+# Run with race detector (recommended before committing)
+go test -race ./...
+
+# Run a single test by name
+go test -v -run TestHeuristic ./...
+go test -v -run TestEngine_ScoreAll_ConcurrentSemantic ./...
+
+# Run benchmarks
+go test -bench=. ./...
+go test -bench=BenchmarkHeuristicScore ./...
+
+# Static analysis
+go vet ./...
+
+# Tidy dependencies
+go mod tidy
+```
+
+---
+
+## Test Patterns
+
+### Naming convention
+
+Tests use a `_Good`, `_Bad`, `_Ugly` suffix pattern:
+
+- `_Good` — happy path (expected success)
+- `_Bad` — expected error conditions (invalid input, unreachable server)
+- `_Ugly` — panic and edge-case paths
+
+### Mock backends
+
+For tests that exercise `Backend`-dependent code (judge, agent, scoring engine) without a real inference server, implement `Backend` directly:
+
+```go
+type mockBackend struct {
+    response string
+    err      error
+}
+
+func (m *mockBackend) Generate(_ context.Context, _ string, _ ml.GenOpts) (string, error) {
+    return m.response, m.err
+}
+func (m *mockBackend) Chat(_ context.Context, _ []ml.Message, _ ml.GenOpts) (string, error) {
+    return m.response, m.err
+}
+func (m *mockBackend) Name() string    { return "mock" }
+func (m *mockBackend) Available() bool { return true }
+```
+
+### Mock TextModel
+
+For tests that exercise `InferenceAdapter` without Metal GPU hardware, implement `inference.TextModel`:
+
+```go
+type mockTextModel struct {
+    tokens []string
+    err    error
+}
+
+func (m *mockTextModel) Generate(ctx context.Context, prompt string, opts ...inference.GenerateOption) iter.Seq[inference.Token] {
+    return func(yield func(inference.Token) bool) {
+        for _, t := range m.tokens {
+            if !yield(inference.Token{Text: t}) {
+                return
+            }
+        }
+    }
+}
+// ... implement remaining TextModel methods
+func (m *mockTextModel) Err() error { return m.err }
+```
+
+### Mock RemoteTransport
+
+For agent tests that would otherwise require an SSH connection:
+
+```go
+type fakeTransport struct {
+    outputs map[string]string
+    errors  map[string]error
+}
+
+func (f *fakeTransport) Run(_ context.Context, cmd string) (string, error) {
+    if err, ok := f.errors[cmd]; ok {
+        return "", err
+    }
+    return f.outputs[cmd], nil
+}
+func (f *fakeTransport) CopyFrom(_ context.Context, _, _ string) error { return nil }
+func (f *fakeTransport) CopyTo(_ context.Context, _, _ string) error   { return nil }
+```
+
+Inject via `AgentConfig.Transport`:
+
+```go
+cfg := &ml.AgentConfig{
+    Transport: &fakeTransport{outputs: map[string]string{...}},
+}
+```
+
+### HTTP mock server
+
+For `HTTPBackend` tests, use `net/http/httptest`:
+
+```go
+srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+    json.NewEncoder(w).Encode(map[string]any{
+        "choices": []map[string]any{
+            {"message": map[string]string{"role": "assistant", "content": "hello"}},
+        },
+    })
+}))
+defer srv.Close()
+backend := ml.NewHTTPBackend(srv.URL, "test-model")
+```
+
+---
+
+## Adding a New Backend
+
+A backend must implement `ml.Backend`:
+
+```go
+type Backend interface {
+    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
+    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
+    Name() string
+    Available() bool
+}
+```
+
+### Steps
+
+1. Create `backend_{name}.go` in the package root.
+2. Add the `// SPDX-License-Identifier: EUPL-1.2` header (the SPDX tag keeps its standard American spelling).
+3. Add a compile-time interface check:
+   ```go
+   var _ Backend = (*MyBackend)(nil)
+   ```
+4. Implement `Generate` as a thin wrapper around `Chat` where possible (follows the pattern of `HTTPBackend`).
+5. Create `backend_{name}_test.go` with `_Good`, `_Bad`, and interface-compliance tests.
+6. Register the backend in `service.go`'s `OnStartup` if it warrants lifecycle management, or document that callers must register it via `Service.RegisterBackend`.
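+
+Steps 3 and 4 can be sketched as below. To keep the example self-contained, `Message`, `GenOpts`, and `Backend` are local stand-ins for the `ml` types, and `EchoBackend` and `demo` are hypothetical names invented for illustration.
+
+```go
+package main
+
+import (
+    "context"
+    "errors"
+    "fmt"
+)
+
+// Local stand-ins for ml.Message, ml.GenOpts and ml.Backend.
+type Message struct{ Role, Content string }
+type GenOpts struct{ MaxTokens int }
+type Backend interface {
+    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
+    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
+    Name() string
+    Available() bool
+}
+
+// EchoBackend is a hypothetical backend that echoes the last message.
+type EchoBackend struct{}
+
+// Compile-time interface check (step 3).
+var _ Backend = (*EchoBackend)(nil)
+
+// Generate wraps the prompt in a single user message and defers to Chat (step 4).
+func (b *EchoBackend) Generate(ctx context.Context, prompt string, opts GenOpts) (string, error) {
+    return b.Chat(ctx, []Message{{Role: "user", Content: prompt}}, opts)
+}
+
+// Chat honours context cancellation, then echoes the final message.
+func (b *EchoBackend) Chat(ctx context.Context, messages []Message, _ GenOpts) (string, error) {
+    if err := ctx.Err(); err != nil {
+        return "", err
+    }
+    if len(messages) == 0 {
+        return "", errors.New("echo: no messages")
+    }
+    return "echo: " + messages[len(messages)-1].Content, nil
+}
+
+func (b *EchoBackend) Name() string    { return "echo" }
+func (b *EchoBackend) Available() bool { return true }
+
+// demo exercises the Generate-over-Chat path.
+func demo() (string, error) {
+    return (&EchoBackend{}).Generate(context.Background(), "hi", GenOpts{})
+}
+
+func main() {
+    fmt.Println(demo())
+}
+```
+
+Routing `Generate` through `Chat` leaves a single request path to test and keeps retry or error handling in one place.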
+
+### GPU backends
+
+If the backend wraps a `go-inference.TextModel` (e.g. a new hardware accelerator), use `InferenceAdapter` rather than re-implementing the polling/streaming logic:
+
+```go
+m, err := myBackendPackage.LoadModel(modelPath)
+if err != nil {
+    return nil, err
+}
+return ml.NewInferenceAdapter(m, "my-backend"), nil
+```
+
+---
+
+## Adding a New Scoring Suite
+
+1. Add a new scoring function or type in a dedicated file (e.g. `my_suite.go`).
+2. Add the suite name to `Engine.NewEngine`'s suite selection logic in `score.go`.
+3. Add a result field to `PromptScore` in `types.go`.
+4. Add the goroutine fan-out case in `Engine.ScoreAll` in `score.go`.
+5. Add race condition tests in `score_race_test.go`.
+
+---
+
+## Coding Standards
+
+### Language
+
+Use **UK English** throughout: colour, organisation, centre, licence (noun), authorise. The only exception is identifiers in external APIs that use American spellings — do not rename those.
+
+### File headers
+
+Every new non-test source file must begin with:
+
+```go
+// SPDX-License-Identifier: EUPL-1.2
+```
+
+### Strict types
+
+All parameters and return types must be explicitly typed. Avoid `interface{}` or `any` except at JSON unmarshalling boundaries.
+
+### Import grouping
+
+Three groups, each separated by a blank line:
+
+```go
+import (
+    "context"           // stdlib
+    "fmt"
+
+    "forge.lthn.ai/core/go-inference"  // forge.lthn.ai modules
+
+    "github.com/stretchr/testify/assert"  // third-party
+)
+```
+
+### Error wrapping
+
+Use `fmt.Errorf("context: %w", err)` for wrapping. Use `log.E("pkg.Type.Method", "what failed", err)` from the Core framework for structured error logging with stack context.
+
+### Concurrency
+
+- Protect shared maps with `sync.RWMutex` or `sync.Mutex` as appropriate.
+- Use semaphore channels (buffered `chan struct{}`) to bound goroutine concurrency rather than `sync.Pool` or `errgroup` with fixed limits.
+- Always check `model.Err()` after exhausting a `go-inference` token iterator — the iterator itself carries no error; the error is stored on the model.
+
+---
+
+## Conventional Commits
+
+Use the following scopes:
+
+| Scope | When to use |
+|-------|-------------|
+| `backend` | Changes to any `backend_*.go` file or the `adapter.go` bridge |
+| `scoring` | Changes to `score.go`, `heuristic.go`, `judge.go`, `exact.go` |
+| `probes` | Changes to `probes.go` or capability probe definitions |
+| `agent` | Changes to any `agent_*.go` file |
+| `service` | Changes to `service.go` or `Options` |
+| `types` | Changes to `types.go` or `inference.go` interfaces |
+| `gguf` | Changes to `gguf.go` |
+
+Examples:
+
+```
+feat(backend): add ROCm backend via go-rocm InferenceAdapter
+fix(scoring): handle nil ContentScores when content probe not found
+refactor(agent): replace SSHCommand with SSHTransport.Run
+test(probes): add Check function coverage for all 23 probes
+```
+
+---
+
+## Co-Author and Licence
+
+Every commit must include:
+
+```
+Co-Authored-By: Virgil <virgil@lethean.io>
+```
+
+The licence is **EUPL-1.2**. All source files carry the SPDX identifier in the header. Do not add licence headers to test files; the package-level declaration covers them.
+
+---
+
+## Forge Remote
+
+The authoritative remote is `forge.lthn.ai/core/go-ml`:
+
+```bash
+git push forge main
+```
+
+The SSH remote URL is `ssh://git@forge.lthn.ai:2223/core/go-ml.git`. HTTPS authentication is not configured — always push via SSH.
diff --git a/docs/history.md b/docs/history.md
new file mode 100644
index 0000000..7fe54d1
--- /dev/null
+++ b/docs/history.md
@@ -0,0 +1,194 @@
+# go-ml Project History
+
+## Origin: Extraction from go-ai (19 February 2026)
+
+go-ml began as the `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. The monolith had grown to approximately 14,000 LOC and 53% of that was the ML subsystem. The ML code had zero internal dependencies on the rest of go-ai — it imported only `go-mlx` (external) and the Core `go` framework. The extraction was therefore clean: lift the directory, adjust the module path, and update the one import in go-ai that referenced it.
+
+**What was extracted:**
+
+- 41 Go source files (~7,494 LOC, excluding tests)
+- 6 test files covering backends, heuristic, judge, exact, probes, and score
+- All InfluxDB, DuckDB, Parquet, GGUF, and agent code
+
+**After extraction:**
+
+- go-ai dropped from ~14,000 to ~3,400 LOC (the `ai/` facade and `mcp/` hub remain there)
+- go-ml became an independent module at `forge.lthn.ai/core/go-ml`
+
+---
+
+## Phase 1: go-inference Migration (Complete)
+
+**Commit range:** `c3c2c14` (initial fix) through adapter and reverse adapter work.
+
+**Problem:** The original `backend_mlx.go` imported subpackages from go-mlx (`go-mlx/cache`, `go-mlx/model`, `go-mlx/sample`, `go-mlx/tokenizer`) that no longer existed after go-mlx's Phase 4 refactoring. The file was 253 LOC of hand-rolled tokenisation, KV cache management, sampling loops, and memory cleanup — and none of it compiled.
+
+**Solution:** Introduce `go-inference` as the abstraction layer between go-ml and hardware backends.
+
+### Step 1.1 — Add go-inference dependency
+
+Added `forge.lthn.ai/core/go-inference` to `go.mod` with a `replace` directive pointing to the local sibling checkout.
+
+### Step 1.2 — Write InferenceAdapter (`adapter.go`)
+
+Created `InferenceAdapter`, which wraps a `go-inference.TextModel` (returning `iter.Seq[Token]`) and exposes it as `ml.Backend` + `ml.StreamingBackend` (returning strings / calling `TokenCallback`). Thirteen test cases verified token collection, streaming, callback error propagation, context cancellation, empty output, and model errors after partial generation.
+
+Key design decision: after exhausting the iterator, `model.Err()` is checked separately. The iterator itself does not carry errors; partial output is returned alongside the error so callers can decide whether to use or discard it.
+
+### Step 1.3 — Rewrite `backend_mlx.go`
+
+Replaced 253 LOC with approximately 35 LOC. The blank import `_ "forge.lthn.ai/core/go-mlx"` registers the Metal backend via go-mlx's `init()`. `inference.LoadModel()` then handles model loading, and `InferenceAdapter` handles the rest.
+
+Memory controls (cache limits, memory limits) were deferred: go-mlx handles them internally, and callers that need explicit control can call `mlx.SetCacheLimit()` directly.
+
+### Step 1.4 — Reverse adapters (`backend_http_textmodel.go`)
+
+Added `HTTPTextModel` and `LlamaTextModel`, which wrap the existing `ml.Backend` implementations to satisfy `inference.TextModel`. This enables HTTP and llama-server backends to be used in packages (go-ai, go-i18n) that consume the go-inference interface. Since HTTP backends return complete strings rather than streaming tokens, each response is yielded as a single `Token`.
+
+17 tests for `HTTPTextModel` and 2 for `LlamaTextModel` all pass.
+
+### Step 1.5 — Downstream verification
+
+Confirmed that `service.go` (`Backend.Generate()`), `judge.go` (`judgeChat()`), and `go-ai/mcp/tools_ml.go` (`ml.Service`) required no changes — `InferenceAdapter` satisfies `ml.Backend`, and the existing consumers are unaffected.
+
+---
+
+## Phase 2: Backend Consolidation (Complete)
+
+**Commit range:** `747e703` (Message unification) through `convertOpts` extension.
+
+**Audit (Virgil, 20 February 2026):** Only two files in the entire ecosystem call `StreamingBackend` methods: `host-uk/cli/cmd/ml/cmd_serve.go` (SSE streaming at `/v1/completions` and `/v1/chat/completions`) and `cmd/ml/cmd_chat.go` (interactive terminal token echo). All other consumers use `Backend.Generate()` only.
+
+### Step 2.1 — Unify Message types
+
+`ml.Message` was a separate struct identical to `inference.Message`. Replaced with a type alias:
+
+```go
+type Message = inference.Message
+```
+
+This eliminated the `convertMessages()` helper from `adapter.go` and all explicit conversion sites. Backward-compatible: all existing callers continue to use `ml.Message` and compile unchanged.
+
+### Step 2.2 — Extend GenOpts
+
+Added `TopK`, `TopP`, and `RepeatPenalty` to `ml.GenOpts` to match the fields available in `inference.GenerateConfig`. Updated `convertOpts()` in `adapter.go` to map the new fields. Existing callers that only set `Temperature`, `MaxTokens`, and `Model` continue to work unchanged.
+
+**Field type note:** `inference.GenerateConfig` uses `float32` for temperature and sampling fields; `ml.GenOpts` uses `float64` to match the conventions in the rest of go-ml. `convertOpts()` performs the narrowing conversion explicitly.
+
+### Step 2.3 — Deprecate StreamingBackend
+
+Added deprecation comment to `StreamingBackend` in `inference.go`. The interface is not removed because `host-uk/cli` depends on it. Migration of those CLI files is out of scope for go-ml.
+
+### Step 2.4 — Document backend architecture
+
+Added the "Backend Architecture" section to `CLAUDE.md` documenting the two interface families, adapter directions, and migration guidance.
+
+---
+
+## Phase 3: Agent Loop Modernisation (Complete)
+
+The original `agent.go` was a 1,070 LOC file mixing SSH commands, InfluxDB line protocol construction, probe evaluation, checkpoint discovery, and JSONL buffering. It had zero tests.
+
+### Step 3.1 — Split into five files (Commit `eae9ec9`)
+
+| File | LOC | Contents |
+|------|-----|---------|
+| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` |
+| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `GetScoredLabels`, `FindUnscored`, `ProcessOne`, `isMLXNative` |
+| `agent_eval.go` | 397 | `processMLXNative`, `processWithConversion`, `RunCapabilityProbes`, `RunCapabilityProbesFull`, `RunContentProbes`, `ProbeResult` types |
+| `agent_influx.go` | 291 | `ScoreCapabilityAndPush`, `ScoreContentAndPush`, `PushCapability*`, `BufferInfluxResult`, `ReplayInfluxBuffer` |
+| `agent_ssh.go` | 102 | `SSHCommand`, `SCPFrom`, `SCPTo`, `fileBase`, `EnvOr`, `IntEnvOr`, `ExpandHome` |
+
+`go build ./...`, `go test ./...`, and `go vet ./...` all passed after the split.
+
+### Step 3.2 — Abstract SSH transport (Commit `1c2a6a6`)
+
+Introduced the `RemoteTransport` interface with `Run`, `CopyFrom`, and `CopyTo` methods. `SSHTransport` implements this interface using the system `ssh` and `scp` binaries with functional options (`WithPort`, `WithTimeout`). `AgentConfig.Transport` accepts any `RemoteTransport`, with lazy initialisation to an `SSHTransport` when nil.
+
+The old package-level functions `SSHCommand`, `SCPFrom`, and `SCPTo` are retained as deprecated wrappers that delegate to `AgentConfig.Transport`.
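+
+The shape of this abstraction might look like the sketch below. The interface and option names come from the text above; the method signatures, the default port and timeout, the `Host` field, and the `transport()` helper are assumptions made for illustration. The `SSHTransport` methods are stubs rather than real `ssh`/`scp` invocations.
+
+```go
+package main
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"time"
+)
+
+// RemoteTransport has the three methods named in the text. The parameter
+// and return types are assumptions.
+type RemoteTransport interface {
+	Run(ctx context.Context, cmd string) (string, error)
+	CopyFrom(ctx context.Context, remote, local string) error
+	CopyTo(ctx context.Context, local, remote string) error
+}
+
+var errSketch = errors.New("ssh transport: not implemented in this sketch")
+
+// SSHTransport is stubbed here; the real type shells out to ssh and scp.
+type SSHTransport struct {
+	Host    string
+	Port    int
+	Timeout time.Duration
+}
+
+func (t *SSHTransport) Run(context.Context, string) (string, error)    { return "", errSketch }
+func (t *SSHTransport) CopyFrom(context.Context, string, string) error { return errSketch }
+func (t *SSHTransport) CopyTo(context.Context, string, string) error   { return errSketch }
+
+// SSHOption is a functional option applied by NewSSHTransport.
+type SSHOption func(*SSHTransport)
+
+func WithPort(p int) SSHOption              { return func(t *SSHTransport) { t.Port = p } }
+func WithTimeout(d time.Duration) SSHOption { return func(t *SSHTransport) { t.Timeout = d } }
+
+// NewSSHTransport applies options over assumed defaults (port 22, 30s).
+func NewSSHTransport(host string, opts ...SSHOption) *SSHTransport {
+	t := &SSHTransport{Host: host, Port: 22, Timeout: 30 * time.Second}
+	for _, opt := range opts {
+		opt(t)
+	}
+	return t
+}
+
+// fakeTransport is a test double that returns canned output per command.
+type fakeTransport struct {
+	outputs map[string]string
+}
+
+func (f *fakeTransport) Run(_ context.Context, cmd string) (string, error) {
+	out, ok := f.outputs[cmd]
+	if !ok {
+		return "", fmt.Errorf("fake transport: unexpected command %q", cmd)
+	}
+	return out, nil
+}
+func (f *fakeTransport) CopyFrom(context.Context, string, string) error { return nil }
+func (f *fakeTransport) CopyTo(context.Context, string, string) error   { return nil }
+
+// AgentConfig is reduced to the two fields this sketch needs.
+type AgentConfig struct {
+	Host      string
+	Transport RemoteTransport
+}
+
+// transport returns the configured transport, creating an SSHTransport
+// on first use when none was supplied.
+func (c *AgentConfig) transport() RemoteTransport {
+	if c.Transport == nil {
+		c.Transport = NewSSHTransport(c.Host)
+	}
+	return c.Transport
+}
+
+func main() {
+	fake := &fakeTransport{outputs: map[string]string{"ls adapters": "a.safetensors"}}
+	cfg := &AgentConfig{Transport: fake}
+	out, err := cfg.transport().Run(context.Background(), "ls adapters")
+	fmt.Println(out, err)
+}
+```
+
+Because callers only see the interface, tests can inject a fake and never start an `ssh` process, while production code that leaves `Transport` nil still gets an SSH-backed default.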
+
+### Step 3.3 — Extract hardcoded infrastructure (Commit `12f3a1c`)
+
+Replaced magic values scattered across 7 files with 15 named constants:
+
+- `EpochBase` — InfluxDB timestamp origin (Unix timestamp for 15 February 2025 00:00 UTC)
+- Five InfluxDB measurement names (`MeasurementCapabilityScore`, `MeasurementCapabilityJudge`, `MeasurementContentScore`, `MeasurementProbeScore`, `MeasurementTrainingLoss`)
+- Two DuckDB table names (`TableCheckpointScores`, `TableProbeResults`)
+- Probe evaluation defaults (`CapabilityTemperature`, `CapabilityMaxTokens`, `ContentTemperature`, `ContentMaxTokens`, `MaxStoredResponseLen`)
+- `InfluxBufferFile` — JSONL buffer filename
+- `LogSeparatorWidth` — banner line width
+
+Hardcoded probe counts were replaced with `len(CapabilityProbes)` and `len(ContentProbes)`.
+
+### Step 3.4 — Agent tests (Commit `3e22761`)
+
+First test coverage for the agent subsystem:
+
+- `AdapterMeta()` — 8 tests: known families (12 entries), variant suffixes, subdirectory patterns, unknown fallback, no-prefix edge case
+- `FindUnscored()` — 5 tests: all unscored (sorted), some scored, all scored, empty input, nil scored map
+- `BufferInfluxResult()`/`ReplayInfluxBuffer()` — 4 tests: JSONL round-trip, multiple entries, empty file, missing file
+- `DiscoverCheckpoints()` — 6 tests using `fakeTransport`: 3 checkpoints across 2 dirs, subdirectory pattern, no adapters, SSH error, filter pattern, no safetensors files
+
+---
+
+## Phase 4: Test Coverage (Complete, Commit `09bf403`)
+
+Added four test files covering previously untested areas:
+
+**`backend_llama_test.go`** (20 tests) — Uses `net/http/httptest` to mock the llama-server HTTP API. Covers: `Name`, `Available` (4 variants including process-not-started and health endpoint failure), `Generate` (6 variants including context cancellation, empty choices, and opts forwarding), `Chat` (3 variants), `Stop`, constructor (4 variants), and interface compliance.
+
+**`backend_mlx_test.go`** (8 tests) — Uses a mock `inference.TextModel`. No build tag required — tests run on all platforms without Metal GPU hardware. Covers: `Generate`, `Chat`, streaming, model error after partial output, `Close`, direct model access via `Model()`, interface compliance, and `convertOpts` field mapping.
+
+**`score_race_test.go`** (6 tests) — Race condition tests run with `-race`:
+- `ConcurrentSemantic` — 20 responses scored with concurrency=4; verifies no data races on the result map
+- `ConcurrentMixedSuites` — semantic + standard + content fan-out simultaneously
+- `SemaphoreBoundary` — concurrency=1; verifies that at most 1 goroutine holds the semaphore at once
+- `ContextCancellation` — 400 error response from judge returns nil semantic score without panicking
+- `HeuristicOnlyNoRace` — 50 responses, heuristic only (no goroutines spawned); regression check
+- `MultiModelConcurrent` — 4 models × 5 concurrent goroutines writing to the results map
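+
+The pattern these tests exercise is a bounded fan-out: a buffered channel acts as the semaphore and a mutex guards the shared result map. The sketch below is a generic illustration of that pattern, not the scoring engine's code. `scoreAll` and its parameters are hypothetical names, and it also reports the peak number of goroutines in flight, which is the property a `SemaphoreBoundary`-style test would check.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+)
+
+// scoreAll calls score for every id with at most limit goroutines in
+// flight. It returns the results keyed by id and the peak concurrency
+// observed while holding a semaphore slot.
+func scoreAll(ids []string, limit int, score func(string) float64) (map[string]float64, int) {
+	if limit < 1 {
+		limit = 1 // an unbuffered semaphore would block forever
+	}
+	var (
+		mu           sync.Mutex
+		wg           sync.WaitGroup
+		active, peak int
+	)
+	results := make(map[string]float64, len(ids))
+	sem := make(chan struct{}, limit)
+
+	for _, id := range ids {
+		wg.Add(1)
+		sem <- struct{}{} // blocks while limit goroutines hold a slot
+		go func(id string) {
+			defer wg.Done()
+			defer func() { <-sem }() // release the slot after the body finishes
+
+			mu.Lock()
+			active++
+			if active > peak {
+				peak = active
+			}
+			mu.Unlock()
+
+			s := score(id)
+
+			mu.Lock()
+			results[id] = s // map writes are serialised by mu
+			active--
+			mu.Unlock()
+		}(id)
+	}
+	wg.Wait()
+	return results, peak
+}
+
+func main() {
+	ids := []string{"a", "b", "c", "d", "e", "f"}
+	results, peak := scoreAll(ids, 2, func(id string) float64 { return float64(len(id)) })
+	fmt.Println(len(results), peak <= 2)
+}
+```
+
+Running such code under `go test -race` reports any unsynchronised access to `results`; removing the mutex around the map write is enough to make the detector fire.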
+
+**`benchmark_test.go`** (25 benchmarks, baselines on M3 Ultra):
+- `HeuristicScore` — 5 input sizes (100–10,000 characters): 25µs–8.8ms
+- `ExactMatch` — 4 patterns: 171ns–2.1µs
+- `JudgeExtractJSON` — 6 response variants: 2.5–3.4µs
+- `Judge` round-trip — 2 suites (semantic, content): ~52µs
+- `ScoreAll` — 2 modes (heuristic only, full): 25µs–4.5ms
+- Sub-components — 5 heuristic stages: 244ns–88µs
+
+---
+
+## Known Limitations
+
+### StreamingBackend retention
+
+`ml.StreamingBackend` cannot be removed until `host-uk/cli/cmd/ml/cmd_serve.go` and `cmd/ml/cmd_chat.go` are migrated to use `inference.TextModel` iterators directly. That migration is out of scope for go-ml and must be tracked in the `host-uk/cli` repository.
+
+### LlamaTextModel streaming gap
+
+`LlamaTextModel` implements `inference.TextModel` but does not actually stream tokens — it yields the complete llama-server HTTP response as a single `Token`. True token-level streaming from llama-server would require implementing SSE parsing, which is a separate effort.
+
+### Agent infrastructure coupling
+
+`AgentConfig` contains fields (`M3Host`, `M3User`, `M3SSHKey`, `M3AdapterBase`, `InfluxURL`, `InfluxDB`) that are tightly coupled to a specific deployment topology (M3 Mac + InfluxDB on `10.69.69.165`). While the `RemoteTransport` abstraction decouples tests from SSH, production deployments still hardcode the M3 as the checkpoint host.
+
+### EpochBase timestamp
+
+The `EpochBase` constant (`1739577600`, corresponding to 15 February 2025 00:00 UTC) is embedded in InfluxDB line protocol timestamps. All capability/content/probe timestamps derive from this base plus checkpoint iteration offsets. Changing `EpochBase` would require rewriting all historical InfluxDB data.
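+
+The arithmetic can be checked with a short sketch. The constant value is the one quoted above. The `pointTime` helper and its one-second-per-iteration offset are assumptions for illustration only, since the exact offset scheme is not described here.
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// EpochBase is the Unix timestamp quoted in the text.
+const EpochBase int64 = 1739577600
+
+// pointTime derives a timestamp for a checkpoint iteration.
+// Assumption: one second per iteration; the real offset scheme may differ.
+func pointTime(iteration int64) time.Time {
+	return time.Unix(EpochBase+iteration, 0).UTC()
+}
+
+func main() {
+	// Confirms the date stated for EpochBase.
+	fmt.Println(time.Unix(EpochBase, 0).UTC().Format(time.RFC3339)) // 2025-02-15T00:00:00Z
+
+	// InfluxDB line protocol timestamps default to nanosecond precision.
+	fmt.Println(pointTime(1500).UnixNano())
+}
+```
+
+Because every stored point encodes `EpochBase` in its timestamp, a different base would shift new points relative to old ones, which is why changing it implies rewriting historical data.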
+
+### HTTPBackend classify
+
+`HTTPTextModel.Classify` returns an "unsupported" error. Classification cannot be added to an OpenAI-compatible HTTP backend without either a dedicated classification endpoint or prompt engineering.
+
+### DuckDB CGo
+
+The `go-duckdb` dependency requires CGo. Cross-compiling from macOS to Linux therefore needs a C cross-compilation toolchain, and any program that imports go-ml needs a C compiler at build time.
+
+---
+
+## Future Considerations
+
+- **ROCm backend** — `go-rocm` provides a llama-server subprocess backend for AMD GPUs. Once published, it can be wrapped with `InferenceAdapter` in the same pattern as `backend_mlx.go`, gated with a `//go:build linux && amd64` constraint.
+- **StreamingBackend removal** — Once `host-uk/cli` is migrated to `iter.Seq[Token]`, the `StreamingBackend` interface and `InferenceAdapter`'s `GenerateStream`/`ChatStream` methods can be removed.
+- **go-i18n integration** — go-i18n Phase 2a requires 5,000 sentences/second classification throughput from Gemma3-1B. The `InferenceAdapter` and `inference.TextModel.BatchGenerate` provide the interface; the performance target depends on go-mlx's batching implementation.
+- **LEM Lab pipeline wiring** — Integration tests for `backend_mlx.go` with a real model are deferred until the LEM Lab inference pipeline is fully wired. A smoke test against a small quantised model would confirm end-to-end Metal GPU inference through the go-inference abstraction.
+- **Charm SSH** — The `SSHTransport` currently shells out to the system `ssh` and `scp` binaries. Replacing these with pure-Go SSH via `charmbracelet/keygen` and a native SSH client would eliminate the subprocess dependency and improve testability.