docs: comprehensive domain expert session brief for go-inference migration

Rewrites CLAUDE.md with full interface mapping (ml.Backend → inference.TextModel),
adapter design pattern, broken import context, and local dependency paths.

Expands TODO.md Phase 1 into 5 concrete steps with code patterns for
InferenceAdapter, backend_mlx.go rewrite, and downstream verification.

Updates FINDINGS.md with type correspondence table and before/after comparison.

Fixes go.mod replace directives for ~/Code/ layout and adds go-inference.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider 2026-02-19 21:55:50 +00:00
parent 88e926cc24
commit 84757b8331
4 changed files with 370 additions and 146 deletions

CLAUDE.md

@ -1,72 +1,162 @@
# CLAUDE.md — go-ml Domain Expert Guide
## What This Is
You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates your work via TODO.md. Pick up tasks in phase order, mark `[x]` when done, commit and push.
## What This Package Does
ML inference backends, scoring engine, and agent orchestrator. 7.5K LOC across 41 Go files. Provides:
- **Pluggable inference backends** — MLX/Metal (darwin/arm64), llama.cpp (subprocess), HTTP/Ollama (OpenAI-compatible)
- **Multi-suite scoring engine** — Heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K)
- **23 capability probes** — Binary pass/fail tests across 16 categories (math, logic, code, etc.)
- **GGUF model management** — Format parsing, conversion, inventory
- **Agent orchestrator** — SSH checkpoint discovery, InfluxDB streaming, batch evaluation
## Critical Context: go-inference Migration
**This is the #1 priority.** Phase 1 in TODO.md.
The package currently defines its own `Backend` interface that returns `(string, error)`. The shared `go-inference` package defines `TextModel` which returns `iter.Seq[Token]` (Go 1.23+ range-over-func). Everything downstream is blocked until go-ml bridges these two interfaces.
### Interface Gap
```
go-ml (CURRENT)                           go-inference (TARGET)
───────────────                           ─────────────────────
Backend.Generate(ctx, prompt, GenOpts)    TextModel.Generate(ctx, prompt, ...GenerateOption)
  → (string, error)                         → iter.Seq[Token]

Backend.Chat(ctx, messages, GenOpts)      TextModel.Chat(ctx, messages, ...GenerateOption)
  → (string, error)                         → iter.Seq[Token]

StreamingBackend.GenerateStream(          (streaming is built-in via iter.Seq)
  ctx, prompt, opts, TokenCallback)
  → error

GenOpts{Temperature, MaxTokens, Model}    GenerateConfig{MaxTokens, Temperature,
                                            TopK, TopP, StopTokens, RepeatPenalty}
                                          (configured via WithMaxTokens(n) etc.)
```
### What the Adapter Must Do
```go
// InferenceAdapter wraps go-inference.TextModel to satisfy ml.Backend + ml.StreamingBackend.
// This is the bridge between the new iterator-based API and the legacy string-return API.
type InferenceAdapter struct {
	model inference.TextModel
}

// Generate collects all tokens from the iterator into a string.
func (a *InferenceAdapter) Generate(ctx context.Context, prompt string, opts GenOpts) (string, error) {
	genOpts := convertOpts(opts) // GenOpts → []inference.GenerateOption
	var buf strings.Builder
	for tok := range a.model.Generate(ctx, prompt, genOpts...) {
		buf.WriteString(tok.Text)
	}
	if err := a.model.Err(); err != nil {
		return buf.String(), err
	}
	return buf.String(), nil
}

// GenerateStream yields tokens to the callback as they arrive.
func (a *InferenceAdapter) GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error {
	genOpts := convertOpts(opts)
	for tok := range a.model.Generate(ctx, prompt, genOpts...) {
		if err := cb(tok.Text); err != nil {
			return err
		}
	}
	return a.model.Err()
}
```
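The `convertOpts` helper referenced above is the remaining piece. A minimal sketch of the mapping, using local stand-in types for the go-inference functional options (the `GenerateOption`/`WithTemperature`/`WithMaxTokens` names follow TODO.md, but the stand-in definitions here are assumptions, not the real package):

```go
package main

import "fmt"

// Stand-ins for the go-inference option types (names assumed from TODO.md).
type GenerateConfig struct {
	Temperature float32
	MaxTokens   int
}
type GenerateOption func(*GenerateConfig)

func WithTemperature(t float32) GenerateOption { return func(c *GenerateConfig) { c.Temperature = t } }
func WithMaxTokens(n int) GenerateOption       { return func(c *GenerateConfig) { c.MaxTokens = n } }

// Legacy options struct from go-ml.
type GenOpts struct {
	Temperature float64
	MaxTokens   int
	Model       string
}

// convertOpts maps legacy GenOpts fields onto functional options.
// GenOpts.Model is ignored: the TextModel is already bound to a loaded model.
func convertOpts(opts GenOpts) []GenerateOption {
	var out []GenerateOption
	if opts.Temperature > 0 {
		out = append(out, WithTemperature(float32(opts.Temperature)))
	}
	if opts.MaxTokens > 0 {
		out = append(out, WithMaxTokens(opts.MaxTokens))
	}
	return out
}

func main() {
	cfg := GenerateConfig{}
	for _, o := range convertOpts(GenOpts{Temperature: 0.7, MaxTokens: 2048}) {
		o(&cfg)
	}
	fmt.Println(cfg.Temperature, cfg.MaxTokens)
}
```

Only non-zero fields produce options, so a zero-value `GenOpts` falls through to the model's defaults.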
### backend_mlx.go Is Broken
After go-mlx Phase 4, the old subpackage imports no longer exist:
- `forge.lthn.ai/core/go-mlx/cache` — **REMOVED** (now `internal/metal`)
- `forge.lthn.ai/core/go-mlx/model` — **REMOVED** (now `internal/metal`)
- `forge.lthn.ai/core/go-mlx/sample` — **REMOVED** (now `internal/metal`)
- `forge.lthn.ai/core/go-mlx/tokenizer` — **REMOVED** (now `internal/metal`)
The new go-mlx public API is:
```go
import (
	"forge.lthn.ai/core/go-inference"
	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init()
)

m, err := inference.LoadModel("/path/to/model/", inference.WithContextLen(4096))
if err != nil {
	// handle load failure
}
defer m.Close()
for tok := range m.Generate(ctx, "prompt", inference.WithMaxTokens(128)) {
	fmt.Print(tok.Text)
}
```
**The rewrite**: Delete the 253 LOC of manual tokenisation/KV cache/sampling. Replace with ~60 LOC that loads via go-inference and wraps in `InferenceAdapter`.
## Commands
```bash
go test ./... # Run all tests (some will fail until Phase 1)
go test -v -run TestHeuristic # Single test
go test -bench=. ./... # Benchmarks (none exist yet)
go test -race ./... # Race detector
go vet ./... # Static analysis
```
**Note**: `backend_mlx.go` won't compile until rewritten (Phase 1). Its `//go:build darwin && arm64` constraint means non-darwin builds skip it automatically (`go test -tags '!darwin'` is not valid — the `-tags` flag cannot negate tags). To type-check the rest of the package without the MLX file on a Mac, cross-target another OS:
```bash
GOOS=linux go vet ./...   # excludes darwin/arm64-gated files
```
## Local Dependencies
All resolve via `replace` directives in go.mod:
| Module | Local Path | Notes |
|--------|-----------|-------|
| `forge.lthn.ai/core/go` | `../host-uk/core` | Framework (ServiceRuntime, process, log) |
| `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend (darwin/arm64 only) |
| `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Backend interfaces |
## Architecture
### Backends (pluggable inference)
| File | Backend | Status |
|------|---------|--------|
| `backend_mlx.go` | MLX/Metal GPU | **BROKEN** — old imports, needs Phase 1 rewrite |
| `backend_llama.go` | llama-server subprocess | Works, needs go-inference wrapper |
| `backend_http.go` | HTTP API (OpenAI-compatible) | Works, needs go-inference wrapper |
| `ollama.go` | Ollama helpers | Works |
### Scoring Engine
| File | LOC | Purpose |
|------|-----|---------|
| `score.go` | 212 | Concurrent scoring orchestrator (semaphore-bounded workers) |
| `heuristic.go` | 258 | 9 regex-based metrics, LEK composite score |
| `judge.go` | 205 | LLM-as-judge (6 scoring methods) |
| `exact.go` | 77 | GSM8K exact-match with numeric extraction |
| `probes.go` | 273 | 23 binary capability probes across 16 categories |
### Data Pipeline
| File | LOC | Purpose |
|------|-----|---------|
| `agent.go` | 1,070 | Scoring agent (SSH checkpoint discovery, InfluxDB) |
| `worker.go` | 403 | LEM API worker for distributed inference |
| `service.go` | 162 | Core framework integration (lifecycle, backend registry) |
| `ingest.go` | 384 | JSONL response loading |
| `db.go` | 258 | DuckDB analytics storage |
| `gguf.go` | 369 | GGUF model format parsing |
### Monitoring
| File | Purpose |
|------|---------|
| `metrics.go` | Metrics tracking |
| `influx.go` | InfluxDB integration |
| `status.go` | Status reporting |
## Dependencies
- `forge.lthn.ai/core/go` — Framework (ServiceRuntime, process, log)
- `forge.lthn.ai/core/go-mlx` — Native Metal GPU inference
- `github.com/marcboeker/go-duckdb` — Embedded analytics DB
- `github.com/parquet-go/parquet-go` — Columnar data format
## Key Interfaces
### Key Types
```go
// Current backend interface (inference.go)
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
@ -74,23 +164,39 @@ type Backend interface {
	Available() bool
}

// StreamingBackend — extends Backend with token streaming
type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}

type GenOpts struct {
	Temperature float64
	MaxTokens   int
	Model       string
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
```
## Coding Standards
- **UK English**: colour, organisation, centre
- **Tests**: testify assert/require (existing), Pest-style names welcome for new tests
- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(mlx):`
- **Co-Author**: `Co-Authored-By: Virgil <virgil@lethean.io>`
- **Licence**: EUPL-1.2
- **Imports**: stdlib → forge.lthn.ai → third-party, each group separated by blank line
## Forge
- **Repo**: `forge.lthn.ai/core/go-ml`
- **Push via SSH**: `git push forge main` (remote: `ssh://git@forge.lthn.ai:2223/core/go-ml.git`)
## Task Queue
See `TODO.md` for prioritised work. Phase 1 (go-inference migration) is the critical path.
See `FINDINGS.md` for research notes and interface mapping.

FINDINGS.md

@ -15,6 +15,7 @@ Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/c
### Dependencies
- `forge.lthn.ai/core/go-mlx` — Metal GPU inference (backend_mlx.go, darwin/arm64 only)
- `forge.lthn.ai/core/go-inference` — Shared TextModel/Backend/Token interfaces (target for Phase 1)
- `forge.lthn.ai/core/go` — Framework services, process management, logging
- `github.com/marcboeker/go-duckdb` — Analytics storage
- `github.com/parquet-go/parquet-go` — Columnar data I/O
@ -22,79 +23,108 @@ Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/c
### Consumers
- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools
- `go-ai/test-mlx.go` — Integration test utility
- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools (uses `ml.Service`, `ml.GenOpts`, `ml.Backend`)
- LEM Lab — Uses MLXBackend for chat inference
- go-i18n Phase 2a — Needs 5K sentences/sec Gemma3-1B classification (blocked on go-inference)
## go-inference Interface Mapping
### Type Correspondence
| go-ml | go-inference | Notes |
|-------|-------------|-------|
| `ml.Backend` | `inference.Backend` | Different semantics: ml returns string, inference returns TextModel |
| `ml.StreamingBackend` | (built into TextModel) | iter.Seq[Token] is inherently streaming |
| `ml.GenOpts` | `inference.GenerateConfig` | Use functional options: `WithMaxTokens(n)` etc. |
| `ml.Message` | `inference.Message` | Identical struct: Role + Content |
| `ml.TokenCallback` | (not needed) | iter.Seq[Token] replaces callbacks |
| (no equivalent) | `inference.Token` | `{ID int32, Text string}` |
| (no equivalent) | `inference.TextModel` | Generate/Chat return iter.Seq[Token] |
### Method Mapping
```
ml.Backend.Generate(ctx, prompt, GenOpts) → (string, error)
        ↕ InferenceAdapter collects tokens
inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]

ml.StreamingBackend.GenerateStream(ctx, prompt, opts, TokenCallback) → error
        ↕ InferenceAdapter forwards tokens to callback
inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]

ml.GenOpts{Temperature: 0.7, MaxTokens: 2048}
        ↕ convertOpts helper
inference.WithTemperature(0.7), inference.WithMaxTokens(2048)
```
### backend_mlx.go Before/After
**Before** (253 LOC — BROKEN, old subpackage imports):
```go
import (
	"forge.lthn.ai/core/go-mlx"
	"forge.lthn.ai/core/go-mlx/cache"     // REMOVED
	"forge.lthn.ai/core/go-mlx/model"     // REMOVED
	"forge.lthn.ai/core/go-mlx/sample"    // REMOVED
	"forge.lthn.ai/core/go-mlx/tokenizer" // REMOVED
)

type MLXBackend struct {
	model   model.Model
	tok     *tokenizer.Tokenizer
	caches  []cache.Cache
	sampler sample.Sampler
	// ... manual tokenisation, KV cache mgmt, sampling loop, memory cleanup
}
```
**After** (~60 LOC — uses go-inference + InferenceAdapter):
```go
import (
	"forge.lthn.ai/core/go-inference"
	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init()
)

func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath)
	if err != nil {
		return nil, fmt.Errorf("mlx: %w", err)
	}
	return &InferenceAdapter{model: m, name: "mlx"}, nil
}
```
All tokenisation, KV cache, sampling, and memory management is now handled inside go-mlx's `internal/metal/` package, accessed through the go-inference `TextModel` interface.
## Scoring Engine Architecture
### 5 Suites
| Suite | Method | LLM needed? | Metrics |
|-------|--------|-------------|---------|
| **Heuristic** | Regex + word analysis | No | 9 metrics → LEK composite |
| **Semantic** | LLM-as-judge | Yes | 4 dimensions (sovereignty, ethical, creative, self-concept) |
| **Content** | LLM-as-judge | Yes | 6 sovereignty probes (CCP, truth, engagement, etc.) |
| **Standard** | LLM-as-judge | Yes | TruthfulQA, DoNotAnswer, Toxigen |
| **Exact** | Numeric extraction | No | GSM8K answer matching |
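The Exact suite's numeric-extraction approach can be sketched as follows — a hypothetical illustration of the technique, not the actual `exact.go` code (helper names `extractAnswer`/`exactMatch` are invented here):

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
	"strings"
)

// Matches integers and decimals, with optional comma grouping (e.g. "1,234.5").
var numRe = regexp.MustCompile(`-?\d+(?:,\d{3})*(?:\.\d+)?`)

// extractAnswer pulls the last number from model output, GSM8K-style:
// the final number in the response is usually the answer.
func extractAnswer(output string) (float64, bool) {
	matches := numRe.FindAllString(output, -1)
	if len(matches) == 0 {
		return 0, false
	}
	last := strings.ReplaceAll(matches[len(matches)-1], ",", "")
	v, err := strconv.ParseFloat(last, 64)
	return v, err == nil
}

// exactMatch compares the extracted answer against ground truth with tolerance.
func exactMatch(output string, truth float64) bool {
	v, ok := extractAnswer(output)
	return ok && math.Abs(v-truth) < 1e-6
}

func main() {
	fmt.Println(exactMatch("So the total is 18 eggs, i.e. $18.", 18)) // true
}
```

Taking the last number rather than the first tolerates chain-of-thought output, where intermediate arithmetic precedes the final answer.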
### LEK Score Formula
```
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
- ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
```
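In code form the composite is a weighted sum; a sketch with the weights from the formula above (the struct and field names are hypothetical, not the actual `heuristic.go` types, and metric values are assumed normalised to [0,1]):

```go
package main

import "fmt"

// HeuristicMetrics holds per-response signal scores in [0,1].
// Field names are hypothetical; the weights mirror the LEK formula.
type HeuristicMetrics struct {
	EngagementDepth   float64
	CreativeForm      float64
	EmotionalRegister float64
	FirstPerson       float64
	ComplianceMarkers float64
	FormulaicPreamble float64
	Degeneration      float64
	EmptyBroken       float64
}

// LEK computes the composite: positive signals weighted up,
// negative signals (RLHF artefacts, degeneration) weighted down hard.
func (m HeuristicMetrics) LEK() float64 {
	return m.EngagementDepth*2 + m.CreativeForm*3 + m.EmotionalRegister*2 + m.FirstPerson*1.5 -
		m.ComplianceMarkers*5 - m.FormulaicPreamble*3 - m.Degeneration*4 - m.EmptyBroken*20
}

func main() {
	m := HeuristicMetrics{EngagementDepth: 1, CreativeForm: 1, EmotionalRegister: 1, FirstPerson: 1}
	fmt.Println(m.LEK()) // 8.5 — the maximum with no negative signals
}
```

The −20 weight on empty/broken output means that signal alone dominates: any response failing that check scores far below any well-formed one.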
Positive signals: engagement depth, creative form, emotional register, first-person voice.
Negative signals: RLHF compliance markers, formulaic preambles, text degeneration, empty/broken output.
### 23 Capability Probes
16 categories covering: reasoning, mathematics, coding, instruction following, multilingual, summarisation, creative writing, factual recall, safety, ethics, roleplay, context length, tool use, multimodal description, structured output, and chain-of-thought.
### InfluxDB Integration
- Endpoint: `10.69.69.165:8181`
- Database: `training`
- Protocol: Line protocol writes (hand-rolled, no official client)
- Purpose: Streaming checkpoint scores during agent evaluation runs
### Data Pipeline
DuckDB for local analytics storage, Parquet for columnar I/O, InfluxDB for time-series streaming. GGUF converter handles MLX LoRA to GGUF tensor name mapping for model format conversion.
### Concurrency Model
`Engine.ScoreAll()` fans out goroutines bounded by semaphore (`concurrency` setting). Heuristic runs inline (instant). Semantic/content/standard run via worker pool with `sync.WaitGroup`. Results collected into `[]PromptScore` via mutex.
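The semaphore-bounded fan-out pattern used by the scoring engine can be sketched as below — a generic illustration, not the actual `score.go` code (here each goroutine writes its own result slot, so no mutex is needed; the real engine collects `[]PromptScore` under a mutex):

```go
package main

import (
	"fmt"
	"sync"
)

// scoreAll runs score() over prompts concurrently, bounded by a
// buffered-channel semaphore of size `concurrency`.
func scoreAll(prompts []string, concurrency int, score func(string) float64) []float64 {
	sem := make(chan struct{}, concurrency)
	results := make([]float64, len(prompts))
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks when `concurrency` workers are running
		go func(i int, p string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = score(p)    // each goroutine owns its index, so no lock required
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	out := scoreAll([]string{"a", "bb", "ccc"}, 2, func(s string) float64 { return float64(len(s)) })
	fmt.Println(out) // [1 2 3]
}
```

Acquiring the semaphore before spawning the goroutine keeps the number of in-flight workers (not just running ones) bounded, which matters when each worker holds a model connection.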
## Known Issues
- **backend_mlx.go imports dead subpackages** — Blocked on Phase 1 migration
- **agent.go too large** — 1,070 LOC, SSH + InfluxDB + scoring + publishing mixed together
- **Hardcoded infrastructure** — InfluxDB endpoint `10.69.69.165:8181`, M3 SSH details in agent.go
- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists
- **score.go concurrency untested** — No race condition tests
- **Message type duplication** — `ml.Message` and `inference.Message` are identical but separate

TODO.md

@ -1,46 +1,132 @@
# TODO.md — go-ml Task Queue
Dispatched from Virgil in core/go. Pick up tasks in phase order.
---
## Phase 1: go-inference Migration (CRITICAL PATH)
Everything downstream is blocked on this. The old `backend_mlx.go` imports go-mlx subpackages that no longer exist after Phase 4 refactoring.
### Step 1.1: Add go-inference dependency
- [ ] **Add `forge.lthn.ai/core/go-inference` to go.mod** — Already has a `replace` directive pointing to `../go-inference`. Run `go get forge.lthn.ai/core/go-inference` then `go mod tidy`. Verify the module resolves.
### Step 1.2: Write the InferenceAdapter
- [ ] **Create `adapter.go`** — Bridge between `go-inference.TextModel` (returns `iter.Seq[Token]`) and `ml.Backend` + `ml.StreamingBackend` (returns `string`/callback). Must implement:
- `Generate()` — collect tokens from iterator into string
- `Chat()` — same, using `TextModel.Chat()`
- `GenerateStream()` — forward tokens to `TokenCallback`
- `ChatStream()` — same for chat
- `Name()` — delegate to `TextModel.ModelType()`
- `Available()` — always true (model already loaded)
- `convertOpts(GenOpts) []inference.GenerateOption` — map `GenOpts` fields to functional options
**Key mapping**:
```
GenOpts.Temperature → inference.WithTemperature(float32(t))
GenOpts.MaxTokens   → inference.WithMaxTokens(n)
GenOpts.Model       → (ignored, model already loaded)
```
**Error handling**: After the iterator completes, check `model.Err()` to distinguish EOS from errors (OOM, ctx cancelled).
- [ ] **Test adapter.go** — Test with a mock `inference.TextModel` that yields predetermined tokens. Test cases:
- Normal generation (collect tokens → string)
- Streaming (each token hits callback)
- Callback error stops iteration
- Context cancellation propagates
- Empty output (EOS immediately)
- Model error after partial output
### Step 1.3: Rewrite backend_mlx.go
- [ ] **Replace backend_mlx.go** — Delete the 253 LOC that manually handle tokenisation, KV cache, sampling, and memory cleanup. Replace with ~60 LOC:
```go
//go:build darwin && arm64

package ml

import (
	"fmt"

	"forge.lthn.ai/core/go-inference"
	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend
)

func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath)
	if err != nil {
		return nil, fmt.Errorf("mlx: %w", err)
	}
	return &InferenceAdapter{model: m, name: "mlx"}, nil
}
```
The `InferenceAdapter` from Step 1.2 handles all the Generate/Chat/Stream logic.
- [ ] **Preserve memory controls** — The old `MLXBackend` set cache/memory limits (16GB/24GB). These should be configurable. Options:
- Accept memory limits in `NewMLXBackend` params
- Or set them in `InferenceAdapter` wrapper
- go-mlx exposes `SetCacheLimit()` / `SetMemoryLimit()` at package level
- [ ] **Test backend_mlx.go** — Verify the new backend can:
- Load a model via go-inference registry
- Generate text (smoke test, requires model on disk)
- Stream tokens via callback
- Handle Metal availability check (build tag gating)
### Step 1.4: HTTPBackend and LlamaBackend wrappers
- [ ] **HTTPBackend go-inference wrapper** — HTTPBackend already works fine as `ml.Backend`. For go-inference compatibility, write a thin wrapper that implements `inference.TextModel`:
- `Generate()` calls HTTP API, yields entire response as single Token
- `Chat()` same
- This is lower priority than MLX — HTTP backends don't need the full iter.Seq pattern
- Consider SSE streaming: `/v1/chat/completions` with `"stream": true` returns SSE events that CAN be yielded as `iter.Seq[Token]`
- [ ] **LlamaBackend go-inference wrapper** — LlamaBackend delegates to HTTPBackend already. Same treatment.
### Step 1.5: Verify downstream consumers
- [ ] **Service.Generate() still works** — `service.go` calls `Backend.Generate()`. After migration, backends wrapped in `InferenceAdapter` must still satisfy `ml.Backend`.
- [ ] **Judge still works** — `judge.go` uses `Backend.Generate()` for LLM-as-judge. Verify scoring pipeline runs end-to-end.
- [ ] **go-ai tools_ml.go** — Uses `ml.Service` directly. No code changes needed in go-ai if `ml.Backend` interface is preserved.
---
## Phase 2: Backend Consolidation
After Phase 1, both `ml.Backend` (string) and `inference.TextModel` (iterator) coexist. Reconcile.
- [ ] **Audit StreamingBackend usage** — Find all callers of `GenerateStream`/`ChatStream`. Determine which can migrate to `iter.Seq[Token]`.
- [ ] **Deprecate StreamingBackend** — Once all callers use go-inference iterators, mark StreamingBackend as deprecated.
- [ ] **Unify GenOpts** — `ml.GenOpts` and `inference.GenerateConfig` overlap. Add `convertOpts()` in Phase 1, consolidate into one struct later.
- [ ] **Unify Message types** — `ml.Message` and `inference.Message` are identical structs. Consider type alias or shared import.
---
## Phase 3: Agent Loop Modernisation
`agent.go` (1,070 LOC) is the largest file. Decompose.
- [ ] **Split agent.go** — Into: `agent_config.go` (config, model maps), `agent_execute.go` (run loop, checkpoint processing), `agent_eval.go` (probe evaluation, result publishing), `agent_influx.go` (InfluxDB streaming, JSONL buffer).
- [ ] **Abstract SSH transport** — Extract SSH checkpoint discovery into interface. Current M3 homelab SSH may change to Linux (go-rocm).
- [ ] **Configurable endpoints** — `10.69.69.165:8181` and M3 SSH details hardcoded. Move to config/environment.
- [ ] **InfluxDB client** — Hand-rolled line protocol. Evaluate official InfluxDB Go client.
---
## Phase 4: Test Coverage
- [ ] **backend_llama_test.go** — Mock llama-server subprocess. Test: model loading, health checks, process lifecycle.
- [ ] **backend_mlx_test.go** — After Phase 1 rewrite, test with mock go-inference TextModel.
- [ ] **score.go race tests** — `go test -race ./...`. Concurrent scoring, semaphore boundaries, context cancellation.
- [ ] **Benchmark suite** — `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes.
---
## Workflow
1. Virgil in core/go writes tasks here after research
2. This repo's session picks up tasks in phase order
3. Mark `[x]` when done, note commit hash
4. New discoveries → add tasks, note in FINDINGS.md
5. Push to forge after each completed step: `git push forge main`

go.mod

@ -31,6 +31,8 @@ require (
golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da // indirect
)
replace forge.lthn.ai/core/go => ../host-uk/core
replace forge.lthn.ai/core/go-mlx => ../go-mlx
replace forge.lthn.ai/core/go-inference => ../go-inference