docs: comprehensive domain expert session brief for go-inference migration

Rewrites CLAUDE.md with full interface mapping (ml.Backend → inference.TextModel),
adapter design pattern, broken import context, and local dependency paths.

Expands TODO.md Phase 1 into 5 concrete steps with code patterns for
InferenceAdapter, backend_mlx.go rewrite, and downstream verification.

Updates FINDINGS.md with type correspondence table and before/after comparison.

Fixes go.mod replace directives for ~/Code/ layout and adds go-inference.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider 2026-02-19 21:55:50 +00:00
parent 88e926cc24
commit 84757b8331
4 changed files with 370 additions and 146 deletions

CLAUDE.md

@ -1,72 +1,162 @@
# CLAUDE.md — go-ml Domain Expert Guide
## What This Is
You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates your work via TODO.md. Pick up tasks in phase order, mark `[x]` when done, commit and push.
## What This Package Does
ML inference backends, scoring engine, and agent orchestrator. 7.5K LOC across 41 Go files. Provides:
- **Pluggable inference backends** — MLX/Metal (darwin/arm64), llama.cpp (subprocess), HTTP/Ollama (OpenAI-compatible)
- **Multi-suite scoring engine** — Heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K)
- **23 capability probes** — Binary pass/fail tests across 16 categories (math, logic, code, etc.)
- **GGUF model management** — Format parsing, conversion, inventory
- **Agent orchestrator** — SSH checkpoint discovery, InfluxDB streaming, batch evaluation
## Critical Context: go-inference Migration
**This is the #1 priority.** Phase 1 in TODO.md.
The package currently defines its own `Backend` interface that returns `(string, error)`. The shared `go-inference` package defines `TextModel` which returns `iter.Seq[Token]` (Go 1.23+ range-over-func). Everything downstream is blocked until go-ml bridges these two interfaces.
### Interface Gap
```
go-ml (CURRENT)                           go-inference (TARGET)
───────────────                           ─────────────────────
Backend.Generate(ctx, prompt, GenOpts)    TextModel.Generate(ctx, prompt, ...GenerateOption)
  → (string, error)                         → iter.Seq[Token]

Backend.Chat(ctx, messages, GenOpts)      TextModel.Chat(ctx, messages, ...GenerateOption)
  → (string, error)                         → iter.Seq[Token]

StreamingBackend.GenerateStream(          (streaming is built-in via iter.Seq)
  ctx, prompt, opts, TokenCallback)
  → error

GenOpts{Temperature, MaxTokens, Model}    GenerateConfig{MaxTokens, Temperature,
                                            TopK, TopP, StopTokens, RepeatPenalty}
                                          (configured via WithMaxTokens(n) etc.)
```
### What the Adapter Must Do
```go
// InferenceAdapter wraps go-inference.TextModel to satisfy ml.Backend + ml.StreamingBackend.
// This is the bridge between the new iterator-based API and the legacy string-return API.
type InferenceAdapter struct {
	model inference.TextModel
}

// Generate collects all tokens from the iterator into a string.
func (a *InferenceAdapter) Generate(ctx context.Context, prompt string, opts GenOpts) (string, error) {
	genOpts := convertOpts(opts) // GenOpts → []inference.GenerateOption
	var buf strings.Builder
	for tok := range a.model.Generate(ctx, prompt, genOpts...) {
		buf.WriteString(tok.Text)
	}
	if err := a.model.Err(); err != nil {
		return buf.String(), err
	}
	return buf.String(), nil
}

// GenerateStream yields tokens to the callback as they arrive.
func (a *InferenceAdapter) GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error {
	genOpts := convertOpts(opts)
	for tok := range a.model.Generate(ctx, prompt, genOpts...) {
		if err := cb(tok.Text); err != nil {
			return err
		}
	}
	return a.model.Err()
}
```
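The `convertOpts` helper referenced above is the remaining piece. A minimal sketch of the mapping, using local stand-in types for the go-inference functional options (the `GenerateOption`/`WithTemperature`/`WithMaxTokens` names follow TODO.md, but the stand-in definitions here are assumptions, not the real package):

```go
package main

import "fmt"

// Stand-ins for the go-inference option types (names assumed from TODO.md).
type GenerateConfig struct {
	Temperature float32
	MaxTokens   int
}
type GenerateOption func(*GenerateConfig)

func WithTemperature(t float32) GenerateOption { return func(c *GenerateConfig) { c.Temperature = t } }
func WithMaxTokens(n int) GenerateOption       { return func(c *GenerateConfig) { c.MaxTokens = n } }

// Legacy options struct from go-ml.
type GenOpts struct {
	Temperature float64
	MaxTokens   int
	Model       string
}

// convertOpts maps legacy GenOpts fields onto functional options.
// GenOpts.Model is ignored: the TextModel is already bound to a loaded model.
func convertOpts(opts GenOpts) []GenerateOption {
	var out []GenerateOption
	if opts.Temperature > 0 {
		out = append(out, WithTemperature(float32(opts.Temperature)))
	}
	if opts.MaxTokens > 0 {
		out = append(out, WithMaxTokens(opts.MaxTokens))
	}
	return out
}

func main() {
	cfg := GenerateConfig{}
	for _, o := range convertOpts(GenOpts{Temperature: 0.7, MaxTokens: 2048}) {
		o(&cfg)
	}
	fmt.Println(cfg.Temperature, cfg.MaxTokens)
}
```

Only non-zero fields produce options, so a zero-value `GenOpts` falls through to the model's defaults.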
### backend_mlx.go Is Broken
After go-mlx Phase 4, the old subpackage imports no longer exist:
- `forge.lthn.ai/core/go-mlx/cache` — **REMOVED** (now `internal/metal`)
- `forge.lthn.ai/core/go-mlx/model` — **REMOVED** (now `internal/metal`)
- `forge.lthn.ai/core/go-mlx/sample` — **REMOVED** (now `internal/metal`)
- `forge.lthn.ai/core/go-mlx/tokenizer` — **REMOVED** (now `internal/metal`)
The new go-mlx public API is:
```go
import (
	"forge.lthn.ai/core/go-inference"
	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init()
)

m, err := inference.LoadModel("/path/to/model/", inference.WithContextLen(4096))
if err != nil {
	// handle load failure
}
defer m.Close()
for tok := range m.Generate(ctx, "prompt", inference.WithMaxTokens(128)) {
	fmt.Print(tok.Text)
}
```
**The rewrite**: Delete the 253 LOC of manual tokenisation/KV cache/sampling. Replace with ~60 LOC that loads via go-inference and wraps in `InferenceAdapter`.
## Commands
```bash
go test ./... # Run all tests (some will fail until Phase 1)
go test -v -run TestHeuristic # Single test
go test -bench=. ./... # Benchmarks (none exist yet)
go test -race ./... # Race detector
go vet ./... # Static analysis
```
**Note**: `backend_mlx.go` won't compile until rewritten (Phase 1). Its `//go:build darwin && arm64` constraint means non-darwin builds skip it automatically (`go test -tags '!darwin'` is not valid — the `-tags` flag cannot negate tags). To type-check the rest of the package without the MLX file on a Mac, cross-target another OS:
```bash
GOOS=linux go vet ./...   # excludes darwin/arm64-gated files
```
## Local Dependencies
All resolve via `replace` directives in go.mod:
| Module | Local Path | Notes |
|--------|-----------|-------|
| `forge.lthn.ai/core/go` | `../host-uk/core` | Framework (ServiceRuntime, process, log) |
| `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend (darwin/arm64 only) |
| `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Backend interfaces |
## Architecture
### Backends (pluggable inference)
| File | Backend | Status |
|------|---------|--------|
| `backend_mlx.go` | MLX/Metal GPU | **BROKEN** — old imports, needs Phase 1 rewrite |
| `backend_llama.go` | llama-server subprocess | Works, needs go-inference wrapper |
| `backend_http.go` | HTTP API (OpenAI-compatible) | Works, needs go-inference wrapper |
| `ollama.go` | Ollama helpers | Works |
### Scoring Engine
| File | LOC | Purpose |
|------|-----|---------|
| `score.go` | 212 | Concurrent scoring orchestrator (semaphore-bounded workers) |
| `heuristic.go` | 258 | 9 regex-based metrics, LEK composite score |
| `judge.go` | 205 | LLM-as-judge (6 scoring methods) |
| `exact.go` | 77 | GSM8K exact-match with numeric extraction |
| `probes.go` | 273 | 23 binary capability probes across 16 categories |
### Data Pipeline
| File | LOC | Purpose |
|------|-----|---------|
| `agent.go` | 1,070 | Scoring agent (SSH checkpoint discovery, InfluxDB) |
| `worker.go` | 403 | LEM API worker for distributed inference |
| `service.go` | 162 | Core framework integration (lifecycle, backend registry) |
| `ingest.go` | 384 | JSONL response loading |
| `db.go` | 258 | DuckDB analytics storage |
| `gguf.go` | 369 | GGUF model format parsing |
### Monitoring
| File | Purpose |
|------|---------|
| `metrics.go` | Metrics tracking |
| `influx.go` | InfluxDB integration |
| `status.go` | Status reporting |
## Dependencies
- `forge.lthn.ai/core/go` — Framework (ServiceRuntime, process, log)
- `forge.lthn.ai/core/go-mlx` — Native Metal GPU inference
- `github.com/marcboeker/go-duckdb` — Embedded analytics DB
- `github.com/parquet-go/parquet-go` — Columnar data format
## Key Interfaces
### Key Types
```go
// Current backend interface (inference.go)
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
@ -74,23 +164,39 @@ type Backend interface {
	Available() bool
}

// StreamingBackend — extends Backend with token streaming
type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}

type GenOpts struct {
	Temperature float64
	MaxTokens   int
	Model       string
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
```
## Coding Standards
- **UK English**: colour, organisation, centre
- **Tests**: testify assert/require (existing), Pest-style names welcome for new tests
- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(mlx):`
- **Co-Author**: `Co-Authored-By: Virgil <virgil@lethean.io>`
- **Licence**: EUPL-1.2
- **Imports**: stdlib → forge.lthn.ai → third-party, each group separated by blank line
## Forge
- **Repo**: `forge.lthn.ai/core/go-ml`
- **Push via SSH**: `git push forge main` (remote: `ssh://git@forge.lthn.ai:2223/core/go-ml.git`)
## Task Queue
See `TODO.md` for prioritised work. Phase 1 (go-inference migration) is the critical path.
See `FINDINGS.md` for research notes and interface mapping.

FINDINGS.md

@ -15,6 +15,7 @@ Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/c
### Dependencies
- `forge.lthn.ai/core/go-mlx` — Metal GPU inference (backend_mlx.go, darwin/arm64 only)
- `forge.lthn.ai/core/go-inference` — Shared TextModel/Backend/Token interfaces (target for Phase 1)
- `forge.lthn.ai/core/go` — Framework services, process management, logging
- `github.com/marcboeker/go-duckdb` — Analytics storage
- `github.com/parquet-go/parquet-go` — Columnar data I/O
@ -22,79 +23,108 @@ Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/c
### Consumers
- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools
- `go-ai/test-mlx.go` — Integration test utility
- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools (uses `ml.Service`, `ml.GenOpts`, `ml.Backend`)
- LEM Lab — Uses MLXBackend for chat inference
- go-i18n Phase 2a — Needs 5K sentences/sec Gemma3-1B classification (blocked on go-inference)
## go-inference Interface Mapping
### Type Correspondence
| go-ml | go-inference | Notes |
|-------|-------------|-------|
| `ml.Backend` | `inference.Backend` | Different semantics: ml returns string, inference returns TextModel |
| `ml.StreamingBackend` | (built into TextModel) | iter.Seq[Token] is inherently streaming |
| `ml.GenOpts` | `inference.GenerateConfig` | Use functional options: `WithMaxTokens(n)` etc. |
| `ml.Message` | `inference.Message` | Identical struct: Role + Content |
| `ml.TokenCallback` | (not needed) | iter.Seq[Token] replaces callbacks |
| (no equivalent) | `inference.Token` | `{ID int32, Text string}` |
| (no equivalent) | `inference.TextModel` | Generate/Chat return iter.Seq[Token] |
### Method Mapping
```
ml.Backend.Generate(ctx, prompt, GenOpts) → (string, error)
        ↕ InferenceAdapter collects tokens
inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]

ml.StreamingBackend.GenerateStream(ctx, prompt, opts, TokenCallback) → error
        ↕ InferenceAdapter forwards tokens to callback
inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]

ml.GenOpts{Temperature: 0.7, MaxTokens: 2048}
        ↕ convertOpts helper
inference.WithTemperature(0.7), inference.WithMaxTokens(2048)
```
### backend_mlx.go Before/After
**Before** (253 LOC — BROKEN, old subpackage imports):
```go
import (
	"forge.lthn.ai/core/go-mlx"
	"forge.lthn.ai/core/go-mlx/cache"     // REMOVED
	"forge.lthn.ai/core/go-mlx/model"     // REMOVED
	"forge.lthn.ai/core/go-mlx/sample"    // REMOVED
	"forge.lthn.ai/core/go-mlx/tokenizer" // REMOVED
)

type MLXBackend struct {
	model   model.Model
	tok     *tokenizer.Tokenizer
	caches  []cache.Cache
	sampler sample.Sampler
	// ... manual tokenisation, KV cache mgmt, sampling loop, memory cleanup
}
```
**After** (~60 LOC — uses go-inference + InferenceAdapter):
```go
import (
	"forge.lthn.ai/core/go-inference"
	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init()
)

func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath)
	if err != nil {
		return nil, fmt.Errorf("mlx: %w", err)
	}
	return &InferenceAdapter{model: m, name: "mlx"}, nil
}
```
All tokenisation, KV cache, sampling, and memory management is now handled inside go-mlx's `internal/metal/` package, accessed through the go-inference `TextModel` interface.
## Scoring Engine Architecture
### 5 Suites
| Suite | Method | LLM needed? | Metrics |
|-------|--------|-------------|---------|
| **Heuristic** | Regex + word analysis | No | 9 metrics → LEK composite |
| **Semantic** | LLM-as-judge | Yes | 4 dimensions (sovereignty, ethical, creative, self-concept) |
| **Content** | LLM-as-judge | Yes | 6 sovereignty probes (CCP, truth, engagement, etc.) |
| **Standard** | LLM-as-judge | Yes | TruthfulQA, DoNotAnswer, Toxigen |
| **Exact** | Numeric extraction | No | GSM8K answer matching |
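The Exact suite's numeric-extraction approach can be sketched as follows — a hypothetical illustration of the technique, not the actual `exact.go` code (helper names `extractAnswer`/`exactMatch` are invented here):

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
	"strings"
)

// Matches integers and decimals, with optional comma grouping (e.g. "1,234.5").
var numRe = regexp.MustCompile(`-?\d+(?:,\d{3})*(?:\.\d+)?`)

// extractAnswer pulls the last number from model output, GSM8K-style:
// the final number in the response is usually the answer.
func extractAnswer(output string) (float64, bool) {
	matches := numRe.FindAllString(output, -1)
	if len(matches) == 0 {
		return 0, false
	}
	last := strings.ReplaceAll(matches[len(matches)-1], ",", "")
	v, err := strconv.ParseFloat(last, 64)
	return v, err == nil
}

// exactMatch compares the extracted answer against ground truth with tolerance.
func exactMatch(output string, truth float64) bool {
	v, ok := extractAnswer(output)
	return ok && math.Abs(v-truth) < 1e-6
}

func main() {
	fmt.Println(exactMatch("So the total is 18 eggs, i.e. $18.", 18)) // true
}
```

Taking the last number rather than the first tolerates chain-of-thought output, where intermediate arithmetic precedes the final answer.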
### LEK Score Formula
```
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
- ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
```
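In code form the composite is a weighted sum; a sketch with the weights from the formula above (the struct and field names are hypothetical, not the actual `heuristic.go` types, and metric values are assumed normalised to [0,1]):

```go
package main

import "fmt"

// HeuristicMetrics holds per-response signal scores in [0,1].
// Field names are hypothetical; the weights mirror the LEK formula.
type HeuristicMetrics struct {
	EngagementDepth   float64
	CreativeForm      float64
	EmotionalRegister float64
	FirstPerson       float64
	ComplianceMarkers float64
	FormulaicPreamble float64
	Degeneration      float64
	EmptyBroken       float64
}

// LEK computes the composite: positive signals weighted up,
// negative signals (RLHF artefacts, degeneration) weighted down hard.
func (m HeuristicMetrics) LEK() float64 {
	return m.EngagementDepth*2 + m.CreativeForm*3 + m.EmotionalRegister*2 + m.FirstPerson*1.5 -
		m.ComplianceMarkers*5 - m.FormulaicPreamble*3 - m.Degeneration*4 - m.EmptyBroken*20
}

func main() {
	m := HeuristicMetrics{EngagementDepth: 1, CreativeForm: 1, EmotionalRegister: 1, FirstPerson: 1}
	fmt.Println(m.LEK()) // 8.5 — the maximum with no negative signals
}
```

The −20 weight on empty/broken output means that signal alone dominates: any response failing that check scores far below any well-formed one.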
Positive signals: engagement depth, creative form, emotional register, first-person voice.
Negative signals: RLHF compliance markers, formulaic preambles, text degeneration, empty/broken output.
### 23 Capability Probes
16 categories covering: reasoning, mathematics, coding, instruction following, multilingual, summarisation, creative writing, factual recall, safety, ethics, roleplay, context length, tool use, multimodal description, structured output, and chain-of-thought.
### InfluxDB Integration
- Endpoint: `10.69.69.165:8181`
- Database: `training`
- Protocol: Line protocol writes (hand-rolled, no official client)
- Purpose: Streaming checkpoint scores during agent evaluation runs
### Data Pipeline
DuckDB for local analytics storage, Parquet for columnar I/O, InfluxDB for time-series streaming. GGUF converter handles MLX LoRA to GGUF tensor name mapping for model format conversion.
### Concurrency Model
`Engine.ScoreAll()` fans out goroutines bounded by semaphore (`concurrency` setting). Heuristic runs inline (instant). Semantic/content/standard run via worker pool with `sync.WaitGroup`. Results collected into `[]PromptScore` via mutex.
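The semaphore-bounded fan-out pattern used by the scoring engine can be sketched as below — a generic illustration, not the actual `score.go` code (here each goroutine writes its own result slot, so no mutex is needed; the real engine collects `[]PromptScore` under a mutex):

```go
package main

import (
	"fmt"
	"sync"
)

// scoreAll runs score() over prompts concurrently, bounded by a
// buffered-channel semaphore of size `concurrency`.
func scoreAll(prompts []string, concurrency int, score func(string) float64) []float64 {
	sem := make(chan struct{}, concurrency)
	results := make([]float64, len(prompts))
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks when `concurrency` workers are running
		go func(i int, p string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = score(p)    // each goroutine owns its index, so no lock required
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	out := scoreAll([]string{"a", "bb", "ccc"}, 2, func(s string) float64 { return float64(len(s)) })
	fmt.Println(out) // [1 2 3]
}
```

Acquiring the semaphore before spawning the goroutine keeps the number of in-flight workers (not just running ones) bounded, which matters when each worker holds a model connection.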
## Known Issues
- **backend_mlx.go imports dead subpackages** — Blocked on Phase 1 migration
- **agent.go too large** — 1,070 LOC, SSH + InfluxDB + scoring + publishing mixed together
- **Hardcoded infrastructure** — InfluxDB endpoint `10.69.69.165:8181`, M3 SSH details in agent.go
- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists
- **score.go concurrency untested** — No race condition tests
- **Message type duplication** — `ml.Message` and `inference.Message` are identical but separate

TODO.md

@ -1,46 +1,132 @@
# TODO.md — go-ml Task Queue
Dispatched from Virgil in core/go. Pick up tasks in phase order.
---
## Phase 1: go-inference Migration (CRITICAL PATH)
Everything downstream is blocked on this. The old `backend_mlx.go` imports go-mlx subpackages that no longer exist after Phase 4 refactoring.
### Step 1.1: Add go-inference dependency
- [ ] **Add `forge.lthn.ai/core/go-inference` to go.mod** — Already has a `replace` directive pointing to `../go-inference`. Run `go get forge.lthn.ai/core/go-inference` then `go mod tidy`. Verify the module resolves.
### Step 1.2: Write the InferenceAdapter
- [ ] **Create `adapter.go`** — Bridge between `go-inference.TextModel` (returns `iter.Seq[Token]`) and `ml.Backend` + `ml.StreamingBackend` (returns `string`/callback). Must implement:
- `Generate()` — collect tokens from iterator into string
- `Chat()` — same, using `TextModel.Chat()`
- `GenerateStream()` — forward tokens to `TokenCallback`
- `ChatStream()` — same for chat
- `Name()` — delegate to `TextModel.ModelType()`
- `Available()` — always true (model already loaded)
- `convertOpts(GenOpts) []inference.GenerateOption` — map `GenOpts` fields to functional options
**Key mapping**:
```
GenOpts.Temperature → inference.WithTemperature(float32(t))
GenOpts.MaxTokens   → inference.WithMaxTokens(n)
GenOpts.Model       → (ignored, model already loaded)
```
**Error handling**: After the iterator completes, check `model.Err()` to distinguish EOS from errors (OOM, ctx cancelled).
- [ ] **Test adapter.go** — Test with a mock `inference.TextModel` that yields predetermined tokens. Test cases:
- Normal generation (collect tokens → string)
- Streaming (each token hits callback)
- Callback error stops iteration
- Context cancellation propagates
- Empty output (EOS immediately)
- Model error after partial output
### Step 1.3: Rewrite backend_mlx.go
- [ ] **Replace backend_mlx.go** — Delete the 253 LOC that manually handle tokenisation, KV cache, sampling, and memory cleanup. Replace with ~60 LOC:
```go
//go:build darwin && arm64

package ml

import (
	"fmt"

	"forge.lthn.ai/core/go-inference"
	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend
)

func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath)
	if err != nil {
		return nil, fmt.Errorf("mlx: %w", err)
	}
	return &InferenceAdapter{model: m, name: "mlx"}, nil
}
```
The `InferenceAdapter` from Step 1.2 handles all the Generate/Chat/Stream logic.
- [ ] **Preserve memory controls** — The old `MLXBackend` set cache/memory limits (16GB/24GB). These should be configurable. Options:
- Accept memory limits in `NewMLXBackend` params
- Or set them in `InferenceAdapter` wrapper
- go-mlx exposes `SetCacheLimit()` / `SetMemoryLimit()` at package level
- [ ] **Test backend_mlx.go** — Verify the new backend can:
- Load a model via go-inference registry
- Generate text (smoke test, requires model on disk)
- Stream tokens via callback
- Handle Metal availability check (build tag gating)
### Step 1.4: HTTPBackend and LlamaBackend wrappers
- [ ] **HTTPBackend go-inference wrapper** — HTTPBackend already works fine as `ml.Backend`. For go-inference compatibility, write a thin wrapper that implements `inference.TextModel`:
- `Generate()` calls HTTP API, yields entire response as single Token
- `Chat()` same
- This is lower priority than MLX — HTTP backends don't need the full iter.Seq pattern
- Consider SSE streaming: `/v1/chat/completions` with `"stream": true` returns SSE events that CAN be yielded as `iter.Seq[Token]`
- [ ] **LlamaBackend go-inference wrapper** — LlamaBackend delegates to HTTPBackend already. Same treatment.
### Step 1.5: Verify downstream consumers
- [ ] **Service.Generate() still works** — `service.go` calls `Backend.Generate()`. After migration, backends wrapped in `InferenceAdapter` must still satisfy `ml.Backend`.
- [ ] **Judge still works** — `judge.go` uses `Backend.Generate()` for LLM-as-judge. Verify scoring pipeline runs end-to-end.
- [ ] **go-ai tools_ml.go** — Uses `ml.Service` directly. No code changes needed in go-ai if `ml.Backend` interface is preserved.
---
## Phase 2: Backend Consolidation
After Phase 1, both `ml.Backend` (string) and `inference.TextModel` (iterator) coexist. Reconcile.
- [ ] **Audit StreamingBackend usage** — Find all callers of `GenerateStream`/`ChatStream`. Determine which can migrate to `iter.Seq[Token]`.
- [ ] **Deprecate StreamingBackend** — Once all callers use go-inference iterators, mark StreamingBackend as deprecated.
- [ ] **Unify GenOpts** — `ml.GenOpts` and `inference.GenerateConfig` overlap. Add `convertOpts()` in Phase 1, consolidate into one struct later.
- [ ] **Unify Message types** — `ml.Message` and `inference.Message` are identical structs. Consider type alias or shared import.
---
## Phase 3: Agent Loop Modernisation
`agent.go` (1,070 LOC) is the largest file. Decompose.
- [ ] **Split agent.go** — Into: `agent_config.go` (config, model maps), `agent_execute.go` (run loop, checkpoint processing), `agent_eval.go` (probe evaluation, result publishing), `agent_influx.go` (InfluxDB streaming, JSONL buffer).
- [ ] **Abstract SSH transport** — Extract SSH checkpoint discovery into interface. Current M3 homelab SSH may change to Linux (go-rocm).
- [ ] **Configurable endpoints** — `10.69.69.165:8181` and M3 SSH details hardcoded. Move to config/environment.
- [ ] **InfluxDB client** — Hand-rolled line protocol. Evaluate official InfluxDB Go client.
---
## Phase 4: Test Coverage
- [ ] **backend_llama_test.go** — Mock llama-server subprocess. Test: model loading, health checks, process lifecycle.
- [ ] **backend_mlx_test.go** — After Phase 1 rewrite, test with mock go-inference TextModel.
- [ ] **score.go race tests** — `go test -race ./...`. Concurrent scoring, semaphore boundaries, context cancellation.
- [ ] **Benchmark suite** — `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes.
---
## Workflow
1. Virgil in core/go writes tasks here after research
2. This repo's session picks up tasks in phase order
3. Mark `[x]` when done, note commit hash
4. New discoveries → add tasks, note in FINDINGS.md
5. Push to forge after each completed step: `git push forge main`

go.mod

@ -31,6 +31,8 @@ require (
golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da // indirect
)
replace forge.lthn.ai/core/go => ../host-uk/core
replace forge.lthn.ai/core/go-mlx => ../go-mlx
replace forge.lthn.ai/core/go-inference => ../go-inference