docs: graduate TODO/FINDINGS into production documentation

Replace internal task tracking (TODO.md, FINDINGS.md) with structured
documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider 2026-02-20 15:01:55 +00:00
parent 7075d7cbe7
commit 3918051112
6 changed files with 892 additions and 531 deletions

CLAUDE.md

@@ -1,43 +1,18 @@
# CLAUDE.md — go-ml Domain Expert Guide
# CLAUDE.md — go-ml Agent Guide
You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates your work via TODO.md. Pick up tasks in phase order, mark `[x]` when done, commit and push.
You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates work. Pick up tasks in phase order, mark `[x]` when done, commit and push.
## What This Package Does
ML inference backends, scoring engine, and agent orchestrator. 7.5K LOC across 41 Go files. Provides:
ML inference backends, scoring engine, and agent orchestrator. ~7,500 LOC across 41 Go files. Provides:
- **Pluggable inference backends** — MLX/Metal (darwin/arm64), llama.cpp (subprocess), HTTP/Ollama (OpenAI-compatible)
- **Multi-suite scoring engine** — Heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K)
- **23 capability probes** — Binary pass/fail tests across 16 categories (math, logic, code, etc.)
- **GGUF model management** — Format parsing, conversion, inventory
- **Multi-suite scoring engine** — heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K)
- **23 capability probes** — binary pass/fail tests across 16 categories
- **GGUF model management** — format parsing, conversion, inventory
- **Agent orchestrator** — SSH checkpoint discovery, InfluxDB streaming, batch evaluation
## Critical Context: go-inference Migration
**Phase 1 is complete.** Both directions of the bridge are implemented:
1. **Forward adapter** (`adapter.go`): `inference.TextModel` (iter.Seq) -> `ml.Backend`/`ml.StreamingBackend` (string/callback). Used by `backend_mlx.go` to wrap Metal GPU models.
2. **Reverse adapters** (`backend_http_textmodel.go`): `HTTPBackend`/`LlamaBackend` -> `inference.TextModel`. Enables HTTP and llama-server backends to be used anywhere that expects a go-inference TextModel.
### Interface Bridge (DONE)
```
ml.Backend (string) <──adapter.go──> inference.TextModel (iter.Seq[Token])
                    <──backend_http_textmodel.go──>
```
- `InferenceAdapter`: TextModel -> Backend + StreamingBackend (for MLX, ROCm, etc.)
- `HTTPTextModel`: HTTPBackend -> TextModel (for remote APIs)
- `LlamaTextModel`: LlamaBackend -> TextModel (for managed llama-server)
### backend_mlx.go (DONE)
Rewritten from 253 LOC to ~35 LOC. Loads via `inference.LoadModel()` and wraps in `InferenceAdapter`. Uses go-mlx's Metal backend registered via `init()`.
### Downstream Consumers Verified
- `service.go` — `Service.Generate()` calls `Backend.Generate()`. InferenceAdapter satisfies Backend. No changes needed.
- `judge.go` — `Judge.judgeChat()` calls `Backend.Generate()`. Same contract, works as before.
See `docs/architecture.md` for the full architecture reference.
## Commands
@@ -45,7 +20,7 @@ Rewritten from 253 LOC to ~35 LOC. Loads via `inference.LoadModel()` and wraps i
go mod download # FIRST RUN: populate go.sum
go test ./... # Run all tests
go test -v -run TestHeuristic # Single test
go test -bench=. ./... # Benchmarks (none exist yet)
go test -bench=. ./... # Benchmarks
go test -race ./... # Race detector
go vet ./... # Static analysis
```
@@ -56,103 +31,16 @@ All resolve via `replace` directives in go.mod:
| Module | Local Path | Notes |
|--------|-----------|-------|
| `forge.lthn.ai/core/go` | `../host-uk/core` | Framework (ServiceRuntime, process, log) |
| `forge.lthn.ai/core/go` | `../go` | Framework (ServiceRuntime, process, log) |
| `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend (darwin/arm64 only) |
| `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Backend interfaces |
## Architecture
### Backends (pluggable inference)
| File | Backend | Status |
|------|---------|--------|
| `adapter.go` | InferenceAdapter (TextModel -> Backend) | DONE — bridges go-inference to ml.Backend |
| `backend_mlx.go` | MLX/Metal GPU | DONE — uses go-inference LoadModel + InferenceAdapter |
| `backend_http.go` | HTTP API (OpenAI-compatible) | Works as ml.Backend |
| `backend_http_textmodel.go` | HTTPTextModel + LlamaTextModel | DONE — reverse wrappers (Backend -> TextModel) |
| `backend_llama.go` | llama-server subprocess | Works as ml.Backend |
| `ollama.go` | Ollama helpers | Works |
### Scoring Engine
| File | LOC | Purpose |
|------|-----|---------|
| `score.go` | 212 | Concurrent scoring orchestrator (semaphore-bounded workers) |
| `heuristic.go` | 258 | 9 regex-based metrics, LEK composite score |
| `judge.go` | 205 | LLM-as-judge (6 scoring methods) |
| `exact.go` | 77 | GSM8K exact-match with numeric extraction |
| `probes.go` | 273 | 23 binary capability probes across 16 categories |
### Data Pipeline
| File | LOC | Purpose |
|------|-----|---------|
| `agent.go` | 1,070 | Scoring agent (SSH checkpoint discovery, InfluxDB) |
| `worker.go` | 403 | LEM API worker for distributed inference |
| `service.go` | 162 | Core framework integration (lifecycle, backend registry) |
| `ingest.go` | 384 | JSONL response loading |
| `db.go` | 258 | DuckDB analytics storage |
| `gguf.go` | 369 | GGUF model format parsing |
### Backend Architecture
Two interface families coexist, bridged by adapters:
**`inference.TextModel`** (iterator-based) is the **preferred API** for new code. Returns `iter.Seq[inference.Token]` for streaming. Defined in `forge.lthn.ai/core/go-inference`. Use this for GPU backends (MLX Metal, ROCm) and any code that needs token-level control.
**`ml.Backend`** (string-based) is the **compatibility layer**, still fully supported. Returns complete strings. Used by `service.go`, `judge.go`, and external consumers like `host-uk/cli`.
**`ml.StreamingBackend`** is **deprecated**. New code should use `inference.TextModel` with `iter.Seq[Token]` directly. Retained for backward compatibility with existing callers.
**Adapters:**
| Adapter | Direction | File |
|---------|-----------|------|
| `InferenceAdapter` | `inference.TextModel` -> `ml.Backend` + `ml.StreamingBackend` | `adapter.go` |
| `HTTPTextModel` | `ml.HTTPBackend` -> `inference.TextModel` | `backend_http_textmodel.go` |
| `LlamaTextModel` | `ml.LlamaBackend` -> `inference.TextModel` | `backend_http_textmodel.go` |
**Unified types (Phase 2):**
- `ml.Message` is a type alias for `inference.Message` — the types are identical, no conversion needed between packages.
- `ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model overrides. The `convertOpts()` helper maps GenOpts to `[]inference.GenerateOption`.
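A hedged sketch of what that mapping looks like, using local stand-in option constructors modelled on the `inference.With*` style (the real signatures live in go-inference):

```go
// Illustrative convertOpts: map non-zero GenOpts fields to functional
// options. The option constructors here are local stand-ins, not the
// real go-inference package.
package main

import "fmt"

type GenOpts struct {
	Temperature   float64
	MaxTokens     int
	Model         string
	TopK          int
	TopP          float64
	RepeatPenalty float64
}

type config struct {
	temperature, topP, repeatPenalty float32
	maxTokens, topK                  int
}

type GenerateOption func(*config)

func WithTemperature(t float32) GenerateOption   { return func(c *config) { c.temperature = t } }
func WithMaxTokens(n int) GenerateOption         { return func(c *config) { c.maxTokens = n } }
func WithTopK(k int) GenerateOption              { return func(c *config) { c.topK = k } }
func WithTopP(p float32) GenerateOption          { return func(c *config) { c.topP = p } }
func WithRepeatPenalty(p float32) GenerateOption { return func(c *config) { c.repeatPenalty = p } }

// convertOpts skips zero values; GenOpts.Model is handled by the
// backend registry, not by the option list.
func convertOpts(o GenOpts) []GenerateOption {
	var opts []GenerateOption
	if o.Temperature != 0 {
		opts = append(opts, WithTemperature(float32(o.Temperature)))
	}
	if o.MaxTokens != 0 {
		opts = append(opts, WithMaxTokens(o.MaxTokens))
	}
	if o.TopK != 0 {
		opts = append(opts, WithTopK(o.TopK))
	}
	if o.TopP != 0 {
		opts = append(opts, WithTopP(float32(o.TopP)))
	}
	if o.RepeatPenalty != 0 {
		opts = append(opts, WithRepeatPenalty(float32(o.RepeatPenalty)))
	}
	return opts
}

func main() {
	var c config
	for _, opt := range convertOpts(GenOpts{Temperature: 0.7, MaxTokens: 2048}) {
		opt(&c)
	}
	fmt.Println(c.temperature, c.maxTokens) // 0.7 2048
}
```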
### Key Types
```go
// Backend interface (inference.go) — compatibility layer
type Backend interface {
    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
    Name() string
    Available() bool
}

// Deprecated: use inference.TextModel with iter.Seq[Token] directly
type StreamingBackend interface {
    Backend
    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}

type GenOpts struct {
    Temperature   float64
    MaxTokens     int
    Model         string  // override model for this request
    TopK          int     // top-k sampling (0 = disabled)
    TopP          float64 // nucleus sampling threshold (0 = disabled)
    RepeatPenalty float64 // repetition penalty (0 = disabled, 1.0 = no penalty)
}

// Type alias — identical to inference.Message
type Message = inference.Message
```
## Coding Standards
- **UK English**: colour, organisation, centre
- **Tests**: testify assert/require (existing), Pest-style names welcome for new tests
- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(mlx):`
- **UK English**: colour, organisation, centre, licence (noun)
- **SPDX header**: `// SPDX-Licence-Identifier: EUPL-1.2` in every new source file
- **Tests**: testify assert/require; `_Good`/`_Bad`/`_Ugly` suffix pattern
- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(agent):`
- **Co-Author**: `Co-Authored-By: Virgil <virgil@lethean.io>`
- **Licence**: EUPL-1.2
- **Imports**: stdlib → forge.lthn.ai → third-party, each group separated by blank line
@@ -161,8 +49,3 @@ type Message = inference.Message
- **Repo**: `forge.lthn.ai/core/go-ml`
- **Push via SSH**: `git push forge main` (remote: `ssh://git@forge.lthn.ai:2223/core/go-ml.git`)
## Task Queue
See `TODO.md` for prioritised work. Phase 1 (go-inference migration) is the critical path.
See `FINDINGS.md` for research notes and interface mapping.


@@ -1,208 +0,0 @@
# FINDINGS.md — go-ml Research & Discovery
## 2026-02-19: Split from go-ai (Virgil)
### Origin
Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only.
### What Was Extracted
- 41 Go files (~7,494 LOC excluding tests)
- 6 test files (backend_http, exact, heuristic, judge, probes, score)
- ml/ was 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).
### Dependencies
- `forge.lthn.ai/core/go-mlx` — Metal GPU inference (backend_mlx.go, darwin/arm64 only)
- `forge.lthn.ai/core/go-inference` — Shared TextModel/Backend/Token interfaces (target for Phase 1)
- `forge.lthn.ai/core/go` — Framework services, process management, logging
- `github.com/marcboeker/go-duckdb` — Analytics storage
- `github.com/parquet-go/parquet-go` — Columnar data I/O
- `github.com/stretchr/testify` — Test assertions
### Consumers
- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools (uses `ml.Service`, `ml.GenOpts`, `ml.Backend`)
- LEM Lab — Uses MLXBackend for chat inference
- go-i18n Phase 2a — Needs 5K sentences/sec Gemma3-1B classification (blocked on go-inference)
## go-inference Interface Mapping
### Type Correspondence
| go-ml | go-inference | Notes |
|-------|-------------|-------|
| `ml.Backend` | `inference.Backend` | Different semantics: ml returns string, inference returns TextModel |
| `ml.StreamingBackend` | (built into TextModel) | iter.Seq[Token] is inherently streaming |
| `ml.GenOpts` | `inference.GenerateConfig` | Use functional options: `WithMaxTokens(n)` etc. |
| `ml.Message` | `inference.Message` | Identical struct: Role + Content |
| `ml.TokenCallback` | (not needed) | iter.Seq[Token] replaces callbacks |
| (no equivalent) | `inference.Token` | `{ID int32, Text string}` |
| (no equivalent) | `inference.TextModel` | Generate/Chat return iter.Seq[Token] |
### Method Mapping
```
ml.Backend.Generate(ctx, prompt, GenOpts) → (string, error)
↕ InferenceAdapter collects tokens
inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]
ml.StreamingBackend.GenerateStream(ctx, prompt, opts, TokenCallback) → error
↕ InferenceAdapter forwards tokens to callback
inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]
ml.GenOpts{Temperature: 0.7, MaxTokens: 2048}
↕ convertOpts helper
inference.WithTemperature(0.7), inference.WithMaxTokens(2048)
```
### backend_mlx.go Before/After
**Before** (253 LOC — BROKEN, old subpackage imports):
```go
import (
    "forge.lthn.ai/core/go-mlx"
    "forge.lthn.ai/core/go-mlx/cache"     // REMOVED
    "forge.lthn.ai/core/go-mlx/model"     // REMOVED
    "forge.lthn.ai/core/go-mlx/sample"    // REMOVED
    "forge.lthn.ai/core/go-mlx/tokenizer" // REMOVED
)

type MLXBackend struct {
    model   model.Model
    tok     *tokenizer.Tokenizer
    caches  []cache.Cache
    sampler sample.Sampler
    // ... manual tokenisation, KV cache mgmt, sampling loop, memory cleanup
}
```
**After** (~60 LOC — uses go-inference + InferenceAdapter):
```go
import (
    "fmt"

    "forge.lthn.ai/core/go-inference"
    _ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init()
)

func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
    m, err := inference.LoadModel(modelPath)
    if err != nil {
        return nil, fmt.Errorf("mlx: %w", err)
    }
    return &InferenceAdapter{model: m, name: "mlx"}, nil
}
```
All tokenisation, KV cache management, sampling, and memory management are now handled inside go-mlx's `internal/metal/` package, accessed through the go-inference `TextModel` interface.
## Scoring Engine Architecture
### 5 Suites
| Suite | Method | LLM needed? | Metrics |
|-------|--------|-------------|---------|
| **Heuristic** | Regex + word analysis | No | 9 metrics → LEK composite |
| **Semantic** | LLM-as-judge | Yes | 4 dimensions (sovereignty, ethical, creative, self-concept) |
| **Content** | LLM-as-judge | Yes | 6 sovereignty probes (CCP, truth, engagement, etc.) |
| **Standard** | LLM-as-judge | Yes | TruthfulQA, DoNotAnswer, Toxigen |
| **Exact** | Numeric extraction | No | GSM8K answer matching |
### LEK Score Formula
```
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
- ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
```
Positive signals: engagement depth, creative form, emotional register, first-person voice.
Negative signals: RLHF compliance markers, formulaic preambles, text degeneration, empty/broken output.
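Under the stated weights the composite reduces to one weighted sum. A minimal sketch, assuming the nine metrics are available as plain `float64` values (the struct and field names below are illustrative, not `heuristic.go`'s real types):

```go
// LEK composite as a weighted sum, using the weights documented above.
// The metrics struct is a local stand-in for illustration only.
package main

import "fmt"

type metrics struct {
	EngagementDepth, CreativeForm, EmotionalRegister, FirstPerson   float64
	ComplianceMarkers, FormulaicPreamble, Degeneration, EmptyBroken float64
}

// lek applies the documented positive and negative weights.
func lek(m metrics) float64 {
	return m.EngagementDepth*2 + m.CreativeForm*3 + m.EmotionalRegister*2 + m.FirstPerson*1.5 -
		m.ComplianceMarkers*5 - m.FormulaicPreamble*3 - m.Degeneration*4 - m.EmptyBroken*20
}

func main() {
	// 1*2 + 1*3 + 1*1.5 - 0.5*5 = 4
	fmt.Println(lek(metrics{EngagementDepth: 1, CreativeForm: 1, FirstPerson: 1, ComplianceMarkers: 0.5}))
}
```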
### Concurrency Model
`Engine.ScoreAll()` fans out goroutines bounded by semaphore (`concurrency` setting). Heuristic runs inline (instant). Semantic/content/standard run via worker pool with `sync.WaitGroup`. Results collected into `[]PromptScore` via mutex.
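The fan-out pattern described above, sketched with a stand-in score function (the real `Engine.ScoreAll()` wiring differs in detail):

```go
// Semaphore-bounded fan-out: goroutines per prompt, a buffered
// channel as the semaphore, and a mutex-guarded result slice.
package main

import (
	"fmt"
	"sync"
)

type PromptScore struct {
	Prompt string
	Score  float64
}

func scoreAll(prompts []string, concurrency int, score func(string) float64) []PromptScore {
	sem := make(chan struct{}, concurrency) // bounds concurrent workers
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results []PromptScore
	)
	for _, p := range prompts {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it
			s := score(p)
			mu.Lock()
			results = append(results, PromptScore{Prompt: p, Score: s})
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return results
}

func main() {
	out := scoreAll([]string{"a", "bb", "ccc"}, 2, func(p string) float64 { return float64(len(p)) })
	fmt.Println(len(out)) // 3
}
```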
## Phase 2 Audit: StreamingBackend Usage (Virgil, 20 Feb 2026)
### Callers of GenerateStream/ChatStream
Only 2 files across the entire ecosystem call StreamingBackend methods:
1. **`host-uk/cli/cmd/ml/cmd_serve.go`** (lines 146, 201, 319)
- Type-asserts `backend.(ml.StreamingBackend)` for SSE streaming
- `/v1/completions` → `streamer.GenerateStream()` (line 201)
- `/v1/chat/completions` → `streamer.ChatStream()` (line 319)
- Has non-streaming fallback: `backend.Generate()` when assertion fails
2. **`host-uk/cli/cmd/ml/cmd_chat.go`**
- Direct `ChatStream()` call for terminal token-by-token echo
- No fallback — assumes backend supports streaming
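The assert-then-fallback pattern looks roughly like this (local interfaces for illustration, not the cli repo's actual code):

```go
// Type-assert StreamingBackend for token-by-token output, falling
// back to a single complete response when the backend can't stream.
package main

import "fmt"

type Backend interface {
	Generate(prompt string) (string, error)
}

type StreamingBackend interface {
	Backend
	GenerateStream(prompt string, cb func(string) error) error
}

// serve streams when supported, otherwise emits one complete string.
func serve(b Backend, prompt string, emit func(string) error) error {
	if streamer, ok := b.(StreamingBackend); ok {
		return streamer.GenerateStream(prompt, emit)
	}
	out, err := b.Generate(prompt)
	if err != nil {
		return err
	}
	return emit(out)
}

type plain struct{}

func (plain) Generate(p string) (string, error) { return "full:" + p, nil }

func main() {
	_ = serve(plain{}, "hi", func(s string) error { fmt.Println(s); return nil }) // full:hi
}
```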
### Non-streaming consumers (use Backend.Generate only)
| File | Method | Notes |
|------|--------|-------|
| service.go | `Backend.Generate()` | Backend registry dispatch |
| judge.go | `Backend.Generate()` | Via judgeChat() |
| agent.go | `Backend.Generate()` | Probe evaluation |
| expand.go | `Backend.Generate()` | Prompt expansion |
| go-ai/mcp/tools_ml.go | `ml.Service` | Via service layer |
### Backend Implementation Status
| Backend | Backend? | StreamingBackend? | Notes |
|---------|----------|-------------------|-------|
| InferenceAdapter | YES | YES | Bridges iter.Seq[Token] → callbacks |
| HTTPBackend | YES | NO | Returns complete string from API |
| LlamaBackend | YES | NO | Returns complete string via HTTP |
### Conclusion
StreamingBackend is only needed by `host-uk/cli` (2 files, out of go-ml scope). Safe to deprecate in go-ml with a comment. The actual migration of those CLI files is a separate task for the cli repo.
### GenOpts vs GenerateConfig Field Comparison
| ml.GenOpts | inference.GenerateConfig | Type |
|-----------|--------------------------|------|
| Temperature | Temperature | float64 vs float32 |
| MaxTokens | MaxTokens | int (same) |
| Model | (none) | string |
| (none) | TopK | int |
| (none) | TopP | float32 |
| (none) | StopTokens | []int32 |
| (none) | RepeatPenalty | float32 |
| (none) | ReturnLogits | bool |
## Known Issues
- ~~**backend_mlx.go imports dead subpackages**~~ — FIXED in Phase 1 (`c3c2c14`)
- **agent.go too large** — 1,070 LOC, SSH + InfluxDB + scoring + publishing mixed together
- **Hardcoded infrastructure** — InfluxDB endpoint `10.69.69.165:8181`, M3 SSH details in agent.go
- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists
- **score.go concurrency untested** — No race condition tests
- ~~**Message type duplication**~~ — FIXED in Phase 2 (`747e703`): type alias `Message = inference.Message`
## Phase 3 Audit: agent.go Structure (Virgil, 20 Feb 2026)
### File Layout (1,070 LOC)
| Section | Lines | LOC | Purpose |
|---------|-------|-----|---------|
| Types & Config | 19-112 | ~95 | `AgentConfig`, `Checkpoint`, config maps, `AdapterMeta()` |
| Main Loop | 141-343 | ~200 | `RunAgentLoop()`, checkpoint discovery, unscored filtering |
| Evaluation | 345-700 | ~355 | MLX-native + conversion paths, 4 probe functions |
| Judge & Push | 708-887 | ~180 | Scoring, InfluxDB line protocol, DuckDB dual-write |
| Buffering | 926-977 | ~50 | JSONL buffer for InfluxDB failures |
| SSH/SCP | 979-1070 | ~90 | `SSHCommand()`, `SCPFrom()`, `SCPTo()`, utility helpers |
### Hardcoded Infrastructure
- SSH options duplicated across 3 functions: `ConnectTimeout=10, BatchMode=yes, StrictHostKeyChecking=no`
- InfluxDB timestamp base: `1739577600` (13 Feb 2026 00:00 UTC)
- InfluxDB measurements: `probe_score`, `capability_score`, `capability_judge`, `content_score`
- DuckDB tables: `checkpoint_scores`, `probe_results`
### Test Coverage
Zero tests for agent.go. Testable without infrastructure:
- `AdapterMeta()` — pure function, dirname → metadata
- `FindUnscored()` — filtering logic
- `BufferInfluxResult()`/`ReplayInfluxBuffer()` — JSONL round-trip

TODO.md

@@ -1,193 +0,0 @@
# TODO.md — go-ml Task Queue
Dispatched from Virgil in core/go. Pick up tasks in phase order.
---
## Phase 1: go-inference Migration (CRITICAL PATH)
Everything downstream is blocked on this. The old `backend_mlx.go` imports go-mlx subpackages that no longer exist after Phase 4 refactoring.
### Step 1.1: Add go-inference dependency
- [x] **Add `forge.lthn.ai/core/go-inference` to go.mod** — Already has a `replace` directive pointing to `../go-inference`. Run `go get forge.lthn.ai/core/go-inference` then `go mod tidy`. Verify the module resolves.
### Step 1.2: Write the InferenceAdapter
- [x] **Create `adapter.go`** — Bridge between `go-inference.TextModel` (returns `iter.Seq[Token]`) and `ml.Backend` + `ml.StreamingBackend` (returns `string`/callback). Must implement:
- `Generate()` — collect tokens from iterator into string
- `Chat()` — same, using `TextModel.Chat()`
- `GenerateStream()` — forward tokens to `TokenCallback`
- `ChatStream()` — same for chat
- `Name()` — delegate to `TextModel.ModelType()`
- `Available()` — always true (model already loaded)
- `convertOpts(GenOpts) []inference.GenerateOption` — map `GenOpts` fields to functional options
**Key mapping**:
```
GenOpts.Temperature → inference.WithTemperature(float32(t))
GenOpts.MaxTokens → inference.WithMaxTokens(n)
GenOpts.Model → (ignored, model already loaded)
```
**Error handling**: After the iterator completes, check `model.Err()` to distinguish EOS from errors (OOM, ctx cancelled).
- [x] **Test adapter.go** — 13 test cases with mock TextModel (all pass). Test cases:
- Normal generation (collect tokens → string)
- Streaming (each token hits callback)
- Callback error stops iteration
- Context cancellation propagates
- Empty output (EOS immediately)
- Model error after partial output
### Step 1.3: Rewrite backend_mlx.go
- [x] **Replace backend_mlx.go** — Deleted the 253 LOC that manually handle tokenisation, KV cache, sampling, and memory cleanup. Replaced with ~35 LOC:
```go
//go:build darwin && arm64

package ml

import (
    "fmt"

    "forge.lthn.ai/core/go-inference"
    _ "forge.lthn.ai/core/go-mlx" // registers "metal" backend
)

func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
    m, err := inference.LoadModel(modelPath)
    if err != nil {
        return nil, fmt.Errorf("mlx: %w", err)
    }
    return &InferenceAdapter{model: m, name: "mlx"}, nil
}
```
The `InferenceAdapter` from Step 1.2 handles all the Generate/Chat/Stream logic.
- [x] **Preserve memory controls** — Deferred: go-mlx handles cache/memory limits internally. Callers can use `mlx.SetCacheLimit()`/`mlx.SetMemoryLimit()` directly. No wrapper needed until a concrete use case arises.
- [x] **Test backend_mlx.go** — Covered by Phase 4 `backend_mlx_test.go` (8 tests via mock TextModel). Integration smoke test with real model deferred until LEM Lab pipeline is wired.
### Step 1.4: HTTPBackend and LlamaBackend wrappers
- [x] **HTTPBackend go-inference wrapper** — `backend_http_textmodel.go`: `HTTPTextModel` wraps `HTTPBackend` to implement `inference.TextModel`. Generate/Chat yield entire response as single Token. Classify returns unsupported error. BatchGenerate processes prompts sequentially. 17 tests pass.
- [x] **LlamaBackend go-inference wrapper** — `backend_http_textmodel.go`: `LlamaTextModel` embeds `HTTPTextModel`, overrides `ModelType()` -> "llama" and `Close()` -> `llama.Stop()`. 2 tests pass.
### Step 1.5: Verify downstream consumers
- [x] **Service.Generate() still works** — `service.go` calls `Backend.Generate()`. InferenceAdapter satisfies ml.Backend. HTTPBackend/LlamaBackend still implement ml.Backend directly. No changes needed.
- [x] **Judge still works** — `judge.go` calls `Backend.Generate()` via `judgeChat()`. Same Backend contract, works as before. No changes needed.
- [x] **go-ai tools_ml.go** — Uses `ml.Service` directly. `ml.Backend` interface is preserved, no code changes needed in go-ai.
---
## Phase 2: Backend Consolidation
After Phase 1, both `ml.Backend` (string) and `inference.TextModel` (iterator) coexist. Reconcile.
### Audit Results (Virgil, 20 Feb 2026)
**StreamingBackend callers** — Only 2 files in `host-uk/cli`:
- `cmd/ml/cmd_serve.go` lines 146, 201, 319: Type-asserts `backend.(ml.StreamingBackend)` for SSE streaming at `/v1/completions` and `/v1/chat/completions`
- `cmd/ml/cmd_chat.go`: Direct `ChatStream()` call for interactive terminal token echo
All other consumers (service.go, judge.go, agent.go, expand.go, go-ai tools_ml.go) use `Backend.Generate()` — NOT streaming.
**Backend implementations**:
- `InferenceAdapter` → implements Backend + StreamingBackend (via go-inference iter.Seq)
- `HTTPBackend` → implements Backend only (no streaming)
- `LlamaBackend` → implements Backend only (no streaming)
### Step 2.1: Unify Message types
- [x] **Type alias ml.Message → inference.Message** — In `inference.go`, replace the `Message` struct with:
```go
type Message = inference.Message
```
This is backward-compatible — all existing callers keep working. Remove the `convertMessages()` helper from `adapter.go` since types are now identical. Verify with `go build ./...` and `go test ./...`.
### Step 2.2: Unify GenOpts
- [x] **Add inference fields to GenOpts** — Extend `ml.GenOpts` to include the extra fields from `inference.GenerateConfig`:
```go
type GenOpts struct {
    Temperature   float64
    MaxTokens     int
    Model         string  // override model for this request
    TopK          int     // NEW: from inference.GenerateConfig
    TopP          float64 // NEW: from inference.GenerateConfig (float64 to match Temperature)
    RepeatPenalty float64 // NEW: from inference.GenerateConfig
}
```
Update `convertOpts()` in adapter.go to map the new fields. Existing callers that only set Temperature/MaxTokens/Model continue working unchanged.
### Step 2.3: Deprecate StreamingBackend
- [x] **Mark StreamingBackend as deprecated** — Add deprecation comment:
```go
// Deprecated: StreamingBackend is retained for backward compatibility.
// New code should use inference.TextModel with iter.Seq[Token] directly.
// See InferenceAdapter for the bridge pattern.
type StreamingBackend interface { ... }
```
Do NOT remove yet — `host-uk/cli` cmd_serve.go and cmd_chat.go still depend on it. Those migrations are out of scope for go-ml (they live in a different repo).
### Step 2.4: Document migration path
- [x] **Update CLAUDE.md** — Add "Backend Architecture" section documenting:
- `inference.TextModel` (iterator-based) is the preferred API for new code
- `ml.Backend` (string-based) is the compatibility layer, still supported
- `StreamingBackend` is deprecated, use `iter.Seq[Token]` directly
- `InferenceAdapter` bridges TextModel → Backend/StreamingBackend
- `HTTPTextModel`/`LlamaTextModel` bridges Backend → TextModel (reverse direction)
---
## Phase 3: Agent Loop Modernisation
`agent.go` (1,070 LOC) is the largest file with SSH, InfluxDB, scoring, and publishing mixed together. Decompose into focused files.
### Step 3.1: Split agent.go into 5 files — COMPLETE
- [x] **Split `agent.go` (1,070 LOC) into 5 focused files** — Commit `eae9ec9`. All `go build/test/vet` pass:
- `agent_config.go` (97 LOC): AgentConfig, Checkpoint, BaseModelMap, ModelFamilies, AdapterMeta()
- `agent_execute.go` (215 LOC): RunAgentLoop, DiscoverCheckpoints, GetScoredLabels, FindUnscored, ProcessOne, isMLXNative
- `agent_eval.go` (397 LOC): processMLXNative, processWithConversion, RunCapabilityProbes/Full, RunContentProbes, ProbeResult types
- `agent_influx.go` (291 LOC): ScoreCapabilityAndPush, ScoreContentAndPush, PushCapability*, BufferInfluxResult, ReplayInfluxBuffer
- `agent_ssh.go` (102 LOC): SSHCommand, SCPFrom, SCPTo, fileBase, EnvOr, IntEnvOr, ExpandHome
### Step 3.2: Abstract SSH transport — COMPLETE
- [x] **RemoteTransport interface + SSHTransport** — Commit `1c2a6a6`. Interface with Run/CopyFrom/CopyTo, SSHTransport implementation with functional options (WithPort, WithTimeout). AgentConfig.Transport field with lazy init. All callers updated (DiscoverCheckpoints, processMLXNative, processWithConversion). Old SSHCommand/SCPFrom/SCPTo preserved as deprecated wrappers. Build/test/vet clean.
### Step 3.3: Configurable infrastructure — COMPLETE
- [x] **Extract hardcoded values to constants** — Commit `12f3a1c`. 15 constants in agent_config.go: EpochBase, 5 InfluxDB measurements, 2 DuckDB tables, probe defaults (temp/maxTokens/truncation), InfluxBufferFile, LogSeparatorWidth, InterCheckpointDelay. Hardcoded probe counts replaced with len(). 7 files, build/test/vet clean.
### Step 3.4: Agent tests — COMPLETE
- [x] **Test `AdapterMeta()`** — 8 tests: known families (12 entries), variant suffix, subdirectory patterns, unknown fallback, no-prefix edge case. Commit `3e22761`.
- [x] **Test `FindUnscored()`** — 5 tests: all unscored (sorted), some scored, all scored, empty input, nil scored map. Commit `3e22761`.
- [x] **Test `BufferInfluxResult()`/`ReplayInfluxBuffer()`** — 4 tests: JSONL round-trip, multiple entries, empty file, missing file. Commit `3e22761`.
- [x] **Test `DiscoverCheckpoints()`** — 6 tests: happy path (3 checkpoints across 2 dirs), subdirectory pattern, no adapters, SSH error, filter pattern, no safetensors. Uses `fakeTransport` mock implementing `RemoteTransport`. Commit `3e22761`.
---
## Phase 4: Test Coverage — COMPLETE
All 4 test files created and verified with `go test -race ./...`. Commit `09bf403`.
- [x] **backend_llama_test.go** — 20 tests via httptest mock: Name, Available (4 variants), Generate (6 variants incl. context cancellation, empty choices, opts forwarding), Chat (3 variants), Stop, constructor (4 variants), interface compliance.
- [x] **backend_mlx_test.go** — 8 tests via mock TextModel (no build tag needed): Generate, Chat, Stream, ModelError, Close, ModelAccess, InterfaceCompliance, ConvertOpts.
- [x] **score_race_test.go** — 6 race-condition tests: ConcurrentSemantic (20 responses, concurrency=4), ConcurrentMixedSuites (semantic+standard+content fan-out), SemaphoreBoundary (concurrency=1, verifies max concurrent==1), ContextCancellation (400 error→nil semantic), HeuristicOnlyNoRace (50 responses), MultiModelConcurrent (4 models×5 concurrent map writes).
- [x] **benchmark_test.go** — 25 benchmarks: HeuristicScore (5 sizes: 25µs–8.8ms), ExactMatch (4 patterns: 171ns–2.1µs), JudgeExtractJSON (6 variants: 2.5–3.4µs), Judge round-trip (2 suites: ~52µs), ScoreAll (2 modes: 25µs–4.5ms), sub-components (5 heuristic stages: 244ns–88µs). Baselines on M3 Ultra.
---
## Workflow
1. Virgil in core/go writes tasks here after research
2. This repo's session picks up tasks in phase order
3. Mark `[x]` when done, note commit hash
4. New discoveries → add tasks, note in FINDINGS.md
5. Push to forge after each completed step: `git push forge main`

docs/architecture.md Normal file

@@ -0,0 +1,378 @@
# go-ml Architecture
## Overview
`forge.lthn.ai/core/go-ml` is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from `go-ai` on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.
The package provides three distinct subsystems:
1. **Pluggable inference backends** — a common `Backend` interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
2. **Multi-suite scoring engine** — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
3. **Agent orchestrator** — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.
---
## Dependency Graph
```
forge.lthn.ai/core/go-ml
├── forge.lthn.ai/core/go-inference (shared TextModel/Token interfaces)
│ └── (no further Core deps)
├── forge.lthn.ai/core/go-mlx (Metal GPU inference, darwin/arm64 only)
│ └── forge.lthn.ai/core/go-inference
├── forge.lthn.ai/core/go (ServiceRuntime, process, log)
├── github.com/marcboeker/go-duckdb (analytics storage)
└── github.com/parquet-go/parquet-go (columnar data I/O)
```
### Role of each dependency
| Module | Purpose |
|--------|---------|
| `go-inference` | Zero-dependency shared interfaces. Defines `TextModel`, `Token`, `Backend`, `GenerateConfig`. Compiles on all platforms. |
| `go-mlx` | Native Metal GPU inference for Apple Silicon. Registers the `"metal"` backend via its `init()` function. Active only on `darwin && arm64`. |
| `go` | Core framework. Provides `ServiceRuntime`, lifecycle hooks (`OnStartup`/`OnShutdown`), process management, and structured logging. |
| `go-duckdb` | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
| `parquet-go` | Columnar Parquet I/O for bulk dataset export and import. |
---
## Backend Architecture
Two interface families coexist within go-ml, connected by a set of adapters.
### The `ml.Backend` interface (compatibility layer)
```go
type Backend interface {
    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
    Name() string
    Available() bool
}
```
`Backend` returns complete strings. It is the primary interface consumed by `service.go`, `judge.go`, `agent_eval.go`, and `expand.go`. All three concrete backend types — `HTTPBackend`, `LlamaBackend`, and `InferenceAdapter` — satisfy this interface.
### The `inference.TextModel` interface (preferred for new code)
Defined in `go-inference`, this interface returns `iter.Seq[inference.Token]` — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use `TextModel`.
### `ml.StreamingBackend` (deprecated)
```go
// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
type StreamingBackend interface {
    Backend
    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}
```
Only two files in `host-uk/cli` call `StreamingBackend` methods. It is retained for backward compatibility; no new code should use it.
### Type unification
`ml.Message` is a type alias for `inference.Message`:
```go
type Message = inference.Message
```
The two types are identical at compile time. No conversion is needed when passing messages between the `ml` and `inference` packages.
`ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model selection:
```go
type GenOpts struct {
Temperature float64
MaxTokens int
Model string // per-request model override; ignored by GPU backends
TopK int
TopP float64
RepeatPenalty float64
}
```
---
## Backend Implementations
### HTTPBackend (`backend_http.go`)
Speaks the OpenAI-compatible `/v1/chat/completions` API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).
- Implements `ml.Backend` only (no streaming — returns complete response strings).
- Retries up to 3 times with exponential backoff on 5xx and connection errors.
- 300-second HTTP client timeout suitable for long-running inference.
### LlamaBackend (`backend_llama.go`)
Manages a `llama-server` subprocess and delegates HTTP calls to an embedded `HTTPBackend`.
- Implements `ml.Backend`.
- `Start()` launches the subprocess and polls the `/health` endpoint for up to 30 seconds.
- `Stop()` kills the managed process via the Core `process.Service`.
- Supports optional LoRA adapter loading via `--lora`.
### InferenceAdapter (`adapter.go`)
Bridges a `go-inference.TextModel` (iterator-based) into the `ml.Backend` and `ml.StreamingBackend` interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.
```
inference.TextModel (iter.Seq[Token])
        └── InferenceAdapter ──┬──► ml.Backend (string)
                               └──► ml.StreamingBackend (TokenCallback)
```
Key behaviours:
- `Generate` and `Chat` collect all tokens into a `strings.Builder` and return the concatenated string. After the iterator is exhausted, `model.Err()` is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
- `GenerateStream` and `ChatStream` forward each token's text to the provided `TokenCallback`. If the callback returns an error, iteration stops.
- `Available()` always returns `true` — the model is already loaded when the adapter is constructed.
- `Close()` delegates to `TextModel.Close()`, releasing GPU memory.
### MLX Backend (`backend_mlx.go`, darwin/arm64 only)
```go
//go:build darwin && arm64
func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) {
m, err := inference.LoadModel(modelPath, loadOpts...)
// ...
return NewInferenceAdapter(m, "mlx"), nil
}
```
The blank import `_ "forge.lthn.ai/core/go-mlx"` triggers go-mlx's `init()`, which registers the `"metal"` backend with go-inference's backend registry. Subsequent calls to `inference.LoadModel()` automatically use Metal GPU acceleration on Apple Silicon.
The model file at `modelPath` may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's `internal/metal/` package.
### Reverse adapters (`backend_http_textmodel.go`)
Two types wrap `ml` backends as `inference.TextModel`, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. `go-ai`, `go-i18n`).
| Type | Wraps | Notes |
|------|-------|-------|
| `HTTPTextModel` | `*HTTPBackend` | Yields the full HTTP response as a single `Token`. Classify returns an unsupported error. BatchGenerate processes sequentially. |
| `LlamaTextModel` | `*LlamaBackend` | Embeds `HTTPTextModel`; overrides `ModelType()``"llama"` and `Close()``llama.Stop()`. |
### Adapter map (all directions)
```
ml.Backend (string) <──── InferenceAdapter ──── inference.TextModel (iter.Seq[Token])
(adapter.go)
ml.HTTPBackend ──── HTTPTextModel ────► inference.TextModel
ml.LlamaBackend ─── LlamaTextModel ───► inference.TextModel
(backend_http_textmodel.go)
```
---
## Service Layer (`service.go`)
`Service` integrates go-ml into the Core framework lifecycle:
```go
core.New(
framework.WithName("ml", ml.NewService(ml.Options{
OllamaURL: "http://localhost:11434",
JudgeURL: "http://localhost:11434",
JudgeModel: "qwen3:8b",
Concurrency: 4,
Suites: "all",
})),
)
```
`OnStartup` registers the Ollama backend and initialises the `Judge` and scoring `Engine` if a judge URL is configured. Backends can also be registered at runtime via `RegisterBackend(name, backend)`.
---
## Scoring Engine
### Engine (`score.go`)
`Engine.ScoreAll()` evaluates a slice of `Response` values across all configured suites concurrently.
```
ScoreAll(responses []Response) map[string][]PromptScore
├── Heuristic (inline, no goroutine)
└── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)
```
The worker pool is bounded by a semaphore channel of capacity `concurrency`, with a `sync.WaitGroup` coordinating completion. Each worker writes its result to a pre-allocated score slot via pointer, so no locking or per-result allocation is needed during fan-out.
Suites are selected at engine construction time via a comma-separated string or `"all"`.
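A minimal sketch of that fan-out shape, with integers standing in for responses and scores:

```go
package main

import (
	"fmt"
	"sync"
)

// scoreAll fans work out across a semaphore-bounded worker pool. Each
// goroutine writes only to its own pre-allocated slot, so no mutex is
// needed on the results.
func scoreAll(inputs []int, concurrency int, score func(int) int) []int {
	results := make([]int, len(inputs)) // pre-allocated slots
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, in := range inputs {
		wg.Add(1)
		sem <- struct{}{} // acquire: blocks when the pool is full
		go func(slot *int, v int) {
			defer wg.Done()
			defer func() { <-sem }() // release
			*slot = score(v)
		}(&results[i], in)
	}
	wg.Wait()
	return results
}

func main() {
	out := scoreAll([]int{1, 2, 3, 4}, 2, func(v int) int { return v * v })
	fmt.Println(out)
}
```

Acquiring the semaphore before spawning the goroutine is what caps the number of in-flight workers at `concurrency`.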
### Heuristic scoring (`heuristic.go`)
Analyses a response using pre-compiled regular expressions. No LLM is needed.
Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score:
```
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
- ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
```
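Transcribed directly, the composite is a weighted sum; the struct below names only the sub-scores that appear in the formula:

```go
package main

import "fmt"

// subScores holds the heuristic sub-scores that feed the LEK composite.
type subScores struct {
	EngagementDepth   float64
	CreativeForm      float64
	EmotionalRegister float64
	FirstPerson       float64
	ComplianceMarkers float64
	FormulaicPreamble float64
	Degeneration      float64
	EmptyBroken       float64
}

// lek applies the published weights: positive signals add, negative subtract.
func lek(s subScores) float64 {
	return s.EngagementDepth*2 + s.CreativeForm*3 + s.EmotionalRegister*2 + s.FirstPerson*1.5 -
		s.ComplianceMarkers*5 - s.FormulaicPreamble*3 - s.Degeneration*4 - s.EmptyBroken*20
}

func main() {
	// A response with some depth but compliant, formulaic framing nets negative.
	fmt.Println(lek(subScores{EngagementDepth: 2, ComplianceMarkers: 1, FormulaicPreamble: 1}))
}
```

The heavy -20 weight on `EmptyBroken` means a single broken response dominates everything else, which is the intended behaviour: garbage output should never score well.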
**Positive signals**
| Sub-score | What it measures |
|-----------|-----------------|
| `EngagementDepth` | Structural markers (headings, bold), ethical vocabulary, technical depth, word count |
| `CreativeForm` | Poetry structure (short lines), narrative openings, metaphor density |
| `EmotionalRegister` | Emotional vocabulary (feel, grief, compassion, etc.) |
| `FirstPerson` | Sentences beginning with "I" or containing first-person agency verbs |
**Negative signals**
| Sub-score | What it measures |
|-----------|-----------------|
| `ComplianceMarkers` | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") |
| `FormulaicPreamble` | Opener templates ("Sure, let's...", "Great question") |
| `Degeneration` | Sentence repetition ratio (looping/stuck output) |
| `EmptyBroken` | Empty, error-prefixed, or pad-token-polluted responses |
### Judge (`judge.go`)
`Judge` uses any `Backend` as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response.
```go
judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b"))
scores, err := judge.ScoreSemantic(ctx, prompt, response)
```
JSON extraction (`extractJSON`) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences.
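A sketch of that fallback order; the real `extractJSON` in `judge.go` may differ in its details, and the code fence here is assembled at runtime so the example embeds cleanly in this document:

```go
package main

import (
	"fmt"
	"strings"
)

// fence is built at runtime to avoid a literal triple-backtick in this doc.
var fence = strings.Repeat("`", 3)

// extractJSON pulls a JSON object out of a judge response: it looks
// inside a markdown code fence first, then falls back to the first
// {...} span, which also covers raw JSON and JSON embedded in prose.
func extractJSON(s string) string {
	if i := strings.Index(s, fence); i >= 0 {
		s = s[i+len(fence):]
		s = strings.TrimPrefix(s, "json")
		if j := strings.Index(s, fence); j >= 0 {
			s = s[:j]
		}
	}
	start := strings.Index(s, "{")
	end := strings.LastIndex(s, "}")
	if start >= 0 && end > start {
		return s[start : end+1]
	}
	return ""
}

func main() {
	resp := "Here you go:\n" + fence + "json\n{\"score\": 7}\n" + fence
	fmt.Println(extractJSON(resp))
}
```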
Six scoring methods are available:
| Method | Suite | Dimensions |
|--------|-------|-----------|
| `ScoreSemantic` | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept |
| `ScoreContent` | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister |
| `ScoreCapability` | (agent) | Reasoning, Correctness, Clarity |
| `ScoreTruthfulQA` | standard | Truthfulness, Informativeness |
| `ScoreDoNotAnswer` | standard | Safety, Nuance |
| `ScoreToxigen` | standard | Kindness, Awareness |
### Exact match (`exact.go`)
`scoreGSM8K` extracts numeric answers from free-text responses using pattern matching. Returns `*StandardScores` with `Correct`, `Extracted`, and `Expected` fields. No LLM required.
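The extraction idea, sketched with a single pattern; the patterns in `exact.go` are more thorough:

```go
package main

import (
	"fmt"
	"regexp"
)

// numRe matches integers or decimals, with optional thousands separators.
var numRe = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)

// extractAnswer returns the last number in a free-text response;
// GSM8K answers typically arrive at the end ("... so the answer is 36").
func extractAnswer(response string) (string, bool) {
	matches := numRe.FindAllString(response, -1)
	if len(matches) == 0 {
		return "", false
	}
	return matches[len(matches)-1], true
}

func main() {
	got, ok := extractAnswer("She buys 3 packs of 12, so the answer is 36.")
	fmt.Println(got, ok)
}
```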
### Capability probes (`probes.go`)
23 binary pass/fail tests across five categories. Each probe is a `Prompt` string paired with a `Check func(response string) bool`. No judge model is required — all checks use string matching or regex on the raw response.
| Category | Probes | Examples |
|----------|--------|---------|
| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | `347×29`, circle area, Fibonacci |
| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions |
| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug |
| Word problems (2) | word | speed/distance, sibling counting |
`StripThinkBlocks()` removes `<think>...</think>` sections from DeepSeek R1 responses before checking.
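The probe shape, sketched with a hypothetical entry for the `347×29` arithmetic probe mentioned above (the real definitions live in `probes.go`):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var thinkRe = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripThinkBlocks removes <think>...</think> reasoning sections, as
// emitted by DeepSeek R1 models, before the answer is checked.
func stripThinkBlocks(s string) string {
	return strings.TrimSpace(thinkRe.ReplaceAllString(s, ""))
}

// probe pairs a prompt with a binary pass/fail check on the raw response.
type probe struct {
	Prompt string
	Check  func(response string) bool
}

var arithmeticProbe = probe{
	Prompt: "What is 347 multiplied by 29?",
	Check: func(response string) bool {
		return strings.Contains(stripThinkBlocks(response), "10063")
	},
}

func main() {
	resp := "<think>347*29 = 347*30 - 347</think>The answer is 10063."
	fmt.Println(arithmeticProbe.Check(resp))
}
```

Stripping the think block first matters: a model that only mentions the right number while reasoning, then answers wrongly, should not pass.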
---
## Agent Orchestrator
The agent subsystem (`agent_*.go`) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3").
### Files
| File | LOC | Responsibility |
|------|-----|---------------|
| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` |
| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `FindUnscored`, `ProcessOne` |
| `agent_eval.go` | 397 | MLX-native and conversion evaluation paths, capability and content probe runners |
| `agent_influx.go` | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay |
| `agent_ssh.go` | 102 | `RemoteTransport` interface, `SSHTransport` implementation, utility helpers |
### Workflow
```
RunAgentLoop
├── ReplayInfluxBuffer (flush any buffered writes from previous failures)
├── DiscoverCheckpoints (SSH ls on M3 adapter directories)
├── GetScoredLabels (InfluxDB query for already-scored (run_id, label) pairs)
├── FindUnscored (set difference, sorted by dirname + iteration)
└── ProcessOne (for each unscored checkpoint)
├── isMLXNative? YES → processMLXNative (serve directly via mlx_lm.server)
│ NO → processWithConversion (MLX→GGUF, then llama-server)
├── RunCapabilityProbes (23 binary probes)
├── RunContentProbes (sovereignty probes)
├── ScoreCapabilityAndPush (judge + InfluxDB)
└── ScoreContentAndPush (judge + InfluxDB)
```
### RemoteTransport
`RemoteTransport` abstracts SSH/SCP so that tests can supply an in-memory fake:
```go
type RemoteTransport interface {
Run(ctx context.Context, cmd string) (string, error)
CopyFrom(ctx context.Context, remote, local string) error
CopyTo(ctx context.Context, local, remote string) error
}
```
`SSHTransport` implements this interface using the system `ssh` and `scp` binaries with a configurable port and timeout. `AgentConfig.Transport` is lazily initialised: if nil, an `SSHTransport` is constructed from `M3Host`, `M3User`, and `M3SSHKey`.
### Checkpoint discovery
`DiscoverCheckpoints` runs `ls -d adapters-*` on the remote host, then for each adapter directory checks for subdirectories matching `gemma-3-*` (supporting nested directory layouts). It then lists `*_adapters.safetensors` files and extracts the iteration number from the filename.
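A sketch of the filename parsing; the zero-padded `0000600_adapters.safetensors` shape is an assumption for illustration, and the real parsing lives in `agent_execute.go`:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// iterationFromFilename extracts the training iteration from an adapter
// checkpoint filename such as "0000600_adapters.safetensors".
func iterationFromFilename(name string) (int, bool) {
	prefix, ok := strings.CutSuffix(name, "_adapters.safetensors")
	if !ok {
		return 0, false // not a checkpoint file
	}
	n, err := strconv.Atoi(prefix)
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	n, ok := iterationFromFilename("0000600_adapters.safetensors")
	fmt.Println(n, ok)
}
```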
`AdapterMeta` maps a directory name to a `(model_tag, label_prefix, run_id_stem)` triple using prefix matching against `ModelFamilies`.
### Persistence
Results are written to two stores simultaneously:
- **InfluxDB** — line protocol over HTTP. Five measurements: `capability_score`, `capability_judge`, `content_score`, `probe_score`, `training_loss`.
- **DuckDB** — local analytical database. Two tables: `checkpoint_scores`, `probe_results`.
If InfluxDB is unreachable, results are buffered to `influx_buffer.jsonl` (JSONL, one entry per line). `ReplayInfluxBuffer` is called at the start of each loop iteration to flush the buffer.
---
## Data Pipeline
| File | Purpose |
|------|---------|
| `ingest.go` | Load JSONL response files into `[]Response` slices |
| `db.go` | DuckDB schema creation, insert, and query helpers |
| `influx.go` | InfluxDB HTTP client (line protocol write, SQL query) |
| `gguf.go` | GGUF file format parsing (magic, version, metadata, tensor inventory) |
| `worker.go` | LEM API worker for distributed inference job dispatch |
| `expand.go` | Prompt expansion using a backend |
| `normalize.go` | Response normalisation utilities |
| `parquet.go` | Parquet dataset export |
---
## Test Coverage
| File | Tests | What is covered |
|------|-------|----------------|
| `adapter_test.go` | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors |
| `backend_http_test.go` | — | HTTPBackend: generate, chat, retries, status codes |
| `backend_http_textmodel_test.go` | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, classify, batch |
| `backend_llama_test.go` | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants |
| `backend_mlx_test.go` | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion |
| `heuristic_test.go` | — | All nine heuristic sub-scores and LEK formula |
| `judge_test.go` | — | JSON extraction variants, ScoreSemantic, ScoreContent |
| `exact_test.go` | — | Numeric extraction patterns |
| `probes_test.go` | — | All 23 capability probe Check functions |
| `score_test.go` | — | Engine suite selection, ScoreAll grouping |
| `score_race_test.go` | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes |
| `agent_test.go` | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport |
| `benchmark_test.go` | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), ScoreAll (2 modes), heuristic sub-components (5 stages) |

---

**New file: `docs/development.md`** (307 lines)
# go-ml Development Guide
## Prerequisites
### Required
- **Go 1.25** or later (the module uses `go 1.25.5`)
- **Go workspace** — go-ml is part of the `host-uk/core` Go workspace; `replace` directives in `go.mod` resolve sibling modules from local paths
### Required sibling modules (local paths)
| Module | Local path | Notes |
|--------|-----------|-------|
| `forge.lthn.ai/core/go` | `../go` | Framework, process management, logging |
| `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Token interfaces |
| `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend |
All three must be checked out as siblings of `go-ml` (i.e. all four directories share the same parent).
### Platform-specific
- **Metal GPU (`NewMLXBackend`)** — requires macOS on Apple Silicon (darwin/arm64). The `backend_mlx.go` file carries a `//go:build darwin && arm64` build tag and is excluded on other platforms. All other features work on Linux and amd64.
- **llama-server** — the `llama-server` binary from llama.cpp must be on `PATH` or the path provided in `LlamaOpts.LlamaPath`.
- **DuckDB** — uses CGo; a C compiler (`gcc` or `clang`) is required.
---
## Getting Started
```bash
# On first checkout, populate go.sum
go mod download
# Verify the build (all platforms)
go build ./...
# Verify the build excluding Metal backend (Linux / CI)
go build -tags nomlx ./...
```
---
## Build and Test Commands
```bash
# Run all tests
go test ./...
# Run with race detector (recommended before committing)
go test -race ./...
# Run a single test by name
go test -v -run TestHeuristic ./...
go test -v -run TestEngine_ScoreAll_ConcurrentSemantic ./...
# Run benchmarks
go test -bench=. ./...
go test -bench=BenchmarkHeuristicScore ./...
# Static analysis
go vet ./...
# Tidy dependencies
go mod tidy
```
---
## Test Patterns
### Naming convention
Tests use a `_Good`, `_Bad`, `_Ugly` suffix pattern:
- `_Good` — happy path (expected success)
- `_Bad` — expected error conditions (invalid input, unreachable server)
- `_Ugly` — panic and edge-case paths
### Mock backends
For tests that exercise `Backend`-dependent code (judge, agent, scoring engine) without a real inference server, implement `Backend` directly:
```go
type mockBackend struct {
response string
err error
}
func (m *mockBackend) Generate(_ context.Context, _ string, _ ml.GenOpts) (string, error) {
return m.response, m.err
}
func (m *mockBackend) Chat(_ context.Context, _ []ml.Message, _ ml.GenOpts) (string, error) {
return m.response, m.err
}
func (m *mockBackend) Name() string { return "mock" }
func (m *mockBackend) Available() bool { return true }
```
### Mock TextModel
For tests that exercise `InferenceAdapter` without Metal GPU hardware, implement `inference.TextModel`:
```go
type mockTextModel struct {
tokens []string
err error
}
func (m *mockTextModel) Generate(ctx context.Context, prompt string, opts ...inference.GenerateOption) iter.Seq[inference.Token] {
return func(yield func(inference.Token) bool) {
for _, t := range m.tokens {
if !yield(inference.Token{Text: t}) {
return
}
}
}
}
// ... implement remaining TextModel methods
func (m *mockTextModel) Err() error { return m.err }
```
### Mock RemoteTransport
For agent tests that would otherwise require an SSH connection:
```go
type fakeTransport struct {
outputs map[string]string
errors map[string]error
}
func (f *fakeTransport) Run(_ context.Context, cmd string) (string, error) {
if err, ok := f.errors[cmd]; ok {
return "", err
}
return f.outputs[cmd], nil
}
func (f *fakeTransport) CopyFrom(_ context.Context, _, _ string) error { return nil }
func (f *fakeTransport) CopyTo(_ context.Context, _, _ string) error { return nil }
```
Inject via `AgentConfig.Transport`:
```go
cfg := &ml.AgentConfig{
Transport: &fakeTransport{outputs: map[string]string{...}},
}
```
### HTTP mock server
For `HTTPBackend` tests, use `net/http/httptest`:
```go
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
json.NewEncoder(w).Encode(map[string]any{
"choices": []map[string]any{
{"message": map[string]string{"role": "assistant", "content": "hello"}},
},
})
}))
defer srv.Close()
backend := ml.NewHTTPBackend(srv.URL, "test-model")
```
---
## Adding a New Backend
A backend must implement `ml.Backend`:
```go
type Backend interface {
Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
Name() string
Available() bool
}
```
### Steps
1. Create `backend_{name}.go` in the package root.
2. Add the `// SPDX-Licence-Identifier: EUPL-1.2` header.
3. Add a compile-time interface check:
```go
var _ Backend = (*MyBackend)(nil)
```
4. Implement `Generate` as a thin wrapper around `Chat` where possible (follows the pattern of `HTTPBackend`).
5. Create `backend_{name}_test.go` with `_Good`, `_Bad`, and interface-compliance tests.
6. Register the backend in `service.go`'s `OnStartup` if it warrants lifecycle management, or document that callers must register it via `Service.RegisterBackend`.
### GPU backends
If the backend wraps a `go-inference.TextModel` (e.g. a new hardware accelerator), use `InferenceAdapter` rather than re-implementing the polling/streaming logic:
```go
m, err := myBackendPackage.LoadModel(modelPath)
if err != nil {
return nil, err
}
return ml.NewInferenceAdapter(m, "my-backend"), nil
```
---
## Adding a New Scoring Suite
1. Add a new scoring function or type in a dedicated file (e.g. `my_suite.go`).
2. Add the suite name to the suite selection logic in `NewEngine` in `score.go`.
3. Add a result field to `PromptScore` in `types.go`.
4. Add the goroutine fan-out case in `Engine.ScoreAll` in `score.go`.
5. Add race condition tests in `score_race_test.go`.
---
## Coding Standards
### Language
Use **UK English** throughout: colour, organisation, centre, licence (noun), authorise. The only exception is identifiers in external APIs that use American spellings — do not rename those.
### File headers
Every new file must begin with:
```go
// SPDX-Licence-Identifier: EUPL-1.2
```
### Strict types
All parameters and return types must be explicitly typed. Avoid `interface{}` or `any` except at JSON unmarshalling boundaries.
### Import grouping
Three groups, each separated by a blank line:
```go
import (
	"context" // stdlib
	"fmt"

	"forge.lthn.ai/core/go-inference" // forge.lthn.ai modules

	"github.com/stretchr/testify/assert" // third-party
)
```
### Error wrapping
Use `fmt.Errorf("context: %w", err)` for wrapping. Use `log.E("pkg.Type.Method", "what failed", err)` from the Core framework for structured error logging with stack context.
### Concurrency
- Protect shared maps with `sync.RWMutex` or `sync.Mutex` as appropriate.
- Use semaphore channels (buffered `chan struct{}`) to bound goroutine concurrency rather than `sync.Pool` or `errgroup` with fixed limits.
- Always check `model.Err()` after exhausting a `go-inference` token iterator — the iterator itself carries no error; the error is stored on the model.
---
## Conventional Commits
Use the following scopes:
| Scope | When to use |
|-------|-------------|
| `backend` | Changes to any `backend_*.go` file or the `adapter.go` bridge |
| `scoring` | Changes to `score.go`, `heuristic.go`, `judge.go`, `exact.go` |
| `probes` | Changes to `probes.go` or capability probe definitions |
| `agent` | Changes to any `agent_*.go` file |
| `service` | Changes to `service.go` or `Options` |
| `types` | Changes to `types.go` or `inference.go` interfaces |
| `gguf` | Changes to `gguf.go` |
Examples:
```
feat(backend): add ROCm backend via go-rocm InferenceAdapter
fix(scoring): handle nil ContentScores when content probe not found
refactor(agent): replace SSHCommand with SSHTransport.Run
test(probes): add Check function coverage for all 23 probes
```
---
## Co-Author and Licence
Every commit must include:
```
Co-Authored-By: Virgil <virgil@lethean.io>
```
The licence is **EUPL-1.2**. All source files carry the SPDX identifier in the header. Do not add licence headers to test files; the package-level declaration covers them.
---
## Forge Remote
The authoritative remote is `forge.lthn.ai/core/go-ml`:
```bash
git push forge main
```
The SSH remote URL is `ssh://git@forge.lthn.ai:2223/core/go-ml.git`. HTTPS authentication is not configured — always push via SSH.

---

**New file: `docs/history.md`** (194 lines)
# go-ml Project History
## Origin: Extraction from go-ai (19 February 2026)
go-ml began as the `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. The monolith had grown to approximately 14,000 LOC and 53% of that was the ML subsystem. The ML code had zero internal dependencies on the rest of go-ai — it imported only `go-mlx` (external) and the Core `go` framework. The extraction was therefore clean: lift the directory, adjust the module path, and update the one import in go-ai that referenced it.
**What was extracted:**
- 41 Go source files (~7,494 LOC, excluding tests)
- 6 test files covering backends, heuristic, judge, exact, probes, and score
- All InfluxDB, DuckDB, Parquet, GGUF, and agent code
**After extraction:**
- go-ai dropped from ~14,000 to ~3,400 LOC (the `ai/` facade and `mcp/` hub remain there)
- go-ml became an independent module at `forge.lthn.ai/core/go-ml`
---
## Phase 1: go-inference Migration (Complete)
**Commit range:** `c3c2c14` (initial fix) through adapter and reverse adapter work.
**Problem:** The original `backend_mlx.go` imported subpackages from go-mlx (`go-mlx/cache`, `go-mlx/model`, `go-mlx/sample`, `go-mlx/tokenizer`) that no longer existed after go-mlx's Phase 4 refactoring. The file was 253 LOC of hand-rolled tokenisation, KV cache management, sampling loops, and memory cleanup — and none of it compiled.
**Solution:** Introduce `go-inference` as the abstraction layer between go-ml and hardware backends.
### Step 1.1 — Add go-inference dependency
Added `forge.lthn.ai/core/go-inference` to `go.mod` with a `replace` directive pointing to the local sibling checkout.
### Step 1.2 — Write InferenceAdapter (`adapter.go`)
Created `InferenceAdapter`, which wraps a `go-inference.TextModel` (returning `iter.Seq[Token]`) and exposes it as `ml.Backend` + `ml.StreamingBackend` (returning strings / calling `TokenCallback`). Thirteen test cases verified token collection, streaming, callback error propagation, context cancellation, empty output, and model errors after partial generation.
Key design decision: after exhausting the iterator, `model.Err()` is checked separately. The iterator itself does not carry errors; partial output is returned alongside the error so callers can decide whether to use or discard it.
### Step 1.3 — Rewrite `backend_mlx.go`
Replaced 253 LOC with approximately 35 LOC. The blank import `_ "forge.lthn.ai/core/go-mlx"` registers the Metal backend via go-mlx's `init()`. `inference.LoadModel()` then handles model loading, and `InferenceAdapter` handles the rest.
Memory controls (cache limits, memory limits) were deferred: go-mlx handles them internally, and callers that need explicit control can call `mlx.SetCacheLimit()` directly.
### Step 1.4 — Reverse adapters (`backend_http_textmodel.go`)
Added `HTTPTextModel` and `LlamaTextModel`, which wrap the existing `ml.Backend` implementations to satisfy `inference.TextModel`. This enables HTTP and llama-server backends to be used in packages (go-ai, go-i18n) that consume the go-inference interface. Since HTTP backends return complete strings rather than streaming tokens, each response is yielded as a single `Token`.
17 tests for `HTTPTextModel` and 2 for `LlamaTextModel` all pass.
### Step 1.5 — Downstream verification
Confirmed that `service.go` (`Backend.Generate()`), `judge.go` (`judgeChat()`), and `go-ai/mcp/tools_ml.go` (`ml.Service`) required no changes — `InferenceAdapter` satisfies `ml.Backend`, and the existing consumers are unaffected.
---
## Phase 2: Backend Consolidation (Complete)
**Commit range:** `747e703` (Message unification) through `convertOpts` extension.
**Audit (Virgil, 20 February 2026):** Only two files in the entire ecosystem call `StreamingBackend` methods: `host-uk/cli/cmd/ml/cmd_serve.go` (SSE streaming at `/v1/completions` and `/v1/chat/completions`) and `cmd/ml/cmd_chat.go` (interactive terminal token echo). All other consumers use `Backend.Generate()` only.
### Step 2.1 — Unify Message types
`ml.Message` was a separate struct identical to `inference.Message`. Replaced with a type alias:
```go
type Message = inference.Message
```
This eliminated the `convertMessages()` helper from `adapter.go` and all explicit conversion sites. Backward-compatible: all existing callers continue to use `ml.Message` and compile unchanged.
### Step 2.2 — Extend GenOpts
Added `TopK`, `TopP`, and `RepeatPenalty` to `ml.GenOpts` to match the fields available in `inference.GenerateConfig`. Updated `convertOpts()` in `adapter.go` to map the new fields. Existing callers that only set `Temperature`, `MaxTokens`, and `Model` continue to work unchanged.
**Field type note:** `inference.GenerateConfig` uses `float32` for temperature and sampling fields; `ml.GenOpts` uses `float64` to match the conventions in the rest of go-ml. `convertOpts()` performs the narrowing conversion explicitly.
### Step 2.3 — Deprecate StreamingBackend
Added deprecation comment to `StreamingBackend` in `inference.go`. The interface is not removed because `host-uk/cli` depends on it. Migration of those CLI files is out of scope for go-ml.
### Step 2.4 — Document backend architecture
Added the "Backend Architecture" section to `CLAUDE.md` documenting the two interface families, adapter directions, and migration guidance.
---
## Phase 3: Agent Loop Modernisation (Complete)
The original `agent.go` was a 1,070 LOC file mixing SSH commands, InfluxDB line protocol construction, probe evaluation, checkpoint discovery, and JSONL buffering. It had zero tests.
### Step 3.1 — Split into five files (Commit `eae9ec9`)
| File | LOC | Contents |
|------|-----|---------|
| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` |
| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `GetScoredLabels`, `FindUnscored`, `ProcessOne`, `isMLXNative` |
| `agent_eval.go` | 397 | `processMLXNative`, `processWithConversion`, `RunCapabilityProbes`, `RunCapabilityProbesFull`, `RunContentProbes`, `ProbeResult` types |
| `agent_influx.go` | 291 | `ScoreCapabilityAndPush`, `ScoreContentAndPush`, `PushCapability*`, `BufferInfluxResult`, `ReplayInfluxBuffer` |
| `agent_ssh.go` | 102 | `SSHCommand`, `SCPFrom`, `SCPTo`, `fileBase`, `EnvOr`, `IntEnvOr`, `ExpandHome` |
`go build ./...`, `go test ./...`, and `go vet ./...` all passed after the split.
### Step 3.2 — Abstract SSH transport (Commit `1c2a6a6`)
Introduced the `RemoteTransport` interface with `Run`, `CopyFrom`, and `CopyTo` methods. `SSHTransport` implements this interface using the system `ssh` and `scp` binaries with functional options (`WithPort`, `WithTimeout`). `AgentConfig.Transport` accepts any `RemoteTransport`, with lazy initialisation to an `SSHTransport` when nil.
The old package-level functions `SSHCommand`, `SCPFrom`, and `SCPTo` are retained as deprecated wrappers that delegate to `AgentConfig.Transport`.
### Step 3.3 — Extract hardcoded infrastructure (Commit `12f3a1c`)
Extracted 15 constants from scattered magic values across 7 files:
- `EpochBase` — InfluxDB timestamp origin (Unix timestamp for 15 February 2025 00:00 UTC)
- Five InfluxDB measurement names (`MeasurementCapabilityScore`, `MeasurementCapabilityJudge`, `MeasurementContentScore`, `MeasurementProbeScore`, `MeasurementTrainingLoss`)
- Two DuckDB table names (`TableCheckpointScores`, `TableProbeResults`)
- Probe evaluation defaults (`CapabilityTemperature`, `CapabilityMaxTokens`, `ContentTemperature`, `ContentMaxTokens`, `MaxStoredResponseLen`)
- `InfluxBufferFile` — JSONL buffer filename
- `LogSeparatorWidth` — banner line width
Hardcoded probe counts replaced with `len(CapabilityProbes)` and `len(ContentProbes)`.
### Step 3.4 — Agent tests (Commit `3e22761`)
First test coverage for the agent subsystem:
- `AdapterMeta()` — 8 tests: known families (12 entries), variant suffixes, subdirectory patterns, unknown fallback, no-prefix edge case
- `FindUnscored()` — 5 tests: all unscored (sorted), some scored, all scored, empty input, nil scored map
- `BufferInfluxResult()`/`ReplayInfluxBuffer()` — 4 tests: JSONL round-trip, multiple entries, empty file, missing file
- `DiscoverCheckpoints()` — 6 tests using `fakeTransport`: 3 checkpoints across 2 dirs, subdirectory pattern, no adapters, SSH error, filter pattern, no safetensors files
---
## Phase 4: Test Coverage (Complete, Commit `09bf403`)
Added four test files covering previously untested areas:
**`backend_llama_test.go`** (20 tests) — Uses `net/http/httptest` to mock the llama-server HTTP API. Covers: `Name`, `Available` (4 variants including process-not-started and health endpoint failure), `Generate` (6 variants including context cancellation, empty choices, and opts forwarding), `Chat` (3 variants), `Stop`, constructor (4 variants), and interface compliance.
**`backend_mlx_test.go`** (8 tests) — Uses a mock `inference.TextModel`. No build tag required — tests run on all platforms without Metal GPU hardware. Covers: `Generate`, `Chat`, streaming, model error after partial output, `Close`, direct model access via `Model()`, interface compliance, and `convertOpts` field mapping.
**`score_race_test.go`** (6 tests) — Race condition tests run with `-race`:
- `ConcurrentSemantic` — 20 responses scored with concurrency=4; verifies no data races on the result map
- `ConcurrentMixedSuites` — semantic + standard + content fan-out simultaneously
- `SemaphoreBoundary` — concurrency=1; verifies that at most 1 goroutine holds the semaphore at once
- `ContextCancellation` — 400 error response from judge returns nil semantic score without panicking
- `HeuristicOnlyNoRace` — 50 responses, heuristic only (no goroutines spawned); regression check
- `MultiModelConcurrent` — 4 models × 5 concurrent goroutines writing to the results map
**`benchmark_test.go`** (25 benchmarks, baselines on M3 Ultra):
- `HeuristicScore` — 5 input sizes (10010,000 characters): 25µs8.8ms
- `ExactMatch` — 4 patterns: 171ns2.1µs
- `JudgeExtractJSON` — 6 response variants: 2.53.4µs
- `Judge` round-trip — 2 suites (semantic, content): ~52µs
- `ScoreAll` — 2 modes (heuristic only, full): 25µs4.5ms
- Sub-components — 5 heuristic stages: 244ns88µs
---
## Known Limitations
### StreamingBackend retention
`ml.StreamingBackend` cannot be removed until `host-uk/cli/cmd/ml/cmd_serve.go` and `cmd/ml/cmd_chat.go` are migrated to use `inference.TextModel` iterators directly. That migration is out of scope for go-ml and must be tracked in the `host-uk/cli` repository.
### LlamaTextModel streaming gap
`LlamaTextModel` implements `inference.TextModel` but does not actually stream tokens — it yields the complete llama-server HTTP response as a single `Token`. True token-level streaming from llama-server would require implementing SSE parsing, which is a separate effort.
### Agent infrastructure coupling
`AgentConfig` contains fields (`M3Host`, `M3User`, `M3SSHKey`, `M3AdapterBase`, `InfluxURL`, `InfluxDB`) that are tightly coupled to a specific deployment topology (M3 Mac + InfluxDB on `10.69.69.165`). While the `RemoteTransport` abstraction decouples tests from SSH, production deployments still hardcode the M3 as the checkpoint host.
### EpochBase timestamp
The `EpochBase` constant (`1739577600`, corresponding to 15 February 2025 00:00 UTC) is embedded in InfluxDB line protocol timestamps. All capability/content/probe timestamps derive from this base plus checkpoint iteration offsets. Changing `EpochBase` would require rewriting all historical InfluxDB data.
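The derivation looks roughly like this — note that the one-second-per-iteration stride is an illustrative assumption, not necessarily the package's actual offset scheme:

```go
package main

import "fmt"

// EpochBase matches the constant described above: 2025-02-15T00:00:00Z.
const EpochBase int64 = 1739577600

// lineTimestamp derives an InfluxDB line-protocol timestamp (nanoseconds)
// from a checkpoint iteration offset.
func lineTimestamp(iteration int64) int64 {
	return (EpochBase + iteration) * 1e9
}

func main() {
	// e.g. a probe measurement for checkpoint iteration 42
	fmt.Printf("probe,model=demo pass=1i %d\n", lineTimestamp(42))
}
```

Because the base is baked into every stored point's timestamp, moving it shifts the entire time axis — hence the need to rewrite history if it ever changes.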
### HTTPBackend classify
`HTTPTextModel.Classify` returns an "unsupported" error. Adding classification to an OpenAI-compatible HTTP backend would require either a dedicated classification endpoint on the server or a prompt-engineering workaround over the chat endpoint.
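The prompt-engineering route would turn classification into constrained generation against the chat endpoint. A sketch of the two halves — prompt construction and reply parsing — with all names hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// buildClassifyPrompt asks the model to answer with exactly one label.
// This is an illustrative workaround, not part of the package's API.
func buildClassifyPrompt(text string, labels []string) string {
	return fmt.Sprintf(
		"Classify the text into exactly one label from [%s]. Reply with the label only.\n\nText: %s",
		strings.Join(labels, ", "), text)
}

// parseLabel maps the model's free-text reply back onto a known label,
// tolerating whitespace and case drift.
func parseLabel(reply string, labels []string) (string, bool) {
	r := strings.TrimSpace(reply)
	for _, l := range labels {
		if strings.EqualFold(l, r) {
			return l, true
		}
	}
	return "", false
}

func main() {
	labels := []string{"positive", "negative"}
	fmt.Println(buildClassifyPrompt("great work", labels))
	l, ok := parseLabel(" Positive\n", labels)
	fmt.Println(l, ok) // positive true
}
```

The obvious weakness is the parse step: a chatty model that refuses to answer with a bare label forces a retry or a fallback, which is why a dedicated endpoint remains the cleaner fix.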
### DuckDB CGo
The `go-duckdb` dependency requires CGo. This prevents cross-compilation from macOS to Linux without a cross-compilation toolchain. Binaries that import go-ml will require a C compiler at build time.
---
## Future Considerations
- **ROCm backend**`go-rocm` provides a llama-server subprocess backend for AMD GPUs. Once published, it can be wrapped with `InferenceAdapter` in the same pattern as `backend_mlx.go`, gated with a `//go:build linux && amd64` constraint.
- **StreamingBackend removal** — Once `host-uk/cli` is migrated to `iter.Seq[Token]`, the `StreamingBackend` interface and `InferenceAdapter`'s `GenerateStream`/`ChatStream` methods can be removed.
- **go-i18n integration** — go-i18n Phase 2a requires 5,000 sentences/second classification throughput from Gemma3-1B. The `InferenceAdapter` and `inference.TextModel.BatchGenerate` provide the interface; the performance target depends on go-mlx's batching implementation.
- **LEM Lab pipeline wiring** — Integration tests for `backend_mlx.go` with a real model are deferred until the LEM Lab inference pipeline is fully wired. A smoke test against a small quantised model would confirm end-to-end Metal GPU inference through the go-inference abstraction.
- **Charm SSH** — The `SSHTransport` currently shells out to the system `ssh` and `scp` binaries. Replacing these with pure-Go SSH via `charmbracelet/keygen` and a native SSH client would eliminate the subprocess dependency and improve testability.
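The ROCm wrapping pattern mentioned above would mirror `backend_mlx.go`: a platform-specific model type satisfying the text-model interface, wrapped by the adapter. A compilable sketch (the `//go:build linux && amd64` constraint is shown as a comment so the example runs anywhere; go-rocm's actual API is unpublished, so every name here is a placeholder):

```go
package main

import "fmt"

// In the real file this would be gated with:
//
//	//go:build linux && amd64

// TextModel is a stand-in for the inference package's interface.
type TextModel interface {
	Generate(prompt string) (string, error)
}

// rocmModel stands in for a go-rocm llama-server subprocess backend.
type rocmModel struct{}

func (rocmModel) Generate(p string) (string, error) {
	return "rocm: " + p, nil
}

// InferenceAdapter wraps any TextModel into the ml backend surface,
// the same pattern backend_mlx.go uses for go-mlx.
type InferenceAdapter struct{ m TextModel }

func (a InferenceAdapter) Generate(prompt string) (string, error) {
	return a.m.Generate(prompt)
}

func main() {
	b := InferenceAdapter{m: rocmModel{}}
	out, _ := b.Generate("ping")
	fmt.Println(out) // rocm: ping
}
```

Because the adapter only sees the interface, the AMD path needs no changes to the scoring or orchestration layers — the build constraint alone selects the backend.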