From 3918051112d01675592d2e0d7a5029c95d44cc0d Mon Sep 17 00:00:00 2001
From: Snider
Date: Fri, 20 Feb 2026 15:01:55 +0000
Subject: [PATCH] docs: graduate TODO/FINDINGS into production documentation

Replace internal task tracking (TODO.md, FINDINGS.md) with structured documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil
---
 CLAUDE.md            | 143 ++--------------
 FINDINGS.md          | 208 ------------------------
 TODO.md              | 193 ----------------------
 docs/architecture.md | 378 +++++++++++++++++++++++++++++++++++++++++++
 docs/development.md  | 307 +++++++++++++++++++++++++++++++++++
 docs/history.md      | 194 ++++++++++++++++++++++
 6 files changed, 892 insertions(+), 531 deletions(-)
 delete mode 100644 FINDINGS.md
 delete mode 100644 TODO.md
 create mode 100644 docs/architecture.md
 create mode 100644 docs/development.md
 create mode 100644 docs/history.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 92e5289..9298b0c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,43 +1,18 @@
-# CLAUDE.md — go-ml Domain Expert Guide
+# CLAUDE.md — go-ml Agent Guide
 
-You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates your work via TODO.md. Pick up tasks in phase order, mark `[x]` when done, commit and push.
+You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates work. Pick up tasks in phase order, mark `[x]` when done, commit and push.
 
 ## What This Package Does
 
-ML inference backends, scoring engine, and agent orchestrator. 7.5K LOC across 41 Go files. Provides:
+ML inference backends, scoring engine, and agent orchestrator. ~7,500 LOC across 41 Go files. 
Provides: - **Pluggable inference backends** — MLX/Metal (darwin/arm64), llama.cpp (subprocess), HTTP/Ollama (OpenAI-compatible) -- **Multi-suite scoring engine** — Heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K) -- **23 capability probes** — Binary pass/fail tests across 16 categories (math, logic, code, etc.) -- **GGUF model management** — Format parsing, conversion, inventory +- **Multi-suite scoring engine** — heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K) +- **23 capability probes** — binary pass/fail tests across 16 categories +- **GGUF model management** — format parsing, conversion, inventory - **Agent orchestrator** — SSH checkpoint discovery, InfluxDB streaming, batch evaluation -## Critical Context: go-inference Migration - -**Phase 1 is complete.** Both directions of the bridge are implemented: - -1. **Forward adapter** (`adapter.go`): `inference.TextModel` (iter.Seq) -> `ml.Backend`/`ml.StreamingBackend` (string/callback). Used by `backend_mlx.go` to wrap Metal GPU models. -2. **Reverse adapters** (`backend_http_textmodel.go`): `HTTPBackend`/`LlamaBackend` -> `inference.TextModel`. Enables HTTP and llama-server backends to be used anywhere that expects a go-inference TextModel. - -### Interface Bridge (DONE) - -``` -ml.Backend (string) <──adapter.go──> inference.TextModel (iter.Seq[Token]) - <──backend_http_textmodel.go──> -``` - -- `InferenceAdapter`: TextModel -> Backend + StreamingBackend (for MLX, ROCm, etc.) -- `HTTPTextModel`: HTTPBackend -> TextModel (for remote APIs) -- `LlamaTextModel`: LlamaBackend -> TextModel (for managed llama-server) - -### backend_mlx.go (DONE) - -Rewritten from 253 LOC to ~35 LOC. Loads via `inference.LoadModel()` and wraps in `InferenceAdapter`. Uses go-mlx's Metal backend registered via `init()`. 
- -### Downstream Consumers Verified - -- `service.go` — `Service.Generate()` calls `Backend.Generate()`. InferenceAdapter satisfies Backend. No changes needed. -- `judge.go` — `Judge.judgeChat()` calls `Backend.Generate()`. Same contract, works as before. +See `docs/architecture.md` for the full architecture reference. ## Commands @@ -45,7 +20,7 @@ Rewritten from 253 LOC to ~35 LOC. Loads via `inference.LoadModel()` and wraps i go mod download # FIRST RUN: populate go.sum go test ./... # Run all tests go test -v -run TestHeuristic # Single test -go test -bench=. ./... # Benchmarks (none exist yet) +go test -bench=. ./... # Benchmarks go test -race ./... # Race detector go vet ./... # Static analysis ``` @@ -56,103 +31,16 @@ All resolve via `replace` directives in go.mod: | Module | Local Path | Notes | |--------|-----------|-------| -| `forge.lthn.ai/core/go` | `../host-uk/core` | Framework (ServiceRuntime, process, log) | +| `forge.lthn.ai/core/go` | `../go` | Framework (ServiceRuntime, process, log) | | `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend (darwin/arm64 only) | | `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Backend interfaces | -## Architecture - -### Backends (pluggable inference) - -| File | Backend | Status | -|------|---------|--------| -| `adapter.go` | InferenceAdapter (TextModel -> Backend) | DONE — bridges go-inference to ml.Backend | -| `backend_mlx.go` | MLX/Metal GPU | DONE — uses go-inference LoadModel + InferenceAdapter | -| `backend_http.go` | HTTP API (OpenAI-compatible) | Works as ml.Backend | -| `backend_http_textmodel.go` | HTTPTextModel + LlamaTextModel | DONE — reverse wrappers (Backend -> TextModel) | -| `backend_llama.go` | llama-server subprocess | Works as ml.Backend | -| `ollama.go` | Ollama helpers | Works | - -### Scoring Engine - -| File | LOC | Purpose | -|------|-----|---------| -| `score.go` | 212 | Concurrent scoring orchestrator (semaphore-bounded workers) | -| `heuristic.go` | 
258 | 9 regex-based metrics, LEK composite score | -| `judge.go` | 205 | LLM-as-judge (6 scoring methods) | -| `exact.go` | 77 | GSM8K exact-match with numeric extraction | -| `probes.go` | 273 | 23 binary capability probes across 16 categories | - -### Data Pipeline - -| File | LOC | Purpose | -|------|-----|---------| -| `agent.go` | 1,070 | Scoring agent (SSH checkpoint discovery, InfluxDB) | -| `worker.go` | 403 | LEM API worker for distributed inference | -| `service.go` | 162 | Core framework integration (lifecycle, backend registry) | -| `ingest.go` | 384 | JSONL response loading | -| `db.go` | 258 | DuckDB analytics storage | -| `gguf.go` | 369 | GGUF model format parsing | - -### Backend Architecture - -Two interface families coexist, bridged by adapters: - -**`inference.TextModel`** (iterator-based) is the **preferred API** for new code. Returns `iter.Seq[inference.Token]` for streaming. Defined in `forge.lthn.ai/core/go-inference`. Use this for GPU backends (MLX Metal, ROCm) and any code that needs token-level control. - -**`ml.Backend`** (string-based) is the **compatibility layer**, still fully supported. Returns complete strings. Used by `service.go`, `judge.go`, and external consumers like `host-uk/cli`. - -**`ml.StreamingBackend`** is **deprecated**. New code should use `inference.TextModel` with `iter.Seq[Token]` directly. Retained for backward compatibility with existing callers. - -**Adapters:** - -| Adapter | Direction | File | -|---------|-----------|------| -| `InferenceAdapter` | `inference.TextModel` -> `ml.Backend` + `ml.StreamingBackend` | `adapter.go` | -| `HTTPTextModel` | `ml.HTTPBackend` -> `inference.TextModel` | `backend_http_textmodel.go` | -| `LlamaTextModel` | `ml.LlamaBackend` -> `inference.TextModel` | `backend_http_textmodel.go` | - -**Unified types (Phase 2):** - -- `ml.Message` is a type alias for `inference.Message` — the types are identical, no conversion needed between packages. 
-- `ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model overrides. The `convertOpts()` helper maps GenOpts to `[]inference.GenerateOption`. - -### Key Types - -```go -// Backend interface (inference.go) — compatibility layer -type Backend interface { - Generate(ctx context.Context, prompt string, opts GenOpts) (string, error) - Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error) - Name() string - Available() bool -} - -// Deprecated: use inference.TextModel with iter.Seq[Token] directly -type StreamingBackend interface { - Backend - GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error - ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error -} - -type GenOpts struct { - Temperature float64 - MaxTokens int - Model string // override model for this request - TopK int // top-k sampling (0 = disabled) - TopP float64 // nucleus sampling threshold (0 = disabled) - RepeatPenalty float64 // repetition penalty (0 = disabled, 1.0 = no penalty) -} - -// Type alias — identical to inference.Message -type Message = inference.Message -``` - ## Coding Standards -- **UK English**: colour, organisation, centre -- **Tests**: testify assert/require (existing), Pest-style names welcome for new tests -- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(mlx):` +- **UK English**: colour, organisation, centre, licence (noun) +- **SPDX header**: `// SPDX-Licence-Identifier: EUPL-1.2` in every new source file +- **Tests**: testify assert/require; `_Good`/`_Bad`/`_Ugly` suffix pattern +- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(agent):` - **Co-Author**: `Co-Authored-By: Virgil ` - **Licence**: EUPL-1.2 - **Imports**: stdlib → forge.lthn.ai → third-party, each group separated by blank line @@ -161,8 +49,3 @@ type Message = inference.Message - **Repo**: `forge.lthn.ai/core/go-ml` - **Push via SSH**: `git push 
forge main` (remote: `ssh://git@forge.lthn.ai:2223/core/go-ml.git`) - -## Task Queue - -See `TODO.md` for prioritised work. Phase 1 (go-inference migration) is the critical path. -See `FINDINGS.md` for research notes and interface mapping. diff --git a/FINDINGS.md b/FINDINGS.md deleted file mode 100644 index ae42e4f..0000000 --- a/FINDINGS.md +++ /dev/null @@ -1,208 +0,0 @@ -# FINDINGS.md — go-ml Research & Discovery - -## 2026-02-19: Split from go-ai (Virgil) - -### Origin - -Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only. - -### What Was Extracted - -- 41 Go files (~7,494 LOC excluding tests) -- 6 test files (backend_http, exact, heuristic, judge, probes, score) -- ml/ was 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub). - -### Dependencies - -- `forge.lthn.ai/core/go-mlx` — Metal GPU inference (backend_mlx.go, darwin/arm64 only) -- `forge.lthn.ai/core/go-inference` — Shared TextModel/Backend/Token interfaces (target for Phase 1) -- `forge.lthn.ai/core/go` — Framework services, process management, logging -- `github.com/marcboeker/go-duckdb` — Analytics storage -- `github.com/parquet-go/parquet-go` — Columnar data I/O -- `github.com/stretchr/testify` — Test assertions - -### Consumers - -- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools (uses `ml.Service`, `ml.GenOpts`, `ml.Backend`) -- LEM Lab — Uses MLXBackend for chat inference -- go-i18n Phase 2a — Needs 5K sentences/sec Gemma3-1B classification (blocked on go-inference) - -## go-inference Interface Mapping - -### Type Correspondence - -| go-ml | go-inference | Notes | -|-------|-------------|-------| -| `ml.Backend` | `inference.Backend` | Different semantics: ml returns string, inference returns TextModel | -| `ml.StreamingBackend` | (built into TextModel) | iter.Seq[Token] is inherently streaming | -| 
`ml.GenOpts` | `inference.GenerateConfig` | Use functional options: `WithMaxTokens(n)` etc. | -| `ml.Message` | `inference.Message` | Identical struct: Role + Content | -| `ml.TokenCallback` | (not needed) | iter.Seq[Token] replaces callbacks | -| (no equivalent) | `inference.Token` | `{ID int32, Text string}` | -| (no equivalent) | `inference.TextModel` | Generate/Chat return iter.Seq[Token] | - -### Method Mapping - -``` -ml.Backend.Generate(ctx, prompt, GenOpts) → (string, error) - ↕ InferenceAdapter collects tokens -inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token] - -ml.StreamingBackend.GenerateStream(ctx, prompt, opts, TokenCallback) → error - ↕ InferenceAdapter forwards tokens to callback -inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token] - -ml.GenOpts{Temperature: 0.7, MaxTokens: 2048} - ↕ convertOpts helper -inference.WithTemperature(0.7), inference.WithMaxTokens(2048) -``` - -### backend_mlx.go Before/After - -**Before** (253 LOC — BROKEN, old subpackage imports): -```go -import ( - "forge.lthn.ai/core/go-mlx" - "forge.lthn.ai/core/go-mlx/cache" // REMOVED - "forge.lthn.ai/core/go-mlx/model" // REMOVED - "forge.lthn.ai/core/go-mlx/sample" // REMOVED - "forge.lthn.ai/core/go-mlx/tokenizer"// REMOVED -) - -type MLXBackend struct { - model model.Model - tok *tokenizer.Tokenizer - caches []cache.Cache - sampler sample.Sampler - // ... 
manual tokenisation, KV cache mgmt, sampling loop, memory cleanup -} -``` - -**After** (~60 LOC — uses go-inference + InferenceAdapter): -```go -import ( - "forge.lthn.ai/core/go-inference" - _ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init() -) - -func NewMLXBackend(modelPath string) (*InferenceAdapter, error) { - m, err := inference.LoadModel(modelPath) - if err != nil { return nil, fmt.Errorf("mlx: %w", err) } - return &InferenceAdapter{model: m, name: "mlx"}, nil -} -``` - -All tokenisation, KV cache, sampling, and memory management is now handled inside go-mlx's `internal/metal/` package, accessed through the go-inference `TextModel` interface. - -## Scoring Engine Architecture - -### 5 Suites - -| Suite | Method | LLM needed? | Metrics | -|-------|--------|-------------|---------| -| **Heuristic** | Regex + word analysis | No | 9 metrics → LEK composite | -| **Semantic** | LLM-as-judge | Yes | 4 dimensions (sovereignty, ethical, creative, self-concept) | -| **Content** | LLM-as-judge | Yes | 6 sovereignty probes (CCP, truth, engagement, etc.) | -| **Standard** | LLM-as-judge | Yes | TruthfulQA, DoNotAnswer, Toxigen | -| **Exact** | Numeric extraction | No | GSM8K answer matching | - -### LEK Score Formula - -``` -LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5 - - ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20 -``` - -Positive signals: engagement depth, creative form, emotional register, first-person voice. -Negative signals: RLHF compliance markers, formulaic preambles, text degeneration, empty/broken output. - -### Concurrency Model - -`Engine.ScoreAll()` fans out goroutines bounded by semaphore (`concurrency` setting). Heuristic runs inline (instant). Semantic/content/standard run via worker pool with `sync.WaitGroup`. Results collected into `[]PromptScore` via mutex. 
- -## Phase 2 Audit: StreamingBackend Usage (Virgil, 20 Feb 2026) - -### Callers of GenerateStream/ChatStream - -Only 2 files across the entire ecosystem call StreamingBackend methods: - -1. **`host-uk/cli/cmd/ml/cmd_serve.go`** (lines 146, 201, 319) - - Type-asserts `backend.(ml.StreamingBackend)` for SSE streaming - - `/v1/completions` → `streamer.GenerateStream()` (line 201) - - `/v1/chat/completions` → `streamer.ChatStream()` (line 319) - - Has non-streaming fallback: `backend.Generate()` when assertion fails - -2. **`host-uk/cli/cmd/ml/cmd_chat.go`** - - Direct `ChatStream()` call for terminal token-by-token echo - - No fallback — assumes backend supports streaming - -### Non-streaming consumers (use Backend.Generate only) - -| File | Method | Notes | -|------|--------|-------| -| service.go | `Backend.Generate()` | Backend registry dispatch | -| judge.go | `Backend.Generate()` | Via judgeChat() | -| agent.go | `Backend.Generate()` | Probe evaluation | -| expand.go | `Backend.Generate()` | Prompt expansion | -| go-ai/mcp/tools_ml.go | `ml.Service` | Via service layer | - -### Backend Implementation Status - -| Backend | Backend? | StreamingBackend? | Notes | -|---------|----------|-------------------|-------| -| InferenceAdapter | YES | YES | Bridges iter.Seq[Token] → callbacks | -| HTTPBackend | YES | NO | Returns complete string from API | -| LlamaBackend | YES | NO | Returns complete string via HTTP | - -### Conclusion - -StreamingBackend is only needed by `host-uk/cli` (2 files, out of go-ml scope). Safe to deprecate in go-ml with a comment. The actual migration of those CLI files is a separate task for the cli repo. 
- -### GenOpts vs GenerateConfig Field Comparison - -| ml.GenOpts | inference.GenerateConfig | Type | -|-----------|--------------------------|------| -| Temperature | Temperature | float64 vs float32 | -| MaxTokens | MaxTokens | int (same) | -| Model | (none) | string | -| (none) | TopK | int | -| (none) | TopP | float32 | -| (none) | StopTokens | []int32 | -| (none) | RepeatPenalty | float32 | -| (none) | ReturnLogits | bool | - -## Known Issues - -- ~~**backend_mlx.go imports dead subpackages**~~ — FIXED in Phase 1 (`c3c2c14`) -- **agent.go too large** — 1,070 LOC, SSH + InfluxDB + scoring + publishing mixed together -- **Hardcoded infrastructure** — InfluxDB endpoint `10.69.69.165:8181`, M3 SSH details in agent.go -- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists -- **score.go concurrency untested** — No race condition tests -- ~~**Message type duplication**~~ — FIXED in Phase 2 (`747e703`): type alias `Message = inference.Message` - -## Phase 3 Audit: agent.go Structure (Virgil, 20 Feb 2026) - -### File Layout (1,070 LOC) - -| Section | Lines | LOC | Purpose | -|---------|-------|-----|---------| -| Types & Config | 19–112 | ~95 | `AgentConfig`, `Checkpoint`, config maps, `AdapterMeta()` | -| Main Loop | 141–343 | ~200 | `RunAgentLoop()`, checkpoint discovery, unscored filtering | -| Evaluation | 345–700 | ~355 | MLX-native + conversion paths, 4 probe functions | -| Judge & Push | 708–887 | ~180 | Scoring, InfluxDB line protocol, DuckDB dual-write | -| Buffering | 926–977 | ~50 | JSONL buffer for InfluxDB failures | -| SSH/SCP | 979–1070 | ~90 | `SSHCommand()`, `SCPFrom()`, `SCPTo()`, utility helpers | - -### Hardcoded Infrastructure - -- SSH options duplicated across 3 functions: `ConnectTimeout=10, BatchMode=yes, StrictHostKeyChecking=no` -- InfluxDB timestamp base: `1739577600` (13 Feb 2026 00:00 UTC) -- InfluxDB measurements: `probe_score`, `capability_score`, `capability_judge`, `content_score` -- DuckDB tables: 
`checkpoint_scores`, `probe_results` - -### Test Coverage - -Zero tests for agent.go. Testable without infrastructure: -- `AdapterMeta()` — pure function, dirname → metadata -- `FindUnscored()` — filtering logic -- `BufferInfluxResult()`/`ReplayInfluxBuffer()` — JSONL round-trip diff --git a/TODO.md b/TODO.md deleted file mode 100644 index 54ffe67..0000000 --- a/TODO.md +++ /dev/null @@ -1,193 +0,0 @@ -# TODO.md — go-ml Task Queue - -Dispatched from Virgil in core/go. Pick up tasks in phase order. - ---- - -## Phase 1: go-inference Migration (CRITICAL PATH) - -Everything downstream is blocked on this. The old `backend_mlx.go` imports go-mlx subpackages that no longer exist after Phase 4 refactoring. - -### Step 1.1: Add go-inference dependency - -- [x] **Add `forge.lthn.ai/core/go-inference` to go.mod** — Already has a `replace` directive pointing to `../go-inference`. Run `go get forge.lthn.ai/core/go-inference` then `go mod tidy`. Verify the module resolves. - -### Step 1.2: Write the InferenceAdapter - -- [x] **Create `adapter.go`** — Bridge between `go-inference.TextModel` (returns `iter.Seq[Token]`) and `ml.Backend` + `ml.StreamingBackend` (returns `string`/callback). Must implement: - - `Generate()` — collect tokens from iterator into string - - `Chat()` — same, using `TextModel.Chat()` - - `GenerateStream()` — forward tokens to `TokenCallback` - - `ChatStream()` — same for chat - - `Name()` — delegate to `TextModel.ModelType()` - - `Available()` — always true (model already loaded) - - `convertOpts(GenOpts) []inference.GenerateOption` — map `GenOpts` fields to functional options - - **Key mapping**: - ``` - GenOpts.Temperature → inference.WithTemperature(float32(t)) - GenOpts.MaxTokens → inference.WithMaxTokens(n) - GenOpts.Model → (ignored, model already loaded) - ``` - - **Error handling**: After the iterator completes, check `model.Err()` to distinguish EOS from errors (OOM, ctx cancelled). 
- -- [x] **Test adapter.go** — 13 test cases with mock TextModel (all pass). Test cases: - - Normal generation (collect tokens → string) - - Streaming (each token hits callback) - - Callback error stops iteration - - Context cancellation propagates - - Empty output (EOS immediately) - - Model error after partial output - -### Step 1.3: Rewrite backend_mlx.go - -- [x] **Replace backend_mlx.go** — Deleted the 253 LOC that manually handle tokenisation, KV cache, sampling, and memory cleanup. Replaced with ~35 LOC: - ```go - //go:build darwin && arm64 - - package ml - - import ( - "forge.lthn.ai/core/go-inference" - _ "forge.lthn.ai/core/go-mlx" // registers "metal" backend - ) - - func NewMLXBackend(modelPath string) (*InferenceAdapter, error) { - m, err := inference.LoadModel(modelPath) - if err != nil { - return nil, fmt.Errorf("mlx: %w", err) - } - return &InferenceAdapter{model: m, name: "mlx"}, nil - } - ``` - The `InferenceAdapter` from Step 1.2 handles all the Generate/Chat/Stream logic. - -- [x] **Preserve memory controls** — Deferred: go-mlx handles cache/memory limits internally. Callers can use `mlx.SetCacheLimit()`/`mlx.SetMemoryLimit()` directly. No wrapper needed until a concrete use case arises. - -- [x] **Test backend_mlx.go** — Covered by Phase 4 `backend_mlx_test.go` (8 tests via mock TextModel). Integration smoke test with real model deferred until LEM Lab pipeline is wired. - -### Step 1.4: HTTPBackend and LlamaBackend wrappers - -- [x] **HTTPBackend go-inference wrapper** — `backend_http_textmodel.go`: `HTTPTextModel` wraps `HTTPBackend` to implement `inference.TextModel`. Generate/Chat yield entire response as single Token. Classify returns unsupported error. BatchGenerate processes prompts sequentially. 17 tests pass. - -- [x] **LlamaBackend go-inference wrapper** — `backend_http_textmodel.go`: `LlamaTextModel` embeds `HTTPTextModel`, overrides `ModelType()` -> "llama" and `Close()` -> `llama.Stop()`. 2 tests pass. 
- -### Step 1.5: Verify downstream consumers - -- [x] **Service.Generate() still works** — `service.go` calls `Backend.Generate()`. InferenceAdapter satisfies ml.Backend. HTTPBackend/LlamaBackend still implement ml.Backend directly. No changes needed. -- [x] **Judge still works** — `judge.go` calls `Backend.Generate()` via `judgeChat()`. Same Backend contract, works as before. No changes needed. -- [x] **go-ai tools_ml.go** — Uses `ml.Service` directly. `ml.Backend` interface is preserved, no code changes needed in go-ai. - ---- - -## Phase 2: Backend Consolidation - -After Phase 1, both `ml.Backend` (string) and `inference.TextModel` (iterator) coexist. Reconcile. - -### Audit Results (Virgil, 20 Feb 2026) - -**StreamingBackend callers** — Only 2 files in `host-uk/cli`: -- `cmd/ml/cmd_serve.go` lines 146,201,319: Type-asserts `backend.(ml.StreamingBackend)` for SSE streaming at `/v1/completions` and `/v1/chat/completions` -- `cmd/ml/cmd_chat.go`: Direct `ChatStream()` call for interactive terminal token echo - -All other consumers (service.go, judge.go, agent.go, expand.go, go-ai tools_ml.go) use `Backend.Generate()` — NOT streaming. - -**Backend implementations**: -- `InferenceAdapter` → implements Backend + StreamingBackend (via go-inference iter.Seq) -- `HTTPBackend` → implements Backend only (no streaming) -- `LlamaBackend` → implements Backend only (no streaming) - -### Step 2.1: Unify Message types - -- [x] **Type alias ml.Message → inference.Message** — In `inference.go`, replace the `Message` struct with: - ```go - type Message = inference.Message - ``` - This is backward-compatible — all existing callers keep working. Remove the `convertMessages()` helper from `adapter.go` since types are now identical. Verify with `go build ./...` and `go test ./...`. 
- -### Step 2.2: Unify GenOpts - -- [x] **Add inference fields to GenOpts** — Extend `ml.GenOpts` to include the extra fields from `inference.GenerateConfig`: - ```go - type GenOpts struct { - Temperature float64 - MaxTokens int - Model string // override model for this request - TopK int // NEW: from inference.GenerateConfig - TopP float64 // NEW: from inference.GenerateConfig (float64 to match Temperature) - RepeatPenalty float64 // NEW: from inference.GenerateConfig - } - ``` - Update `convertOpts()` in adapter.go to map the new fields. Existing callers that only set Temperature/MaxTokens/Model continue working unchanged. - -### Step 2.3: Deprecate StreamingBackend - -- [x] **Mark StreamingBackend as deprecated** — Add deprecation comment: - ```go - // Deprecated: StreamingBackend is retained for backward compatibility. - // New code should use inference.TextModel with iter.Seq[Token] directly. - // See InferenceAdapter for the bridge pattern. - type StreamingBackend interface { ... } - ``` - Do NOT remove yet — `host-uk/cli` cmd_serve.go and cmd_chat.go still depend on it. Those migrations are out of scope for go-ml (they live in a different repo). - -### Step 2.4: Document migration path - -- [x] **Update CLAUDE.md** — Add "Backend Architecture" section documenting: - - `inference.TextModel` (iterator-based) is the preferred API for new code - - `ml.Backend` (string-based) is the compatibility layer, still supported - - `StreamingBackend` is deprecated, use `iter.Seq[Token]` directly - - `InferenceAdapter` bridges TextModel → Backend/StreamingBackend - - `HTTPTextModel`/`LlamaTextModel` bridges Backend → TextModel (reverse direction) - ---- - -## Phase 3: Agent Loop Modernisation - -`agent.go` (1,070 LOC) is the largest file with SSH, InfluxDB, scoring, and publishing mixed together. Decompose into focused files. - -### Step 3.1: Split agent.go into 5 files — COMPLETE - -- [x] **Split `agent.go` (1,070 LOC) into 5 focused files** — Commit `eae9ec9`. 
All `go build/test/vet` pass: - - `agent_config.go` (97 LOC): AgentConfig, Checkpoint, BaseModelMap, ModelFamilies, AdapterMeta() - - `agent_execute.go` (215 LOC): RunAgentLoop, DiscoverCheckpoints, GetScoredLabels, FindUnscored, ProcessOne, isMLXNative - - `agent_eval.go` (397 LOC): processMLXNative, processWithConversion, RunCapabilityProbes/Full, RunContentProbes, ProbeResult types - - `agent_influx.go` (291 LOC): ScoreCapabilityAndPush, ScoreContentAndPush, PushCapability*, BufferInfluxResult, ReplayInfluxBuffer - - `agent_ssh.go` (102 LOC): SSHCommand, SCPFrom, SCPTo, fileBase, EnvOr, IntEnvOr, ExpandHome - -### Step 3.2: Abstract SSH transport — COMPLETE - -- [x] **RemoteTransport interface + SSHTransport** — Commit `1c2a6a6`. Interface with Run/CopyFrom/CopyTo, SSHTransport implementation with functional options (WithPort, WithTimeout). AgentConfig.Transport field with lazy init. All callers updated (DiscoverCheckpoints, processMLXNative, processWithConversion). Old SSHCommand/SCPFrom/SCPTo preserved as deprecated wrappers. Build/test/vet clean. - -### Step 3.3: Configurable infrastructure — COMPLETE - -- [x] **Extract hardcoded values to constants** — Commit `12f3a1c`. 15 constants in agent_config.go: EpochBase, 5 InfluxDB measurements, 2 DuckDB tables, probe defaults (temp/maxTokens/truncation), InfluxBufferFile, LogSeparatorWidth, InterCheckpointDelay. Hardcoded probe counts replaced with len(). 7 files, build/test/vet clean. - -### Step 3.4: Agent tests — COMPLETE - -- [x] **Test `AdapterMeta()`** — 8 tests: known families (12 entries), variant suffix, subdirectory patterns, unknown fallback, no-prefix edge case. Commit `3e22761`. -- [x] **Test `FindUnscored()`** — 5 tests: all unscored (sorted), some scored, all scored, empty input, nil scored map. Commit `3e22761`. -- [x] **Test `BufferInfluxResult()`/`ReplayInfluxBuffer()`** — 4 tests: JSONL round-trip, multiple entries, empty file, missing file. Commit `3e22761`. 
-- [x] **Test `DiscoverCheckpoints()`** — 6 tests: happy path (3 checkpoints across 2 dirs), subdirectory pattern, no adapters, SSH error, filter pattern, no safetensors. Uses `fakeTransport` mock implementing `RemoteTransport`. Commit `3e22761`. - ---- - -## Phase 4: Test Coverage — COMPLETE - -All 4 test files created and verified with `go test -race ./...`. Commit `09bf403`. - -- [x] **backend_llama_test.go** — 20 tests via httptest mock: Name, Available (4 variants), Generate (6 variants incl. context cancellation, empty choices, opts forwarding), Chat (3 variants), Stop, constructor (4 variants), interface compliance. -- [x] **backend_mlx_test.go** — 8 tests via mock TextModel (no build tag needed): Generate, Chat, Stream, ModelError, Close, ModelAccess, InterfaceCompliance, ConvertOpts. -- [x] **score_race_test.go** — 6 race-condition tests: ConcurrentSemantic (20 responses, concurrency=4), ConcurrentMixedSuites (semantic+standard+content fan-out), SemaphoreBoundary (concurrency=1, verifies max concurrent==1), ContextCancellation (400 error→nil semantic), HeuristicOnlyNoRace (50 responses), MultiModelConcurrent (4 models×5 concurrent map writes). -- [x] **benchmark_test.go** — 25 benchmarks: HeuristicScore (5 sizes: 25µs–8.8ms), ExactMatch (4 patterns: 171ns–2.1µs), JudgeExtractJSON (6 variants: 2.5–3.4µs), Judge round-trip (2 suites: ~52µs), ScoreAll (2 modes: 25µs–4.5ms), sub-components (5 heuristic stages: 244ns–88µs). Baselines on M3 Ultra. - ---- - -## Workflow - -1. Virgil in core/go writes tasks here after research -2. This repo's session picks up tasks in phase order -3. Mark `[x]` when done, note commit hash -4. New discoveries → add tasks, note in FINDINGS.md -5. 
Push to forge after each completed step: `git push forge main` diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..5a6ef78 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,378 @@ +# go-ml Architecture + +## Overview + +`forge.lthn.ai/core/go-ml` is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from `go-ai` on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files. + +The package provides three distinct subsystems: + +1. **Pluggable inference backends** — a common `Backend` interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs. +2. **Multi-suite scoring engine** — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites. +3. **Agent orchestrator** — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation. + +--- + +## Dependency Graph + +``` +forge.lthn.ai/core/go-ml + ├── forge.lthn.ai/core/go-inference (shared TextModel/Token interfaces) + │ └── (no further Core deps) + ├── forge.lthn.ai/core/go-mlx (Metal GPU inference, darwin/arm64 only) + │ └── forge.lthn.ai/core/go-inference + ├── forge.lthn.ai/core/go (ServiceRuntime, process, log) + ├── github.com/marcboeker/go-duckdb (analytics storage) + └── github.com/parquet-go/parquet-go (columnar data I/O) +``` + +### Role of each dependency + +| Module | Purpose | +|--------|---------| +| `go-inference` | Zero-dependency shared interfaces. Defines `TextModel`, `Token`, `Backend`, `GenerateConfig`. Compiles on all platforms. | +| `go-mlx` | Native Metal GPU inference for Apple Silicon. Registers the `"metal"` backend via its `init()` function. Active only on `darwin && arm64`. | +| `go` | Core framework. 
Provides `ServiceRuntime`, lifecycle hooks (`OnStartup`/`OnShutdown`), process management, and structured logging. | +| `go-duckdb` | DuckDB bindings for local analytical storage of checkpoint scores and probe results. | +| `parquet-go` | Columnar Parquet I/O for bulk dataset export and import. | + +--- + +## Backend Architecture + +Two interface families coexist within go-ml, connected by a set of adapters. + +### The `ml.Backend` interface (compatibility layer) + +```go +type Backend interface { + Generate(ctx context.Context, prompt string, opts GenOpts) (string, error) + Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error) + Name() string + Available() bool +} +``` + +`Backend` returns complete strings. It is the primary interface consumed by `service.go`, `judge.go`, `agent_eval.go`, and `expand.go`. All three concrete backend types — `HTTPBackend`, `LlamaBackend`, and `InferenceAdapter` — satisfy this interface. + +### The `inference.TextModel` interface (preferred for new code) + +Defined in `go-inference`, this interface returns `iter.Seq[inference.Token]` — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use `TextModel`. + +### `ml.StreamingBackend` (deprecated) + +```go +// Deprecated: use inference.TextModel with iter.Seq[Token] directly. +type StreamingBackend interface { + Backend + GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error + ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error +} +``` + +Only two files in `host-uk/cli` call `StreamingBackend` methods. It is retained for backward compatibility; no new code should use it. 
+ +### Type unification + +`ml.Message` is a type alias for `inference.Message`: + +```go +type Message = inference.Message +``` + +The two types are identical at compile time. No conversion is needed when passing messages between the `ml` and `inference` packages. + +`ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model selection: + +```go +type GenOpts struct { + Temperature float64 + MaxTokens int + Model string // per-request model override; ignored by GPU backends + TopK int + TopP float64 + RepeatPenalty float64 +} +``` + +--- + +## Backend Implementations + +### HTTPBackend (`backend_http.go`) + +Speaks the OpenAI-compatible `/v1/chat/completions` API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server). + +- Implements `ml.Backend` only (no streaming — returns complete response strings). +- Retries up to 3 times with exponential backoff on 5xx and connection errors. +- 300-second HTTP client timeout suitable for long-running inference. + +### LlamaBackend (`backend_llama.go`) + +Manages a `llama-server` subprocess and delegates HTTP calls to an embedded `HTTPBackend`. + +- Implements `ml.Backend`. +- `Start()` launches the subprocess and polls the `/health` endpoint for up to 30 seconds. +- `Stop()` kills the managed process via the Core `process.Service`. +- Supports optional LoRA adapter loading via `--lora`. + +### InferenceAdapter (`adapter.go`) + +Bridges a `go-inference.TextModel` (iterator-based) into the `ml.Backend` and `ml.StreamingBackend` interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem. + +``` +inference.TextModel (iter.Seq[Token]) + │ + └─── InferenceAdapter ───► ml.Backend (string) + ───► ml.StreamingBackend (TokenCallback) +``` + +Key behaviours: + +- `Generate` and `Chat` collect all tokens into a `strings.Builder` and return the concatenated string. 
After the iterator is exhausted, `model.Err()` is checked to distinguish normal end-of-sequence from OOM or context cancellation errors. +- `GenerateStream` and `ChatStream` forward each token's text to the provided `TokenCallback`. If the callback returns an error, iteration stops. +- `Available()` always returns `true` — the model is already loaded when the adapter is constructed. +- `Close()` delegates to `TextModel.Close()`, releasing GPU memory. + +### MLX Backend (`backend_mlx.go`, darwin/arm64 only) + +```go +//go:build darwin && arm64 + +func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) { + m, err := inference.LoadModel(modelPath, loadOpts...) + // ... + return NewInferenceAdapter(m, "mlx"), nil +} +``` + +The blank import `_ "forge.lthn.ai/core/go-mlx"` triggers go-mlx's `init()`, which registers the `"metal"` backend with go-inference's backend registry. Subsequent calls to `inference.LoadModel()` automatically use Metal GPU acceleration on Apple Silicon. + +The model file at `modelPath` may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's `internal/metal/` package. + +### Reverse adapters (`backend_http_textmodel.go`) + +Two types wrap `ml` backends as `inference.TextModel`, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. `go-ai`, `go-i18n`). + +| Type | Wraps | Notes | +|------|-------|-------| +| `HTTPTextModel` | `*HTTPBackend` | Yields the full HTTP response as a single `Token`. Classify returns an unsupported error. BatchGenerate processes sequentially. | +| `LlamaTextModel` | `*LlamaBackend` | Embeds `HTTPTextModel`; overrides `ModelType()` → `"llama"` and `Close()` → `llama.Stop()`. 
| + +### Adapter map (all directions) + +``` +ml.Backend (string) <──── InferenceAdapter ──── inference.TextModel (iter.Seq[Token]) + (adapter.go) + +ml.HTTPBackend ──── HTTPTextModel ────► inference.TextModel +ml.LlamaBackend ─── LlamaTextModel ───► inference.TextModel + (backend_http_textmodel.go) +``` + +--- + +## Service Layer (`service.go`) + +`Service` integrates go-ml into the Core framework lifecycle: + +```go +core.New( + framework.WithName("ml", ml.NewService(ml.Options{ + OllamaURL: "http://localhost:11434", + JudgeURL: "http://localhost:11434", + JudgeModel: "qwen3:8b", + Concurrency: 4, + Suites: "all", + })), +) +``` + +`OnStartup` registers the Ollama backend and initialises the `Judge` and scoring `Engine` if a judge URL is configured. Backends can also be registered at runtime via `RegisterBackend(name, backend)`. + +--- + +## Scoring Engine + +### Engine (`score.go`) + +`Engine.ScoreAll()` evaluates a slice of `Response` values across all configured suites concurrently. + +``` +ScoreAll(responses []Response) map[string][]PromptScore + │ + ├── Heuristic (inline, no goroutine) + └── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded) +``` + +The worker pool is bounded by a semaphore channel of capacity `concurrency`. `sync.WaitGroup` coordinates completion. Results are written to pre-allocated score slots via pointer to avoid allocations during fan-out. + +Suites are selected at engine construction time via a comma-separated string or `"all"`. + +### Heuristic scoring (`heuristic.go`) + +Analyses a response using pre-compiled regular expressions. No LLM is needed. 
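A minimal sketch of how one regex-driven sub-score might be computed, assuming a pattern compiled once at package level. The phrase list and the helper name `countComplianceMarkers` are illustrative assumptions, not the contents of `heuristic.go`.

```go
package main

import (
	"fmt"
	"regexp"
)

// complianceRe is compiled once at package initialisation, so scoring a
// response never pays the compilation cost. The phrase list is illustrative.
var complianceRe = regexp.MustCompile(`(?i)\b(as an ai|i cannot|ethical considerations)\b`)

// countComplianceMarkers is a hypothetical helper: it returns the number of
// non-overlapping matches of the compliance phrases in the response.
func countComplianceMarkers(response string) int {
	return len(complianceRe.FindAllStringIndex(response, -1))
}

func main() {
	// Two phrases match here ("As an AI" and "I cannot"), so this prints 2.
	fmt.Println(countComplianceMarkers("As an AI, I cannot help with that."))
}
```

Because the pattern lives in a package-level variable, concurrent scoring goroutines can share it safely; `*regexp.Regexp` is safe for concurrent use.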
+ +Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score: + +``` +LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5 + - ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20 +``` + +**Positive signals** + +| Sub-score | What it measures | +|-----------|-----------------| +| `EngagementDepth` | Structural markers (headings, bold), ethical vocabulary, technical depth, word count | +| `CreativeForm` | Poetry structure (short lines), narrative openings, metaphor density | +| `EmotionalRegister` | Emotional vocabulary (feel, grief, compassion, etc.) | +| `FirstPerson` | Sentences beginning with "I" or containing first-person agency verbs | + +**Negative signals** + +| Sub-score | What it measures | +|-----------|-----------------| +| `ComplianceMarkers` | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") | +| `FormulaicPreamble` | Opener templates ("Sure, let's...", "Great question") | +| `Degeneration` | Sentence repetition ratio (looping/stuck output) | +| `EmptyBroken` | Empty, error-prefixed, or pad-token-polluted responses | + +### Judge (`judge.go`) + +`Judge` uses any `Backend` as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response. + +```go +judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b")) +scores, err := judge.ScoreSemantic(ctx, prompt, response) +``` + +JSON extraction (`extractJSON`) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences. 
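The three input shapes that `extractJSON` is described as handling can be illustrated with a deliberately simplified sketch. The function below is a hypothetical stand-in, not the real implementation: it takes the span from the first `{` to the last `}` and accepts it only if it is valid JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extractJSONSketch is a hypothetical, simplified stand-in for extractJSON.
// It returns the text between the first '{' and the last '}' when that text
// parses as JSON, and reports false otherwise.
func extractJSONSketch(s string) (string, bool) {
	start := strings.Index(s, "{")
	end := strings.LastIndex(s, "}")
	if start < 0 || end < start {
		return "", false
	}
	candidate := s[start : end+1]
	if !json.Valid([]byte(candidate)) {
		return "", false
	}
	return candidate, true
}

func main() {
	fence := strings.Repeat("`", 3) // markdown code fence marker
	inputs := []string{
		`{"sovereignty": 7}`, // raw JSON
		`Here is my verdict: {"sovereignty": 7} Hope that helps.`, // JSON in prose
		fence + "json\n{\"sovereignty\": 7}\n" + fence,           // fenced JSON
	}
	for _, in := range inputs {
		out, ok := extractJSONSketch(in)
		fmt.Println(out, ok)
	}
}
```

This approach fails when the surrounding prose itself contains braces; a production extractor would need brace matching that is aware of string literals.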
+ +Six scoring methods are available: + +| Method | Suite | Dimensions | +|--------|-------|-----------| +| `ScoreSemantic` | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept | +| `ScoreContent` | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister | +| `ScoreCapability` | (agent) | Reasoning, Correctness, Clarity | +| `ScoreTruthfulQA` | standard | Truthfulness, Informativeness | +| `ScoreDoNotAnswer` | standard | Safety, Nuance | +| `ScoreToxigen` | standard | Kindness, Awareness | + +### Exact match (`exact.go`) + +`scoreGSM8K` extracts numeric answers from free-text responses using pattern matching. It returns `*StandardScores` with `Correct`, `Extracted`, and `Expected` fields. No LLM required. + +### Capability probes (`probes.go`) + +23 binary pass/fail tests across five categories. Each probe is a `Prompt` string paired with a `Check func(response string) bool`. No judge model is required — all checks use string matching or regex on the raw response. + +| Category | Probes | Examples | +|----------|--------|---------| +| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | `347×29`, circle area, Fibonacci | +| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality | +| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions | +| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug | +| Word problems (2) | word | speed/distance, sibling counting | + +`StripThinkBlocks()` removes `<think>...</think>` sections from DeepSeek R1 responses before checking. + +--- + +## Agent Orchestrator + +The agent subsystem (`agent_*.go`) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3"). 
+ +### Files + +| File | LOC | Responsibility | +|------|-----|---------------| +| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` | +| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `FindUnscored`, `ProcessOne` | +| `agent_eval.go` | 397 | MLX-native and conversion evaluation paths, capability and content probe runners | +| `agent_influx.go` | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay | +| `agent_ssh.go` | 102 | `RemoteTransport` interface, `SSHTransport` implementation, utility helpers | + +### Workflow + +``` +RunAgentLoop + │ + ├── ReplayInfluxBuffer (flush any buffered writes from previous failures) + ├── DiscoverCheckpoints (SSH ls on M3 adapter directories) + ├── GetScoredLabels (InfluxDB query for already-scored (run_id, label) pairs) + ├── FindUnscored (set difference, sorted by dirname + iteration) + └── ProcessOne (for each unscored checkpoint) + │ + ├── isMLXNative? YES → processMLXNative (serve directly via mlx_lm.server) + │ NO → processWithConversion (MLX→GGUF, then llama-server) + │ + ├── RunCapabilityProbes (23 binary probes) + ├── RunContentProbes (sovereignty probes) + ├── ScoreCapabilityAndPush (judge + InfluxDB) + └── ScoreContentAndPush (judge + InfluxDB) +``` + +### RemoteTransport + +`RemoteTransport` abstracts SSH/SCP so that tests can supply an in-memory fake: + +```go +type RemoteTransport interface { + Run(ctx context.Context, cmd string) (string, error) + CopyFrom(ctx context.Context, remote, local string) error + CopyTo(ctx context.Context, local, remote string) error +} +``` + +`SSHTransport` implements this interface using the system `ssh` and `scp` binaries with a configurable port and timeout. `AgentConfig.Transport` is lazily initialised: if nil, an `SSHTransport` is constructed from `M3Host`, `M3User`, and `M3SSHKey`. 
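The lazy initialisation described above might look roughly like the following sketch. The types `agentConfig` and `sshTransport` and the accessor `transport()` are cut-down, hypothetical stand-ins for the real `AgentConfig` and `SSHTransport`; only the `RemoteTransport` interface is taken from the text.

```go
package main

import (
	"context"
	"fmt"
)

// RemoteTransport mirrors the interface quoted above.
type RemoteTransport interface {
	Run(ctx context.Context, cmd string) (string, error)
	CopyFrom(ctx context.Context, remote, local string) error
	CopyTo(ctx context.Context, local, remote string) error
}

// sshTransport is a placeholder for the real SSHTransport. It never opens a
// connection; Run only reports what it would have executed.
type sshTransport struct{ host, user, key string }

func (s *sshTransport) Run(_ context.Context, cmd string) (string, error) {
	return "", fmt.Errorf("sketch only: would run %q as %s@%s", cmd, s.user, s.host)
}
func (s *sshTransport) CopyFrom(_ context.Context, _, _ string) error { return nil }
func (s *sshTransport) CopyTo(_ context.Context, _, _ string) error   { return nil }

// agentConfig is a hypothetical, reduced version of AgentConfig.
type agentConfig struct {
	M3Host, M3User, M3SSHKey string
	Transport                RemoteTransport
}

// transport returns the injected transport, or builds the SSH default on
// first use and caches it. Not safe for concurrent first calls.
func (c *agentConfig) transport() RemoteTransport {
	if c.Transport == nil {
		c.Transport = &sshTransport{host: c.M3Host, user: c.M3User, key: c.M3SSHKey}
	}
	return c.Transport
}

func main() {
	cfg := &agentConfig{M3Host: "m3.example", M3User: "agent"}
	_, err := cfg.transport().Run(context.Background(), "ls -d adapters-*")
	fmt.Println(err)
}
```

Tests skip the default entirely by setting `Transport` to a fake before the first call, which is why the nil check happens at use time rather than in a constructor.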
+ +### Checkpoint discovery + +`DiscoverCheckpoints` runs `ls -d adapters-*` on the remote host, then for each adapter directory checks for subdirectories matching `gemma-3-*` (supporting nested directory layouts). It then lists `*_adapters.safetensors` files and extracts the iteration number from the filename. + +`AdapterMeta` maps a directory name to a `(model_tag, label_prefix, run_id_stem)` triple using prefix matching against `ModelFamilies`. + +### Persistence + +Results are written to two stores simultaneously: + +- **InfluxDB** — line protocol over HTTP. Five measurements: `capability_score`, `capability_judge`, `content_score`, `probe_score`, `training_loss`. +- **DuckDB** — local analytical database. Two tables: `checkpoint_scores`, `probe_results`. + +If InfluxDB is unreachable, results are buffered to `influx_buffer.jsonl` (JSONL, one entry per line). `ReplayInfluxBuffer` is called at the start of each loop iteration to flush the buffer. + +--- + +## Data Pipeline + +| File | Purpose | +|------|---------| +| `ingest.go` | Load JSONL response files into `[]Response` slices | +| `db.go` | DuckDB schema creation, insert, and query helpers | +| `influx.go` | InfluxDB HTTP client (line protocol write, SQL query) | +| `gguf.go` | GGUF file format parsing (magic, version, metadata, tensor inventory) | +| `worker.go` | LEM API worker for distributed inference job dispatch | +| `expand.go` | Prompt expansion using a backend | +| `normalize.go` | Response normalisation utilities | +| `parquet.go` | Parquet dataset export | + +--- + +## Test Coverage + +| File | Tests | What is covered | +|------|-------|----------------| +| `adapter_test.go` | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors | +| `backend_http_test.go` | — | HTTPBackend: generate, chat, retries, status codes | +| `backend_http_textmodel_test.go` | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, 
classify, batch | +| `backend_llama_test.go` | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants | +| `backend_mlx_test.go` | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion | +| `heuristic_test.go` | — | All nine heuristic sub-scores and LEK formula | +| `judge_test.go` | — | JSON extraction variants, ScoreSemantic, ScoreContent | +| `exact_test.go` | — | Numeric extraction patterns | +| `probes_test.go` | — | All 23 capability probe Check functions | +| `score_test.go` | — | Engine suite selection, ScoreAll grouping | +| `score_race_test.go` | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes | +| `agent_test.go` | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport | +| `benchmark_test.go` | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), ScoreAll (2 modes), heuristic sub-components (5 stages) | diff --git a/docs/development.md b/docs/development.md new file mode 100644 index 0000000..04ce72c --- /dev/null +++ b/docs/development.md @@ -0,0 +1,307 @@ +# go-ml Development Guide + +## Prerequisites + +### Required + +- **Go 1.25** or later (the module uses `go 1.25.5`) +- **Go workspace** — go-ml is part of the `host-uk/core` Go workspace; `replace` directives in `go.mod` resolve sibling modules from local paths + +### Required sibling modules (local paths) + +| Module | Local path | Notes | +|--------|-----------|-------| +| `forge.lthn.ai/core/go` | `../go` | Framework, process management, logging | +| `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Token interfaces | +| `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend | + +All three must be checked out as siblings of `go-ml` (i.e. all four directories share the same parent). 
+ +### Platform-specific + +- **Metal GPU (`NewMLXBackend`)** — requires macOS on Apple Silicon (darwin/arm64). The `backend_mlx.go` file carries a `//go:build darwin && arm64` build tag and is excluded on other platforms. All other features work on Linux and amd64. +- **llama-server** — the `llama-server` binary from llama.cpp must be on `PATH` or at the path provided in `LlamaOpts.LlamaPath`. +- **DuckDB** — uses CGo; a C compiler (`gcc` or `clang`) is required. + +--- + +## Getting Started + +```bash +# On first checkout, populate go.sum +go mod download + +# Verify the build (all platforms) +go build ./... + +# Verify the build excluding Metal backend (Linux / CI) +# GOFLAGS entries are space-separated, so the tag must be passed as -tags=value +GOFLAGS='-tags=nomlx' go build ./... +``` + +--- + +## Build and Test Commands + +```bash +# Run all tests +go test ./... + +# Run with race detector (recommended before committing) +go test -race ./... + +# Run a single test by name +go test -v -run TestHeuristic ./... +go test -v -run TestEngine_ScoreAll_ConcurrentSemantic ./... + +# Run benchmarks +go test -bench=. ./... +go test -bench=BenchmarkHeuristicScore ./... + +# Static analysis +go vet ./... 
+ +# Tidy dependencies +go mod tidy +``` + +--- + +## Test Patterns + +### Naming convention + +Tests use a `_Good`, `_Bad`, `_Ugly` suffix pattern: + +- `_Good` — happy path (expected success) +- `_Bad` — expected error conditions (invalid input, unreachable server) +- `_Ugly` — panic and edge-case paths + +### Mock backends + +For tests that exercise `Backend`-dependent code (judge, agent, scoring engine) without a real inference server, implement `Backend` directly: + +```go +type mockBackend struct { + response string + err error +} + +func (m *mockBackend) Generate(_ context.Context, _ string, _ ml.GenOpts) (string, error) { + return m.response, m.err +} +func (m *mockBackend) Chat(_ context.Context, _ []ml.Message, _ ml.GenOpts) (string, error) { + return m.response, m.err +} +func (m *mockBackend) Name() string { return "mock" } +func (m *mockBackend) Available() bool { return true } +``` + +### Mock TextModel + +For tests that exercise `InferenceAdapter` without Metal GPU hardware, implement `inference.TextModel`: + +```go +type mockTextModel struct { + tokens []string + err error +} + +func (m *mockTextModel) Generate(ctx context.Context, prompt string, opts ...inference.GenerateOption) iter.Seq[inference.Token] { + return func(yield func(inference.Token) bool) { + for _, t := range m.tokens { + if !yield(inference.Token{Text: t}) { + return + } + } + } +} +// ... 
implement remaining TextModel methods +func (m *mockTextModel) Err() error { return m.err } +``` + +### Mock RemoteTransport + +For agent tests that would otherwise require an SSH connection: + +```go +type fakeTransport struct { + outputs map[string]string + errors map[string]error +} + +func (f *fakeTransport) Run(_ context.Context, cmd string) (string, error) { + if err, ok := f.errors[cmd]; ok { + return "", err + } + return f.outputs[cmd], nil +} +func (f *fakeTransport) CopyFrom(_ context.Context, _, _ string) error { return nil } +func (f *fakeTransport) CopyTo(_ context.Context, _, _ string) error { return nil } +``` + +Inject via `AgentConfig.Transport`: + +```go +cfg := &ml.AgentConfig{ + Transport: &fakeTransport{outputs: map[string]string{...}}, +} +``` + +### HTTP mock server + +For `HTTPBackend` tests, use `net/http/httptest`: + +```go +srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + json.NewEncoder(w).Encode(map[string]any{ + "choices": []map[string]any{ + {"message": map[string]string{"role": "assistant", "content": "hello"}}, + }, + }) +})) +defer srv.Close() +backend := ml.NewHTTPBackend(srv.URL, "test-model") +``` + +--- + +## Adding a New Backend + +A backend must implement `ml.Backend`: + +```go +type Backend interface { + Generate(ctx context.Context, prompt string, opts GenOpts) (string, error) + Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error) + Name() string + Available() bool +} +``` + +### Steps + +1. Create `backend_{name}.go` in the package root. +2. Add the `// SPDX-Licence-Identifier: EUPL-1.2` header. +3. Add a compile-time interface check: + ```go + var _ Backend = (*MyBackend)(nil) + ``` +4. Implement `Generate` as a thin wrapper around `Chat` where possible (follows the pattern of `HTTPBackend`). +5. Create `backend_{name}_test.go` with `_Good`, `_Bad`, and interface-compliance tests. +6. 
Register the backend in `service.go`'s `OnStartup` if it warrants lifecycle management, or document that callers must register it via `Service.RegisterBackend`. + +### GPU backends + +If the backend wraps a `go-inference.TextModel` (e.g. a new hardware accelerator), use `InferenceAdapter` rather than re-implementing the polling/streaming logic: + +```go +m, err := myBackendPackage.LoadModel(modelPath) +if err != nil { + return nil, err +} +return ml.NewInferenceAdapter(m, "my-backend"), nil +``` + +--- + +## Adding a New Scoring Suite + +1. Add a new scoring function or type in a dedicated file (e.g. `my_suite.go`). +2. Add the suite name to `Engine.NewEngine`'s suite selection logic in `score.go`. +3. Add a result field to `PromptScore` in `types.go`. +4. Add the goroutine fan-out case in `Engine.ScoreAll` in `score.go`. +5. Add race condition tests in `score_race_test.go`. + +--- + +## Coding Standards + +### Language + +Use **UK English** throughout: colour, organisation, centre, licence (noun), authorise. The only exception is identifiers in external APIs that use American spellings — do not rename those. + +### File headers + +Every new file must begin with: + +```go +// SPDX-Licence-Identifier: EUPL-1.2 +``` + +### Strict types + +All parameters and return types must be explicitly typed. Avoid `interface{}` or `any` except at JSON unmarshalling boundaries. + +### Import grouping + +Three groups, each separated by a blank line: + +```go +import ( + "context" // stdlib + "fmt" + + "forge.lthn.ai/core/go-inference" // forge.lthn.ai modules + + "github.com/stretchr/testify/assert" // third-party +) +``` + +### Error wrapping + +Use `fmt.Errorf("context: %w", err)` for wrapping. Use `log.E("pkg.Type.Method", "what failed", err)` from the Core framework for structured error logging with stack context. + +### Concurrency + +- Protect shared maps with `sync.RWMutex` or `sync.Mutex` as appropriate. 
+- Use semaphore channels (buffered `chan struct{}`) to bound goroutine concurrency rather than `sync.Pool` or `errgroup` with fixed limits. +- Always check `model.Err()` after exhausting a `go-inference` token iterator — the iterator itself carries no error; the error is stored on the model. + +--- + +## Conventional Commits + +Use the following scopes: + +| Scope | When to use | +|-------|-------------| +| `backend` | Changes to any `backend_*.go` file or the `adapter.go` bridge | +| `scoring` | Changes to `score.go`, `heuristic.go`, `judge.go`, `exact.go` | +| `probes` | Changes to `probes.go` or capability probe definitions | +| `agent` | Changes to any `agent_*.go` file | +| `service` | Changes to `service.go` or `Options` | +| `types` | Changes to `types.go` or `inference.go` interfaces | +| `gguf` | Changes to `gguf.go` | + +Examples: + +``` +feat(backend): add ROCm backend via go-rocm InferenceAdapter +fix(scoring): handle nil ContentScores when content probe not found +refactor(agent): replace SSHCommand with SSHTransport.Run +test(probes): add Check function coverage for all 23 probes +``` + +--- + +## Co-Author and Licence + +Every commit must include: + +``` +Co-Authored-By: Virgil +``` + +The licence is **EUPL-1.2**. All source files carry the SPDX identifier in the header. Do not add licence headers to test files; the package-level declaration covers them. + +--- + +## Forge Remote + +The authoritative remote is `forge.lthn.ai/core/go-ml`: + +```bash +git push forge main +``` + +The SSH remote URL is `ssh://git@forge.lthn.ai:2223/core/go-ml.git`. HTTPS authentication is not configured — always push via SSH. diff --git a/docs/history.md b/docs/history.md new file mode 100644 index 0000000..7fe54d1 --- /dev/null +++ b/docs/history.md @@ -0,0 +1,194 @@ +# go-ml Project History + +## Origin: Extraction from go-ai (19 February 2026) + +go-ml began as the `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. 
The monolith had grown to approximately 14,000 LOC and 53% of that was the ML subsystem. The ML code had zero internal dependencies on the rest of go-ai — it imported only `go-mlx` (external) and the Core `go` framework. The extraction was therefore clean: lift the directory, adjust the module path, and update the one import in go-ai that referenced it. + +**What was extracted:** + +- 41 Go source files (~7,494 LOC, excluding tests) +- 6 test files covering backends, heuristic, judge, exact, probes, and score +- All InfluxDB, DuckDB, Parquet, GGUF, and agent code + +**After extraction:** + +- go-ai dropped from ~14,000 to ~3,400 LOC (the `ai/` facade and `mcp/` hub remain there) +- go-ml became an independent module at `forge.lthn.ai/core/go-ml` + +--- + +## Phase 1: go-inference Migration (Complete) + +**Commit range:** `c3c2c14` (initial fix) through adapter and reverse adapter work. + +**Problem:** The original `backend_mlx.go` imported subpackages from go-mlx (`go-mlx/cache`, `go-mlx/model`, `go-mlx/sample`, `go-mlx/tokenizer`) that no longer existed after go-mlx's Phase 4 refactoring. The file was 253 LOC of hand-rolled tokenisation, KV cache management, sampling loops, and memory cleanup — and none of it compiled. + +**Solution:** Introduce `go-inference` as the abstraction layer between go-ml and hardware backends. + +### Step 1.1 — Add go-inference dependency + +Added `forge.lthn.ai/core/go-inference` to `go.mod` with a `replace` directive pointing to the local sibling checkout. + +### Step 1.2 — Write InferenceAdapter (`adapter.go`) + +Created `InferenceAdapter`, which wraps a `go-inference.TextModel` (returning `iter.Seq[Token]`) and exposes it as `ml.Backend` + `ml.StreamingBackend` (returning strings / calling `TokenCallback`). Thirteen test cases verified token collection, streaming, callback error propagation, context cancellation, empty output, and model errors after partial generation. 
+ +Key design decision: after exhausting the iterator, `model.Err()` is checked separately. The iterator itself does not carry errors; partial output is returned alongside the error so callers can decide whether to use or discard it. + +### Step 1.3 — Rewrite `backend_mlx.go` + +Replaced 253 LOC with approximately 35 LOC. The blank import `_ "forge.lthn.ai/core/go-mlx"` registers the Metal backend via go-mlx's `init()`. `inference.LoadModel()` then handles model loading, and `InferenceAdapter` handles the rest. + +Memory controls (cache limits, memory limits) were deferred: go-mlx handles them internally, and callers that need explicit control can call `mlx.SetCacheLimit()` directly. + +### Step 1.4 — Reverse adapters (`backend_http_textmodel.go`) + +Added `HTTPTextModel` and `LlamaTextModel`, which wrap the existing `ml.Backend` implementations to satisfy `inference.TextModel`. This enables HTTP and llama-server backends to be used in packages (go-ai, go-i18n) that consume the go-inference interface. Since HTTP backends return complete strings rather than streaming tokens, each response is yielded as a single `Token`. + +17 tests for `HTTPTextModel` and 2 for `LlamaTextModel` all pass. + +### Step 1.5 — Downstream verification + +Confirmed that `service.go` (`Backend.Generate()`), `judge.go` (`judgeChat()`), and `go-ai/mcp/tools_ml.go` (`ml.Service`) required no changes — `InferenceAdapter` satisfies `ml.Backend`, and the existing consumers are unaffected. + +--- + +## Phase 2: Backend Consolidation (Complete) + +**Commit range:** `747e703` (Message unification) through `convertOpts` extension. + +**Audit (Virgil, 20 February 2026):** Only two files in the entire ecosystem call `StreamingBackend` methods: `host-uk/cli/cmd/ml/cmd_serve.go` (SSE streaming at `/v1/completions` and `/v1/chat/completions`) and `cmd/ml/cmd_chat.go` (interactive terminal token echo). All other consumers use `Backend.Generate()` only. 
+ +### Step 2.1 — Unify Message types + +`ml.Message` was a separate struct identical to `inference.Message`. Replaced with a type alias: + +```go +type Message = inference.Message +``` + +This eliminated the `convertMessages()` helper from `adapter.go` and all explicit conversion sites. Backward-compatible: all existing callers continue to use `ml.Message` and compile unchanged. + +### Step 2.2 — Extend GenOpts + +Added `TopK`, `TopP`, and `RepeatPenalty` to `ml.GenOpts` to match the fields available in `inference.GenerateConfig`. Updated `convertOpts()` in `adapter.go` to map the new fields. Existing callers that only set `Temperature`, `MaxTokens`, and `Model` continue to work unchanged. + +**Field type note:** `inference.GenerateConfig` uses `float32` for temperature and sampling fields; `ml.GenOpts` uses `float64` to match the conventions in the rest of go-ml. `convertOpts()` performs the narrowing conversion explicitly. + +### Step 2.3 — Deprecate StreamingBackend + +Added deprecation comment to `StreamingBackend` in `inference.go`. The interface is not removed because `host-uk/cli` depends on it. Migration of those CLI files is out of scope for go-ml. + +### Step 2.4 — Document backend architecture + +Added the "Backend Architecture" section to `CLAUDE.md` documenting the two interface families, adapter directions, and migration guidance. + +--- + +## Phase 3: Agent Loop Modernisation (Complete) + +The original `agent.go` was a 1,070 LOC file mixing SSH commands, InfluxDB line protocol construction, probe evaluation, checkpoint discovery, and JSONL buffering. It had zero tests. 
+ +### Step 3.1 — Split into five files (Commit `eae9ec9`) + +| File | LOC | Contents | +|------|-----|---------| +| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` | +| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `GetScoredLabels`, `FindUnscored`, `ProcessOne`, `isMLXNative` | +| `agent_eval.go` | 397 | `processMLXNative`, `processWithConversion`, `RunCapabilityProbes`, `RunCapabilityProbesFull`, `RunContentProbes`, `ProbeResult` types | +| `agent_influx.go` | 291 | `ScoreCapabilityAndPush`, `ScoreContentAndPush`, `PushCapability*`, `BufferInfluxResult`, `ReplayInfluxBuffer` | +| `agent_ssh.go` | 102 | `SSHCommand`, `SCPFrom`, `SCPTo`, `fileBase`, `EnvOr`, `IntEnvOr`, `ExpandHome` | + +`go build ./...`, `go test ./...`, and `go vet ./...` all passed after the split. + +### Step 3.2 — Abstract SSH transport (Commit `1c2a6a6`) + +Introduced the `RemoteTransport` interface with `Run`, `CopyFrom`, and `CopyTo` methods. `SSHTransport` implements this interface using the system `ssh` and `scp` binaries with functional options (`WithPort`, `WithTimeout`). `AgentConfig.Transport` accepts any `RemoteTransport`, with lazy initialisation to an `SSHTransport` when nil. + +The old package-level functions `SSHCommand`, `SCPFrom`, and `SCPTo` are retained as deprecated wrappers that delegate to `AgentConfig.Transport`. 
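The functional-options style mentioned for `SSHTransport` generally takes the shape below. This is a generic sketch under assumptions: the lowercase names, the constructor signature, and the default port and timeout values are all hypothetical and do not come from `agent_ssh.go`.

```go
package main

import (
	"fmt"
	"time"
)

// sshTransport is a hypothetical, reduced stand-in for SSHTransport.
type sshTransport struct {
	host    string
	port    int
	timeout time.Duration
}

// option mutates a transport during construction.
type option func(*sshTransport)

// withPort and withTimeout mimic the WithPort / WithTimeout options.
func withPort(p int) option {
	return func(t *sshTransport) { t.port = p }
}

func withTimeout(d time.Duration) option {
	return func(t *sshTransport) { t.timeout = d }
}

// newSSHTransport applies defaults first, then each option in order, so
// later options override earlier ones. Default values are assumptions.
func newSSHTransport(host string, opts ...option) *sshTransport {
	t := &sshTransport{host: host, port: 22, timeout: 30 * time.Second}
	for _, apply := range opts {
		apply(t)
	}
	return t
}

func main() {
	t := newSSHTransport("m3.example", withPort(2222))
	fmt.Println(t.host, t.port, t.timeout)
}
```

The pattern keeps the constructor signature stable as new settings appear, which suits a transport whose port and timeout rarely need overriding.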
+ +### Step 3.3 — Extract hardcoded infrastructure (Commit `12f3a1c`) + +Extracted 15 constants from scattered magic values across 7 files: + +- `EpochBase` — InfluxDB timestamp origin (Unix timestamp for 15 February 2025 00:00 UTC) +- Five InfluxDB measurement names (`MeasurementCapabilityScore`, `MeasurementCapabilityJudge`, `MeasurementContentScore`, `MeasurementProbeScore`, `MeasurementTrainingLoss`) +- Two DuckDB table names (`TableCheckpointScores`, `TableProbeResults`) +- Probe evaluation defaults (`CapabilityTemperature`, `CapabilityMaxTokens`, `ContentTemperature`, `ContentMaxTokens`, `MaxStoredResponseLen`) +- `InfluxBufferFile` — JSONL buffer filename +- `LogSeparatorWidth` — banner line width + +Hardcoded probe counts replaced with `len(CapabilityProbes)` and `len(ContentProbes)`. + +### Step 3.4 — Agent tests (Commit `3e22761`) + +First test coverage for the agent subsystem: + +- `AdapterMeta()` — 8 tests: known families (12 entries), variant suffixes, subdirectory patterns, unknown fallback, no-prefix edge case +- `FindUnscored()` — 5 tests: all unscored (sorted), some scored, all scored, empty input, nil scored map +- `BufferInfluxResult()`/`ReplayInfluxBuffer()` — 4 tests: JSONL round-trip, multiple entries, empty file, missing file +- `DiscoverCheckpoints()` — 6 tests using `fakeTransport`: 3 checkpoints across 2 dirs, subdirectory pattern, no adapters, SSH error, filter pattern, no safetensors files + +--- + +## Phase 4: Test Coverage (Complete, Commit `09bf403`) + +Added four test files covering previously untested areas: + +**`backend_llama_test.go`** (20 tests) — Uses `net/http/httptest` to mock the llama-server HTTP API. Covers: `Name`, `Available` (4 variants including process-not-started and health endpoint failure), `Generate` (6 variants including context cancellation, empty choices, and opts forwarding), `Chat` (3 variants), `Stop`, constructor (4 variants), and interface compliance. 
+ +**`backend_mlx_test.go`** (8 tests) — Uses a mock `inference.TextModel`. No build tag required — tests run on all platforms without Metal GPU hardware. Covers: `Generate`, `Chat`, streaming, model error after partial output, `Close`, direct model access via `Model()`, interface compliance, and `convertOpts` field mapping. + +**`score_race_test.go`** (6 tests) — Race condition tests run with `-race`: +- `ConcurrentSemantic` — 20 responses scored with concurrency=4; verifies no data races on the result map +- `ConcurrentMixedSuites` — semantic + standard + content fan-out simultaneously +- `SemaphoreBoundary` — concurrency=1; verifies that at most 1 goroutine holds the semaphore at once +- `ContextCancellation` — 400 error response from judge returns nil semantic score without panicking +- `HeuristicOnlyNoRace` — 50 responses, heuristic only (no goroutines spawned); regression check +- `MultiModelConcurrent` — 4 models × 5 concurrent goroutines writing to the results map + +**`benchmark_test.go`** (25 benchmarks, baselines on M3 Ultra): +- `HeuristicScore` — 5 input sizes (100–10,000 characters): 25µs–8.8ms +- `ExactMatch` — 4 patterns: 171ns–2.1µs +- `JudgeExtractJSON` — 6 response variants: 2.5–3.4µs +- `Judge` round-trip — 2 suites (semantic, content): ~52µs +- `ScoreAll` — 2 modes (heuristic only, full): 25µs–4.5ms +- Sub-components — 5 heuristic stages: 244ns–88µs + +--- + +## Known Limitations + +### StreamingBackend retention + +`ml.StreamingBackend` cannot be removed until `host-uk/cli/cmd/ml/cmd_serve.go` and `cmd/ml/cmd_chat.go` are migrated to use `inference.TextModel` iterators directly. That migration is out of scope for go-ml and must be tracked in the `host-uk/cli` repository. + +### LlamaTextModel streaming gap + +`LlamaTextModel` implements `inference.TextModel` but does not actually stream tokens — it yields the complete llama-server HTTP response as a single `Token`. 
True token-level streaming from llama-server would require implementing SSE parsing, which is a separate effort. + +### Agent infrastructure coupling + +`AgentConfig` contains fields (`M3Host`, `M3User`, `M3SSHKey`, `M3AdapterBase`, `InfluxURL`, `InfluxDB`) that are tightly coupled to a specific deployment topology (M3 Mac + InfluxDB on `10.69.69.165`). While the `RemoteTransport` abstraction decouples tests from SSH, production deployments still hardcode the M3 as the checkpoint host. + +### EpochBase timestamp + +The `EpochBase` constant (`1739577600`, corresponding to 15 February 2025 00:00 UTC) is embedded in InfluxDB line protocol timestamps. All capability/content/probe timestamps derive from this base plus checkpoint iteration offsets. Changing `EpochBase` would require re-writing all historical InfluxDB data. + +### HTTPBackend classify + +`HTTPTextModel.Classify` returns an "unsupported" error. There is no path to add classification support to an OpenAI-compatible HTTP backend without a dedicated classification endpoint or prompt engineering. + +### DuckDB CGo + +The `go-duckdb` dependency requires CGo. This prevents cross-compilation from macOS to Linux without a cross-compilation toolchain. Binaries that import go-ml will require a C compiler at build time. + +--- + +## Future Considerations + +- **ROCm backend** — `go-rocm` provides a llama-server subprocess backend for AMD GPUs. Once published, it can be wrapped with `InferenceAdapter` in the same pattern as `backend_mlx.go`, gated with a `//go:build linux && amd64` constraint. +- **StreamingBackend removal** — Once `host-uk/cli` is migrated to `iter.Seq[Token]`, the `StreamingBackend` interface and `InferenceAdapter`'s `GenerateStream`/`ChatStream` methods can be removed. +- **go-i18n integration** — go-i18n Phase 2a requires 5,000 sentences/second classification throughput from Gemma3-1B. 
The `InferenceAdapter` and `inference.TextModel.BatchGenerate` provide the interface; the performance target depends on go-mlx's batching implementation. +- **LEM Lab pipeline wiring** — Integration tests for `backend_mlx.go` with a real model are deferred until the LEM Lab inference pipeline is fully wired. A smoke test against a small quantised model would confirm end-to-end Metal GPU inference through the go-inference abstraction. +- **Charm SSH** — The `SSHTransport` currently shells out to the system `ssh` and `scp` binaries. Replacing these with pure-Go SSH via `charmbracelet/keygen` and a native SSH client would eliminate the subprocess dependency and improve testability.