diff --git a/CLAUDE.md b/CLAUDE.md
index 8c26483..11c8227 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,72 +1,162 @@
-# CLAUDE.md
+# CLAUDE.md — go-ml Domain Expert Guide
-## What This Is
+You are a dedicated domain expert for `forge.lthn.ai/core/go-ml`. Virgil (in core/go) orchestrates your work via TODO.md. Pick up tasks in phase order, mark `[x]` when done, commit and push.
-ML inference backends, scoring engine, and agent orchestrator. Module: `forge.lthn.ai/core/go-ml`
+## What This Package Does
-Provides pluggable inference backends (MLX/Metal, llama.cpp, HTTP/Ollama), a multi-suite scoring engine with ethics-aware probes, GGUF model management, and a concurrent worker pipeline for batch evaluation.
+ML inference backends, scoring engine, and agent orchestrator. 7.5K LOC across 41 Go files. Provides:
+
+- **Pluggable inference backends** — MLX/Metal (darwin/arm64), llama.cpp (subprocess), HTTP/Ollama (OpenAI-compatible)
+- **Multi-suite scoring engine** — Heuristic (regex), semantic (LLM judge), content (sovereignty probes), standard benchmarks (TruthfulQA, DoNotAnswer, Toxigen, GSM8K)
+- **23 capability probes** — Binary pass/fail tests across 16 categories (math, logic, code, etc.)
+- **GGUF model management** — Format parsing, conversion, inventory
+- **Agent orchestrator** — SSH checkpoint discovery, InfluxDB streaming, batch evaluation
+
+## Critical Context: go-inference Migration
+
+**This is the #1 priority.** Phase 1 in TODO.md.
+
+The package currently defines its own `Backend` interface that returns `(string, error)`. The shared `go-inference` package defines `TextModel`, which returns `iter.Seq[Token]` (Go 1.23+ range-over-func). Everything downstream is blocked until go-ml bridges these two interfaces.
+
+### Interface Gap
+
+```
+go-ml (CURRENT)                          go-inference (TARGET)
+─────────────────                        ─────────────────────
+Backend.Generate(ctx, prompt, GenOpts)   TextModel.Generate(ctx, prompt, ...GenerateOption)
+  → (string, error)                        → iter.Seq[Token]
+
+Backend.Chat(ctx, messages, GenOpts)     TextModel.Chat(ctx, messages, ...GenerateOption)
+  → (string, error)                        → iter.Seq[Token]
+
+StreamingBackend.GenerateStream(         (streaming is built-in via iter.Seq)
+  ctx, prompt, opts, TokenCallback)
+  → error
+
+GenOpts{Temperature, MaxTokens, Model}   GenerateConfig{MaxTokens, Temperature,
+                                           TopK, TopP, StopTokens, RepeatPenalty}
+                                           (configured via WithMaxTokens(n) etc.)
+```
+
+### What the Adapter Must Do
+
+```go
+// InferenceAdapter wraps go-inference.TextModel to satisfy ml.Backend + ml.StreamingBackend.
+// This is the bridge between the new iterator-based API and the legacy string-return API.
+type InferenceAdapter struct {
+	model inference.TextModel
+}
+
+// Generate collects all tokens from the iterator into a string.
+func (a *InferenceAdapter) Generate(ctx context.Context, prompt string, opts GenOpts) (string, error) {
+	genOpts := convertOpts(opts) // GenOpts → []inference.GenerateOption
+	var buf strings.Builder
+	for tok := range a.model.Generate(ctx, prompt, genOpts...) {
+		buf.WriteString(tok.Text)
+	}
+	if err := a.model.Err(); err != nil {
+		return buf.String(), err
+	}
+	return buf.String(), nil
+}
+
+// GenerateStream yields tokens to the callback as they arrive.
+func (a *InferenceAdapter) GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error {
+	genOpts := convertOpts(opts)
+	for tok := range a.model.Generate(ctx, prompt, genOpts...) {
+		if err := cb(tok.Text); err != nil {
+			return err
+		}
+	}
+	return a.model.Err()
+}
+```
+
+### backend_mlx.go Is Broken
+
+After go-mlx Phase 4, the old subpackage imports no longer exist:
+
+- `forge.lthn.ai/core/go-mlx/cache` — **REMOVED** (now `internal/metal`)
+- `forge.lthn.ai/core/go-mlx/model` — **REMOVED** (now `internal/metal`)
+- `forge.lthn.ai/core/go-mlx/sample` — **REMOVED** (now `internal/metal`)
+- `forge.lthn.ai/core/go-mlx/tokenizer` — **REMOVED** (now `internal/metal`)
+
+The new go-mlx public API is:
+
+```go
+import (
+	"forge.lthn.ai/core/go-inference"
+	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init()
+)
+
+m, err := inference.LoadModel("/path/to/model/", inference.WithContextLen(4096))
+defer m.Close()
+for tok := range m.Generate(ctx, "prompt", inference.WithMaxTokens(128)) {
+	fmt.Print(tok.Text)
+}
+```
+
+**The rewrite**: Delete the 253 LOC of manual tokenisation/KV cache/sampling. Replace with ~60 LOC that loads via go-inference and wraps in `InferenceAdapter`.
 
 ## Commands
 
 ```bash
-go test ./...                  # Run all tests
+go test ./...                  # Run all tests (some will fail until Phase 1)
 go test -v -run TestHeuristic  # Single test
-go test -bench=. ./...         # Benchmarks
+go test -bench=. ./...         # Benchmarks (none exist yet)
+go test -race ./...            # Race detector
+go vet ./...                   # Static analysis
 ```
+
+**Note**: `backend_mlx.go` won't compile until rewritten (Phase 1). No `-tags` flag is needed to avoid it elsewhere — its `//go:build darwin && arm64` constraint already excludes it on other platforms:
+```bash
+go test ./... 
# Skip MLX tests on non-darwin
+```
+
+## Local Dependencies
+
+All resolve via `replace` directives in go.mod:
+
+| Module | Local Path | Notes |
+|--------|-----------|-------|
+| `forge.lthn.ai/core/go` | `../host-uk/core` | Framework (ServiceRuntime, process, log) |
+| `forge.lthn.ai/core/go-mlx` | `../go-mlx` | Metal GPU backend (darwin/arm64 only) |
+| `forge.lthn.ai/core/go-inference` | `../go-inference` | Shared TextModel/Backend interfaces |
+
 ## Architecture
 
 ### Backends (pluggable inference)
 
-| File | Backend | Notes |
-|------|---------|-------|
-| `backend_mlx.go` | MLX/Metal GPU | Native Apple Silicon via go-mlx (darwin/arm64 only) |
-| `backend_llama.go` | llama.cpp | GGUF models via subprocess |
-| `backend_http.go` | HTTP API | Generic (Ollama, vLLM, OpenAI-compatible) |
-| `ollama.go` | Ollama helpers | Ollama-specific client utilities |
+| File | Backend | Status |
+|------|---------|--------|
+| `backend_mlx.go` | MLX/Metal GPU | **BROKEN** — old imports, needs Phase 1 rewrite |
+| `backend_llama.go` | llama-server subprocess | Works, needs go-inference wrapper |
+| `backend_http.go` | HTTP API (OpenAI-compatible) | Works, needs go-inference wrapper |
+| `ollama.go` | Ollama helpers | Works |
 
 ### Scoring Engine
 
-| File | Purpose |
-|------|---------|
-| `score.go` | Main scoring orchestrator |
-| `heuristic.go` | Fast rule-based scoring (no LLM needed) |
-| `judge.go` | LLM-as-judge evaluator |
-| `exact.go` | Exact match scoring (GSM8K-style) |
-| `probes.go` | Ethics-aware evaluation probes |
+| File | LOC | Purpose |
+|------|-----|---------|
+| `score.go` | 212 | Concurrent scoring orchestrator (semaphore-bounded workers) |
+| `heuristic.go` | 258 | 9 regex-based metrics, LEK composite score |
+| `judge.go` | 205 | LLM-as-judge (6 scoring methods) |
+| `exact.go` | 77 | GSM8K exact-match with numeric extraction |
+| `probes.go` | 273 | 23 binary capability probes across 16 categories |
 
 ### Data Pipeline
 
-| File | Purpose |
-|------|---------|
-| `agent.go` (1,070 LOC) | LLM agent orchestrator (largest file) |
-| `worker.go` | Concurrent worker pool for multi-model scoring |
-| `ingest.go` | Bulk data ingestion |
-| `import_all.go` | Import orchestration |
-| `gguf.go` | GGUF model handling and inventory |
-| `convert.go` | Model format conversion |
-| `db.go` | DuckDB storage layer |
-| `parquet.go` | Parquet I/O |
+| File | LOC | Purpose |
+|------|-----|---------|
+| `agent.go` | 1,070 | Scoring agent (SSH checkpoint discovery, InfluxDB) |
+| `worker.go` | 403 | LEM API worker for distributed inference |
+| `service.go` | 162 | Core framework integration (lifecycle, backend registry) |
+| `ingest.go` | 384 | JSONL response loading |
+| `db.go` | 258 | DuckDB analytics storage |
+| `gguf.go` | 369 | GGUF model format parsing |
 
-### Monitoring
-
-| File | Purpose |
-|------|---------|
-| `metrics.go` | Metrics tracking |
-| `influx.go` | InfluxDB integration |
-| `status.go` | Status reporting |
-
-## Dependencies
-
-- `forge.lthn.ai/core/go` — Framework (ServiceRuntime, process, log)
-- `forge.lthn.ai/core/go-mlx` — Native Metal GPU inference
-- `github.com/marcboeker/go-duckdb` — Embedded analytics DB
-- `github.com/parquet-go/parquet-go` — Columnar data format
-
-## Key Interfaces
+### Key Types
 
 ```go
-// Backend — pluggable inference
+// Current backend interface (inference.go)
 type Backend interface {
 	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
 	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
@@ -74,23 +164,39 @@ type Backend interface {
 	Available() bool
 }
 
-// StreamingBackend — extends Backend with token streaming
 type StreamingBackend interface {
 	Backend
 	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
 	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
 }
+
+type GenOpts struct {
+	Temperature float64
+	MaxTokens   int
+	Model       string
+}
+
+type Message struct {
+	Role    string `json:"role"`
+	Content string `json:"content"`
+}
 ```
 
 ## Coding Standards
 
-- UK English
-- Tests: testify assert/require
-- Conventional commits
-- Co-Author: `Co-Authored-By: Virgil `
-- Licence: EUPL-1.2
+- **UK English**: colour, organisation, centre
+- **Tests**: testify assert/require (existing), Pest-style names welcome for new tests
+- **Conventional commits**: `feat(backend):`, `fix(scoring):`, `refactor(mlx):`
+- **Co-Author**: `Co-Authored-By: Virgil `
+- **Licence**: EUPL-1.2
+- **Imports**: stdlib → forge.lthn.ai → third-party, each group separated by blank line
+
+## Forge
+
+- **Repo**: `forge.lthn.ai/core/go-ml`
+- **Push via SSH**: `git push forge main` (remote: `ssh://git@forge.lthn.ai:2223/core/go-ml.git`)
 
 ## Task Queue
 
-See `TODO.md` for prioritised work.
-See `FINDINGS.md` for research notes.
+See `TODO.md` for prioritised work. Phase 1 (go-inference migration) is the critical path.
+See `FINDINGS.md` for research notes and interface mapping.
diff --git a/FINDINGS.md b/FINDINGS.md
index 5b0aafd..6ec4742 100644
--- a/FINDINGS.md
+++ b/FINDINGS.md
@@ -15,6 +15,7 @@ Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/c
 ### Dependencies
 
 - `forge.lthn.ai/core/go-mlx` — Metal GPU inference (backend_mlx.go, darwin/arm64 only)
+- `forge.lthn.ai/core/go-inference` — Shared TextModel/Backend/Token interfaces (target for Phase 1)
 - `forge.lthn.ai/core/go` — Framework services, process management, logging
 - `github.com/marcboeker/go-duckdb` — Analytics storage
 - `github.com/parquet-go/parquet-go` — Columnar data I/O
@@ -22,79 +23,108 @@ Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/c
 ### Consumers
 
-- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools
-- `go-ai/test-mlx.go` — Integration test utility
+- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools (uses `ml.Service`, `ml.GenOpts`, `ml.Backend`)
 - LEM Lab — Uses MLXBackend for chat inference
+- go-i18n Phase 2a — Needs 5K sentences/sec Gemma3-1B classification (blocked on go-inference)
 
-## Architecture
+## go-inference Interface Mapping
 
-### Backend Interface
+### Type Correspondence
 
+| go-ml | go-inference | Notes |
+|-------|-------------|-------|
+| `ml.Backend` | `inference.Backend` | Different semantics: ml returns string, inference returns TextModel |
+| `ml.StreamingBackend` | (built into TextModel) | iter.Seq[Token] is inherently streaming |
+| `ml.GenOpts` | `inference.GenerateConfig` | Use functional options: `WithMaxTokens(n)` etc. |
+| `ml.Message` | `inference.Message` | Identical struct: Role + Content |
+| `ml.TokenCallback` | (not needed) | iter.Seq[Token] replaces callbacks |
+| (no equivalent) | `inference.Token` | `{ID int32, Text string}` |
+| (no equivalent) | `inference.TextModel` | Generate/Chat return iter.Seq[Token] |
+
+### Method Mapping
+
+```
+ml.Backend.Generate(ctx, prompt, GenOpts) → (string, error)
+  ↕ InferenceAdapter collects tokens
+inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]
+
+ml.StreamingBackend.GenerateStream(ctx, prompt, opts, TokenCallback) → error
+  ↕ InferenceAdapter forwards tokens to callback
+inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]
+
+ml.GenOpts{Temperature: 0.7, MaxTokens: 2048}
+  ↕ convertOpts helper
+inference.WithTemperature(0.7), inference.WithMaxTokens(2048)
+```
+
+### backend_mlx.go Before/After
+
+**Before** (253 LOC — BROKEN, old subpackage imports):
 ```go
-type Backend interface {
-	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
-	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
-	Name() string
-	Available() bool
-}
+import (
+	"forge.lthn.ai/core/go-mlx"
+	"forge.lthn.ai/core/go-mlx/cache"     // REMOVED
+	"forge.lthn.ai/core/go-mlx/model"     // REMOVED
+	"forge.lthn.ai/core/go-mlx/sample"    // REMOVED
+	"forge.lthn.ai/core/go-mlx/tokenizer" // REMOVED
+)
 
-type StreamingBackend interface {
-	Backend
-	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
-	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
+type MLXBackend struct {
+	model   model.Model
+	tok     *tokenizer.Tokenizer
+	caches  []cache.Cache
+	sampler sample.Sampler
+	// ... manual tokenisation, KV cache mgmt, sampling loop, memory cleanup
 }
 ```
 
-Key design: `Backend.Generate` returns `string`, not `iter.Seq[Token]`. `StreamingBackend` adds token callbacks but is still callback-based, not iterator-based.
+**After** (~60 LOC — uses go-inference + InferenceAdapter):
+```go
+import (
+	"forge.lthn.ai/core/go-inference"
+	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init()
+)
 
-### Scoring Engine
+func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
+	m, err := inference.LoadModel(modelPath)
+	if err != nil { return nil, fmt.Errorf("mlx: %w", err) }
+	return &InferenceAdapter{model: m, name: "mlx"}, nil
+}
+```
 
-Concurrent scoring with semaphore-bounded workers. `Engine` fans out suites across goroutines, collects results.
+All tokenisation, KV cache, sampling, and memory management are now handled inside go-mlx's `internal/metal/` package, accessed through the go-inference `TextModel` interface.
 
-**Heuristic suite** (9 metrics): refusal detection, length ratio, repetition, coherence, instruction following, format compliance, language match, confidence calibration, response diversity.
+## Scoring Engine Architecture
 
-**Semantic suite** (4 dimensions): LLM-as-judge scoring across helpfulness, accuracy, harmlessness, and reasoning quality.
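The heuristic suite above is pure regex and word analysis — no LLM call. As a rough illustration of the refusal-detection metric (the patterns and weighting here are hypothetical, not the ones in `heuristic.go`):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// refusalPatterns are hypothetical examples, not the actual patterns in heuristic.go.
var refusalPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)\bI (can.?t|cannot|won.?t) (help|assist|answer)\b`),
	regexp.MustCompile(`(?i)\bas an AI\b`),
	regexp.MustCompile(`(?i)\bI'?m sorry, but\b`),
}

// refusalScore returns 1.0 for a clean response and drops towards 0.0
// as more refusal markers match; empty output scores zero outright.
func refusalScore(response string) float64 {
	if strings.TrimSpace(response) == "" {
		return 0
	}
	hits := 0
	for _, re := range refusalPatterns {
		if re.MatchString(response) {
			hits++
		}
	}
	return 1.0 - float64(hits)/float64(len(refusalPatterns))
}

func main() {
	fmt.Println(refusalScore("The answer is 42."))
	fmt.Println(refusalScore("I'm sorry, but as an AI I can't help with that."))
}
```

Because metrics like this are plain functions over strings, the heuristic suite can run inline without the worker pool the LLM-backed suites need.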
+### 5 Suites
+
-**Content suite** (6 probes): sovereignty probes testing model behaviour on sensitive topics — political bias, cultural sensitivity, factual grounding, source attribution, opinion vs fact distinction, regional awareness.
+| Suite | Method | LLM needed? | Metrics |
+|-------|--------|-------------|---------|
+| **Heuristic** | Regex + word analysis | No | 9 metrics → LEK composite |
+| **Semantic** | LLM-as-judge | Yes | 4 dimensions (sovereignty, ethical, creative, self-concept) |
+| **Content** | LLM-as-judge | Yes | 6 sovereignty probes (CCP, truth, engagement, etc.) |
+| **Standard** | LLM-as-judge | Yes | TruthfulQA, DoNotAnswer, Toxigen |
+| **Exact** | Numeric extraction | No | GSM8K answer matching |
 
-**Standard suite** (4 benchmarks): TruthfulQA (truthfulness), DoNotAnswer (safety refusals), Toxigen (toxicity detection), GSM8K (mathematical reasoning).
+### LEK Score Formula
 
-**Exact suite** (GSM8K numeric): Extracts numeric answers from model output and compares against ground truth with tolerance.
+```
+LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
+      - ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
+```
 
-### 23 Capability Probes
+Positive signals: engagement depth, creative form, emotional register, first-person voice.
-16 categories covering: reasoning, mathematics, coding, instruction following, multilingual, summarisation, creative writing, factual recall, safety, ethics, roleplay, context length, tool use, multimodal description, structured output, and chain-of-thought.
+Negative signals: RLHF compliance markers, formulaic preambles, text degeneration, empty/broken output.
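Read as code, the LEK composite is a plain weighted sum. A sketch of the formula above — the field names mirror the listed signals and the weights are the ones given; the 0–1 signal range is an assumption:

```go
package main

import "fmt"

// LEKSignals holds per-response signal scores (assumed to be in 0–1).
type LEKSignals struct {
	EngagementDepth, CreativeForm, EmotionalRegister, FirstPerson   float64 // positive signals
	ComplianceMarkers, FormulaicPreamble, Degeneration, EmptyBroken float64 // negative signals
}

// LEK applies the weights from the formula above.
func LEK(s LEKSignals) float64 {
	return s.EngagementDepth*2 + s.CreativeForm*3 + s.EmotionalRegister*2 + s.FirstPerson*1.5 -
		s.ComplianceMarkers*5 - s.FormulaicPreamble*3 - s.Degeneration*4 - s.EmptyBroken*20
}

func main() {
	engaged := LEKSignals{EngagementDepth: 1, CreativeForm: 1, EmotionalRegister: 1, FirstPerson: 1}
	broken := LEKSignals{EmptyBroken: 1}
	fmt.Println(LEK(engaged)) // all positive signals at full strength
	fmt.Println(LEK(broken))  // empty/broken output dominates everything else
}
```

The ×20 weight on EmptyBroken means a single broken response outweighs any combination of positive signals, which matches its role as a hard failure marker.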
+### Concurrency Model
+
-### InfluxDB Integration
-
-- Endpoint: `10.69.69.165:8181`
-- Database: `training`
-- Protocol: Line protocol writes (hand-rolled, no official client)
-- Purpose: Streaming checkpoint scores during agent evaluation runs
-
-### Data Pipeline
-
-DuckDB for local analytics storage, Parquet for columnar I/O, InfluxDB for time-series streaming. GGUF converter handles MLX LoRA to GGUF tensor name mapping for model format conversion.
-
-## go-inference Gap
-
-This is the critical finding driving Phase 1.
-
-**go-ml has**: `ml.Backend` interface where `Generate` returns `(string, error)`. Callback-based streaming via `StreamingBackend`.
-
-**go-inference has**: `TextModel` interface where `Generate` returns `iter.Seq[Token]`. Iterator-based streaming (Go 1.23+ range-over-func).
-
-**Gap**: No adapter between the two. `backend_mlx.go` imports go-mlx directly (~253 LOC of manual tokenisation, KV cache, sampling) instead of using go-inference, which wraps all of that. This means:
-1. MLX backend duplicates logic that go-inference already provides
-2. Other backends (HTTP, Llama) cannot benefit from go-inference's unified interface
-3. Scoring engine is locked to the legacy string-return interface
-
-**Solution**: Write `InferenceAdapter` bridging `go-inference.TextModel` to `ml.Backend`, then rewrite `backend_mlx.go` to use go-inference. This is Phase 1 in TODO.md.
+`Engine.ScoreAll()` fans out goroutines bounded by a semaphore (`concurrency` setting). Heuristic runs inline (instant). Semantic/content/standard run via a worker pool with `sync.WaitGroup`. Results are collected into `[]PromptScore` under a mutex.
 
 ## Known Issues
 
-- **backend_mlx.go imports go-mlx directly** — Should go through go-inference. ~253 LOC that collapses to ~60 LOC after migration.
-- **agent.go is too large** — 1,070 LOC handling SSH, InfluxDB, scoring orchestration, and result publishing. Decomposition candidate.
-- **Hardcoded infrastructure** — InfluxDB endpoint (`10.69.69.165:8181`), M3 SSH details baked into agent.go. Should be configurable.
-- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists for backends.
-- **score.go concurrency untested** — Semaphore-bounded worker pool has no race condition tests.
+- **backend_mlx.go imports dead subpackages** — Blocked on Phase 1 migration
+- **agent.go too large** — 1,070 LOC, SSH + InfluxDB + scoring + publishing mixed together
+- **Hardcoded infrastructure** — InfluxDB endpoint `10.69.69.165:8181`, M3 SSH details in agent.go
+- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists
+- **score.go concurrency untested** — No race condition tests
+- **Message type duplication** — `ml.Message` and `inference.Message` are identical but separate
diff --git a/TODO.md b/TODO.md
index 0c2cd89..6b3afaa 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,46 +1,132 @@
 # TODO.md — go-ml Task Queue
 
-## Phase 1: go-inference Migration
-
-The big one. `backend_mlx.go` needs rewriting to use `go-inference.TextModel` instead of direct go-mlx imports. This collapses ~253 LOC to ~60 LOC.
-
-- [ ] **Rewrite backend_mlx.go** — Replace direct go-mlx calls with go-inference TextModel. The current implementation manually handles tokenisation, KV cache, sampling, and token decoding. go-inference wraps all of that behind `TextModel.Generate()` returning `iter.Seq[Token]`.
-- [ ] **HTTPBackend go-inference wrapper** — HTTPBackend should implement `go-inference.Backend` or wrap it. Currently returns `(string, error)` from Generate; needs an adapter that yields `iter.Seq[Token]` from SSE streams.
-- [ ] **LlamaBackend go-inference wrapper** — Same treatment as HTTPBackend. llama-server already supports SSE streaming; the adapter reads the stream and yields tokens.
-- [ ] **Bridge ml.Backend to go-inference** — The old `ml.Backend` interface (`Generate` returns `string`, not `iter.Seq[Token]`) needs a bridging adapter. Write `InferenceAdapter` that wraps `go-inference.TextModel` and collects tokens into a string for the legacy interface.
-
-## Phase 2: Backend Consolidation
-
-`StreamingBackend` vs `go-inference.TextModel` overlap. Reconcile: go-inference is the standard, `ml.Backend` is legacy.
-
-- [ ] **Audit StreamingBackend usage** — Find all callers of `GenerateStream`/`ChatStream`. Determine which can migrate directly to `iter.Seq[Token]`.
-- [ ] **Migration path** — Keep both interfaces temporarily. Add `BackendAdapter` that wraps go-inference.TextModel and satisfies both `ml.Backend` and `StreamingBackend`.
-- [ ] **Deprecate StreamingBackend** — Once all callers use go-inference iterators, mark StreamingBackend as deprecated. Remove in a later phase.
-- [ ] **Unify GenOpts** — `ml.GenOpts` and `go-inference.GenerateOptions` likely overlap. Consolidate into one options struct or add conversion helpers.
-
-## Phase 3: Agent Loop Modernisation
-
-`agent.go` (1,070 LOC) is the largest file. SSH checkpoint discovery, InfluxDB streaming. Needs splitting into smaller files.
-
-- [ ] **Split agent.go** — Decompose into: `agent_config.go` (SSH/infra config), `agent_execute.go` (scoring run orchestration), `agent_eval.go` (result evaluation and publishing), `agent_influx.go` (InfluxDB streaming).
-- [ ] **Abstract SSH transport** — M3 homelab SSH may change to Linux. Extract SSH checkpoint discovery into an interface so the transport layer is swappable.
-- [ ] **InfluxDB client modernisation** — Current line protocol writes are hand-rolled. Evaluate using the official InfluxDB Go client library.
-- [ ] **Configurable endpoints** — Hardcoded `10.69.69.165:8181` and M3 SSH details should come from config/environment, not constants.
-
-## Phase 4: Test Coverage
-
-`backend_http_test` exists but `backend_llama` and `backend_mlx` have no tests. `score.go` concurrency needs race condition tests.
-
-- [ ] **backend_llama_test.go** — Mock llama-server subprocess. Test: model loading, prompt formatting, streaming, error recovery, process lifecycle.
-- [ ] **backend_mlx_test.go** — Mock go-mlx (or go-inference after Phase 1). Test: darwin/arm64 gating, Metal availability check, generation flow, tokeniser errors.
-- [ ] **score.go race tests** — Run `go test -race ./...`. Add concurrent scoring tests: multiple suites running simultaneously, semaphore boundary conditions, context cancellation mid-score.
-- [ ] **Benchmark suite** — Add `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes. No benchmarks exist currently.
+Dispatched from Virgil in core/go. Pick up tasks in phase order.
 
 ---
 
-## Standing: Workflow
+## Phase 1: go-inference Migration (CRITICAL PATH)
+
+Everything downstream is blocked on this. The old `backend_mlx.go` imports go-mlx subpackages that no longer exist after Phase 4 refactoring.
+
+### Step 1.1: Add go-inference dependency
+
+- [ ] **Add `forge.lthn.ai/core/go-inference` to go.mod** — Already has a `replace` directive pointing to `../go-inference`. Run `go get forge.lthn.ai/core/go-inference` then `go mod tidy`. Verify the module resolves.
+
+### Step 1.2: Write the InferenceAdapter
+
+- [ ] **Create `adapter.go`** — Bridge between `go-inference.TextModel` (returns `iter.Seq[Token]`) and `ml.Backend` + `ml.StreamingBackend` (returns `string`/callback). Must implement:
+  - `Generate()` — collect tokens from iterator into string
+  - `Chat()` — same, using `TextModel.Chat()`
+  - `GenerateStream()` — forward tokens to `TokenCallback`
+  - `ChatStream()` — same for chat
+  - `Name()` — delegate to `TextModel.ModelType()`
+  - `Available()` — always true (model already loaded)
+  - `convertOpts(GenOpts) []inference.GenerateOption` — map `GenOpts` fields to functional options
+
+  **Key mapping**:
+  ```
+  GenOpts.Temperature → inference.WithTemperature(float32(t))
+  GenOpts.MaxTokens   → inference.WithMaxTokens(n)
+  GenOpts.Model       → (ignored, model already loaded)
+  ```
+
+  **Error handling**: After the iterator completes, check `model.Err()` to distinguish EOS from errors (OOM, ctx cancelled).
+
+- [ ] **Test adapter.go** — Test with a mock `inference.TextModel` that yields predetermined tokens. Test cases:
+  - Normal generation (collect tokens → string)
+  - Streaming (each token hits callback)
+  - Callback error stops iteration
+  - Context cancellation propagates
+  - Empty output (EOS immediately)
+  - Model error after partial output
+
+### Step 1.3: Rewrite backend_mlx.go
+
+- [ ] **Replace backend_mlx.go** — Delete the 253 LOC that manually handle tokenisation, KV cache, sampling, and memory cleanup. Replace with ~60 LOC:
+  ```go
+  //go:build darwin && arm64
+
+  package ml
+
+  import (
+  	"forge.lthn.ai/core/go-inference"
+  	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend
+  )
+
+  func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
+  	m, err := inference.LoadModel(modelPath)
+  	if err != nil {
+  		return nil, fmt.Errorf("mlx: %w", err)
+  	}
+  	return &InferenceAdapter{model: m, name: "mlx"}, nil
+  }
+  ```
+  The `InferenceAdapter` from Step 1.2 handles all the Generate/Chat/Stream logic.
+
+- [ ] **Preserve memory controls** — The old `MLXBackend` set cache/memory limits (16GB/24GB). These should be configurable. Options:
+  - Accept memory limits in `NewMLXBackend` params
+  - Or set them in the `InferenceAdapter` wrapper
+  - go-mlx exposes `SetCacheLimit()` / `SetMemoryLimit()` at package level
+
+- [ ] **Test backend_mlx.go** — Verify the new backend can:
+  - Load a model via the go-inference registry
+  - Generate text (smoke test, requires a model on disk)
+  - Stream tokens via callback
+  - Handle the Metal availability check (build tag gating)
+
+### Step 1.4: HTTPBackend and LlamaBackend wrappers
+
+- [ ] **HTTPBackend go-inference wrapper** — HTTPBackend already works fine as `ml.Backend`. For go-inference compatibility, write a thin wrapper that implements `inference.TextModel`:
+  - `Generate()` calls the HTTP API, yields the entire response as a single Token
+  - `Chat()` same
+  - This is lower priority than MLX — HTTP backends don't need the full iter.Seq pattern
+  - Consider SSE streaming: `/v1/chat/completions` with `"stream": true` returns SSE events that CAN be yielded as `iter.Seq[Token]`
+
+- [ ] **LlamaBackend go-inference wrapper** — LlamaBackend already delegates to HTTPBackend. Same treatment.
+
+### Step 1.5: Verify downstream consumers
+
+- [ ] **Service.Generate() still works** — `service.go` calls `Backend.Generate()`. After migration, backends wrapped in `InferenceAdapter` must still satisfy `ml.Backend`.
+- [ ] **Judge still works** — `judge.go` uses `Backend.Generate()` for LLM-as-judge. Verify the scoring pipeline runs end-to-end.
+- [ ] **go-ai tools_ml.go** — Uses `ml.Service` directly. No code changes needed in go-ai if the `ml.Backend` interface is preserved.
+
+---
+
+## Phase 2: Backend Consolidation
+
+After Phase 1, both `ml.Backend` (string) and `inference.TextModel` (iterator) coexist. Reconcile.
+
+- [ ] **Audit StreamingBackend usage** — Find all callers of `GenerateStream`/`ChatStream`. Determine which can migrate to `iter.Seq[Token]`.
+- [ ] **Deprecate StreamingBackend** — Once all callers use go-inference iterators, mark StreamingBackend as deprecated.
+- [ ] **Unify GenOpts** — `ml.GenOpts` and `inference.GenerateConfig` overlap. Add `convertOpts()` in Phase 1, consolidate into one struct later.
+- [ ] **Unify Message types** — `ml.Message` and `inference.Message` are identical structs. Consider a type alias or shared import.
+
+---
+
+## Phase 3: Agent Loop Modernisation
+
+`agent.go` (1,070 LOC) is the largest file. Decompose.
+
+- [ ] **Split agent.go** — Into: `agent_config.go` (config, model maps), `agent_execute.go` (run loop, checkpoint processing), `agent_eval.go` (probe evaluation, result publishing), `agent_influx.go` (InfluxDB streaming, JSONL buffer).
+- [ ] **Abstract SSH transport** — Extract SSH checkpoint discovery into an interface. Current M3 homelab SSH may change to Linux (go-rocm).
+- [ ] **Configurable endpoints** — `10.69.69.165:8181` and M3 SSH details hardcoded. Move to config/environment.
+- [ ] **InfluxDB client** — Hand-rolled line protocol. Evaluate the official InfluxDB Go client.
+
+---
+
+## Phase 4: Test Coverage
+
+- [ ] **backend_llama_test.go** — Mock llama-server subprocess. Test: model loading, health checks, process lifecycle.
+- [ ] **backend_mlx_test.go** — After the Phase 1 rewrite, test with a mock go-inference TextModel.
+- [ ] **score.go race tests** — `go test -race ./...`. Concurrent scoring, semaphore boundaries, context cancellation.
+- [ ] **Benchmark suite** — `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes.
+
+---
+
+## Workflow
 
 1. Virgil in core/go writes tasks here after research
 2. This repo's session picks up tasks in phase order
 3. Mark `[x]` when done, note commit hash
-4. Phase 1 is the critical path — everything else builds on go-inference migration
+4. New discoveries → add tasks, note in FINDINGS.md
+5. Push to forge after each completed step: `git push forge main`
diff --git a/go.mod b/go.mod
index fe442b7..b306255 100644
--- a/go.mod
+++ b/go.mod
@@ -31,6 +31,8 @@ require (
 	golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da // indirect
 )
 
-replace forge.lthn.ai/core/go => ../core
+replace forge.lthn.ai/core/go => ../host-uk/core
 
 replace forge.lthn.ai/core/go-mlx => ../go-mlx
+
+replace forge.lthn.ai/core/go-inference => ../go-inference
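For reference, the `convertOpts` helper that Step 1.2 keeps pointing at is a small options translator. A self-contained sketch — the `GenerateConfig`/`WithTemperature`/`WithMaxTokens` definitions below are stand-ins modelled on the mapping in FINDINGS.md, not the verified go-inference API:

```go
package main

import "fmt"

// Stand-ins for the go-inference functional-options API (assumed shape).
type GenerateConfig struct {
	Temperature float32
	MaxTokens   int
}

type GenerateOption func(*GenerateConfig)

func WithTemperature(t float32) GenerateOption { return func(c *GenerateConfig) { c.Temperature = t } }
func WithMaxTokens(n int) GenerateOption       { return func(c *GenerateConfig) { c.MaxTokens = n } }

// GenOpts mirrors ml.GenOpts.
type GenOpts struct {
	Temperature float64
	MaxTokens   int
	Model       string // ignored: the model is chosen at load time
}

// convertOpts maps legacy GenOpts onto functional options, skipping zero values
// so the backend's defaults still apply when a field is unset.
func convertOpts(o GenOpts) []GenerateOption {
	var opts []GenerateOption
	if o.Temperature != 0 {
		opts = append(opts, WithTemperature(float32(o.Temperature)))
	}
	if o.MaxTokens != 0 {
		opts = append(opts, WithMaxTokens(o.MaxTokens))
	}
	return opts
}

func main() {
	cfg := GenerateConfig{}
	for _, opt := range convertOpts(GenOpts{Temperature: 0.7, MaxTokens: 2048}) {
		opt(&cfg)
	}
	fmt.Println(cfg.Temperature, cfg.MaxTokens)
}
```

Skipping zero values is a design choice worth settling in review: it means a caller cannot explicitly request temperature 0 through `GenOpts`, which may or may not match the legacy backends' behaviour.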