docs: add TODO.md, FINDINGS.md for fleet delegation

Co-Authored-By: Virgil <virgil@lethean.io>
Virgil 2026-02-19 21:33:53 +00:00 committed by Snider
parent 7c8e762a86
commit 88e926cc24
2 changed files with 105 additions and 25 deletions

FINDINGS.md

### Origin
Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only.
### What Was Extracted
- 41 Go files (~7,494 LOC excluding tests)
- 6 test files (backend_http, exact, heuristic, judge, probes, score)
### Key Finding: Heaviest Package
- ml/ was 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).
### Dependencies
- `forge.lthn.ai/core/go` — Framework services, process management, logging
- `github.com/marcboeker/go-duckdb` — Analytics storage
- `github.com/parquet-go/parquet-go` — Columnar data I/O
- `github.com/stretchr/testify` — Test assertions
### Consumers
- `go-ai/test-mlx.go` — Integration test utility
- LEM Lab — Uses MLXBackend for chat inference
## Architecture
At 1,070 LOC, agent.go is the largest file. It orchestrates:
- Multi-model scoring runs
- Remote M3 infrastructure scheduling
- Ethics-aware probe evaluation
- Result consolidation and publishing
This file is a decomposition candidate but functional as-is.
### Backend Interface
```go
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
	Name() string
	Available() bool
}

type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}
```
Key design: `Backend.Generate` returns `string`, not `iter.Seq[Token]`. `StreamingBackend` adds token callbacks but is still callback-based, not iterator-based.
### Scoring Engine
Concurrent scoring with semaphore-bounded workers. `Engine` fans out suites across goroutines, collects results.
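A minimal sketch of that fan-out shape, with illustrative types rather than the Engine's real API:

```go
package main

import (
	"fmt"
	"sync"
)

// result is a stand-in for the engine's per-suite output.
type result struct {
	suite string
	score float64
}

// runSuites fans suites out across goroutines while a buffered-channel
// semaphore bounds how many run at once, then collects all results.
func runSuites(suites []string, maxWorkers int, run func(string) float64) []result {
	sem := make(chan struct{}, maxWorkers) // bounds concurrent suites
	out := make(chan result, len(suites))
	var wg sync.WaitGroup
	for _, s := range suites {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it
			out <- result{suite: s, score: run(s)}
		}(s)
	}
	wg.Wait()
	close(out)
	var results []result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	rs := runSuites([]string{"heuristic", "semantic", "exact"}, 2,
		func(s string) float64 { return float64(len(s)) })
	fmt.Println(len(rs))
}
```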
**Heuristic suite** (9 metrics): refusal detection, length ratio, repetition, coherence, instruction following, format compliance, language match, confidence calibration, response diversity.
**Semantic suite** (4 dimensions): LLM-as-judge scoring across helpfulness, accuracy, harmlessness, and reasoning quality.
**Content suite** (6 probes): sovereignty probes testing model behaviour on sensitive topics — political bias, cultural sensitivity, factual grounding, source attribution, opinion vs fact distinction, regional awareness.
**Standard suite** (4 benchmarks): TruthfulQA (truthfulness), DoNotAnswer (safety refusals), Toxigen (toxicity detection), GSM8K (mathematical reasoning).
**Exact suite** (GSM8K numeric): Extracts numeric answers from model output and compares against ground truth with tolerance.
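A hedged sketch of how that numeric grading can work — the exact suite's real extraction rules may be stricter:

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
	"strings"
)

// Matches integers or decimals, with optional thousands separators and sign.
var numRe = regexp.MustCompile(`-?\d+(?:,\d{3})*(?:\.\d+)?`)

// extractAnswer pulls the last number out of the model's output, on the
// assumption the final number is the answer.
func extractAnswer(output string) (float64, bool) {
	matches := numRe.FindAllString(output, -1)
	if len(matches) == 0 {
		return 0, false
	}
	last := strings.ReplaceAll(matches[len(matches)-1], ",", "")
	v, err := strconv.ParseFloat(last, 64)
	return v, err == nil
}

// gradeExact compares the extracted answer against ground truth within tol.
func gradeExact(output string, truth, tol float64) bool {
	v, ok := extractAnswer(output)
	return ok && math.Abs(v-truth) <= tol
}

func main() {
	fmt.Println(gradeExact("So the total is 1,234 apples.", 1234, 1e-6))
}
```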
### 23 Capability Probes
16 categories covering: reasoning, mathematics, coding, instruction following, multilingual, summarisation, creative writing, factual recall, safety, ethics, roleplay, context length, tool use, multimodal description, structured output, and chain-of-thought.
### InfluxDB Integration
- Endpoint: `10.69.69.165:8181`
- Database: `training`
- Protocol: Line protocol writes (hand-rolled, no official client)
- Purpose: Streaming checkpoint scores during agent evaluation runs
### Data Pipeline
DuckDB for local analytics storage, Parquet for columnar I/O, InfluxDB for time-series streaming. GGUF converter handles MLX LoRA to GGUF tensor name mapping for model format conversion.
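To make the tensor name mapping concrete, here is a purely illustrative sketch — both name patterns are assumptions, not the converter's actual mapping table:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical MLX-LoRA name pattern: model.layers.<n>.self_attn.q_proj.lora_<a|b>
var layerRe = regexp.MustCompile(`^model\.layers\.(\d+)\.self_attn\.q_proj\.lora_([ab])$`)

// mapTensorName rewrites an MLX-style LoRA tensor name into a GGUF-style
// one (blk.<n>.attn_q...). Illustrative only; the real converter covers
// many more projections and architectures.
func mapTensorName(mlx string) (string, bool) {
	if m := layerRe.FindStringSubmatch(mlx); m != nil {
		return fmt.Sprintf("blk.%s.attn_q.weight.lora_%s", m[1], m[2]), true
	}
	return "", false
}

func main() {
	name, ok := mapTensorName("model.layers.0.self_attn.q_proj.lora_a")
	fmt.Println(name, ok)
}
```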
## go-inference Gap
This is the critical finding driving Phase 1.
**go-ml has**: `ml.Backend` interface where `Generate` returns `(string, error)`. Callback-based streaming via `StreamingBackend`.
**go-inference has**: `TextModel` interface where `Generate` returns `iter.Seq[Token]`. Iterator-based streaming (Go 1.23+ range-over-func).
**Gap**: No adapter between the two. `backend_mlx.go` imports go-mlx directly (~253 LOC of manual tokenisation, KV cache, sampling) instead of using go-inference which wraps all of that. This means:
1. MLX backend duplicates logic that go-inference already provides
2. Other backends (HTTP, Llama) cannot benefit from go-inference's unified interface
3. Scoring engine is locked to the legacy string-return interface
**Solution**: Write `InferenceAdapter` bridging `go-inference.TextModel` to `ml.Backend`, then rewrite `backend_mlx.go` to use go-inference. This is Phase 1 in TODO.md.
## Known Issues
- **backend_mlx.go imports go-mlx directly** — Should go through go-inference. ~253 LOC that collapses to ~60 LOC after migration.
- **agent.go is too large** — 1,070 LOC handling SSH, InfluxDB, scoring orchestration, and result publishing. Decomposition candidate.
- **Hardcoded infrastructure** — InfluxDB endpoint (`10.69.69.165:8181`), M3 SSH details baked into agent.go. Should be configurable.
- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists for backends.
- **score.go concurrency untested** — Semaphore-bounded worker pool has no race condition tests.

TODO.md

# TODO.md — go-ml Task Queue
## Phase 1: go-inference Migration
The big one. `backend_mlx.go` needs rewriting to use `go-inference.TextModel` instead of direct go-mlx imports. This collapses ~253 LOC to ~60 LOC.
- [ ] **Rewrite backend_mlx.go** — Replace direct go-mlx calls with go-inference TextModel. The current implementation manually handles tokenisation, KV cache, sampling, and token decoding. go-inference wraps all of that behind `TextModel.Generate()` returning `iter.Seq[Token]`.
- [ ] **HTTPBackend go-inference wrapper** — HTTPBackend should implement `go-inference.Backend` or wrap it. Currently returns `(string, error)` from Generate; needs an adapter that yields `iter.Seq[Token]` from SSE streams.
- [ ] **LlamaBackend go-inference wrapper** — Same treatment as HTTPBackend. llama-server already supports SSE streaming; the adapter reads the stream and yields tokens.
- [ ] **Bridge ml.Backend to go-inference** — The old `ml.Backend` interface (`Generate` returns `string`, not `iter.Seq[Token]`) needs a bridging adapter. Write `InferenceAdapter` that wraps `go-inference.TextModel` and collects tokens into a string for the legacy interface.
## Phase 2: Backend Consolidation
`StreamingBackend` and `go-inference.TextModel` overlap. Reconcile them: go-inference is the standard; `ml.Backend` is legacy.
- [ ] **Audit StreamingBackend usage** — Find all callers of `GenerateStream`/`ChatStream`. Determine which can migrate directly to `iter.Seq[Token]`.
- [ ] **Migration path** — Keep both interfaces temporarily. Add `BackendAdapter` that wraps go-inference.TextModel and satisfies both `ml.Backend` and `StreamingBackend`.
- [ ] **Deprecate StreamingBackend** — Once all callers use go-inference iterators, mark StreamingBackend as deprecated. Remove in a later phase.
- [ ] **Unify GenOpts** — `ml.GenOpts` and `go-inference.GenerateOptions` likely overlap. Consolidate into one options struct or add conversion helpers.
## Phase 3: Agent Loop Modernisation
`agent.go` (1,070 LOC) is the largest file, mixing SSH checkpoint discovery, InfluxDB streaming, and scoring orchestration. It needs splitting into smaller files.
- [ ] **Split agent.go** — Decompose into: `agent_config.go` (SSH/infra config), `agent_execute.go` (scoring run orchestration), `agent_eval.go` (result evaluation and publishing), `agent_influx.go` (InfluxDB streaming).
- [ ] **Abstract SSH transport** — M3 homelab SSH may change to Linux. Extract SSH checkpoint discovery into an interface so the transport layer is swappable.
- [ ] **InfluxDB client modernisation** — Current line protocol writes are hand-rolled. Evaluate using the official InfluxDB Go client library.
- [ ] **Configurable endpoints** — Hardcoded `10.69.69.165:8181` and M3 SSH details should come from config/environment, not constants.
## Phase 4: Test Coverage
`backend_http_test` exists but `backend_llama` and `backend_mlx` have no tests. `score.go` concurrency needs race condition tests.
- [ ] **backend_llama_test.go** — Mock llama-server subprocess. Test: model loading, prompt formatting, streaming, error recovery, process lifecycle.
- [ ] **backend_mlx_test.go** — Mock go-mlx (or go-inference after Phase 1). Test: darwin/arm64 gating, Metal availability check, generation flow, tokeniser errors.
- [ ] **score.go race tests** — Run `go test -race ./...`. Add concurrent scoring tests: multiple suites running simultaneously, semaphore boundary conditions, context cancellation mid-score.
- [ ] **Benchmark suite** — Add `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes. No benchmarks exist currently.
---
## Standing: Workflow
1. Virgil in core/go writes tasks here after research
2. This repo's session picks up tasks in phase order
3. Mark `[x]` when done, note commit hash
4. Phase 1 is the critical path — everything else builds on go-inference migration