From 88e926cc24ea0d0282a2b0ceef520e5723af0eb9 Mon Sep 17 00:00:00 2001
From: Virgil
Date: Thu, 19 Feb 2026 21:33:53 +0000
Subject: [PATCH] docs: add TODO.md, FINDINGS.md for fleet delegation

Co-Authored-By: Virgil
---
 FINDINGS.md | 85 +++++++++++++++++++++++++++++++++++++++++++++--------
 TODO.md     | 45 ++++++++++++++++++++--------
 2 files changed, 105 insertions(+), 25 deletions(-)

diff --git a/FINDINGS.md b/FINDINGS.md
index 8434b60..5b0aafd 100644
--- a/FINDINGS.md
+++ b/FINDINGS.md
@@ -4,16 +4,13 @@
### Origin

-Extracted from `forge.lthn.ai/core/go-ai/ml/`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only.
+Split from go-ai on 19 Feb 2026. Was the `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. Zero internal go-ai dependencies — imports go-mlx (external module) and the core/go framework only.

### What Was Extracted

- 41 Go files (~7,494 LOC excluding tests)
- 6 test files (backend_http, exact, heuristic, judge, probes, score)
-
-### Key Finding: Heaviest Package
-
-ml/ is 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).
+- ml/ was 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).

### Dependencies

@@ -21,6 +18,7 @@
- `forge.lthn.ai/core/go` — Framework services, process management, logging
- `github.com/marcboeker/go-duckdb` — Analytics storage
- `github.com/parquet-go/parquet-go` — Columnar data I/O
+- `github.com/stretchr/testify` — Test assertions

### Consumers

@@ -28,12 +26,75 @@
- `go-ai/test-mlx.go` — Integration test utility
- LEM Lab — Uses MLXBackend for chat inference

-### Architecture Note: agent.go
+## Architecture

-At 1,070 LOC, agent.go is the largest file.
It orchestrates:
-- Multi-model scoring runs
-- Remote M3 infrastructure scheduling
-- Ethics-aware probe evaluation
-- Result consolidation and publishing

+### Backend Interface

-This file is a decomposition candidate but functional as-is.

+```go
+type Backend interface {
+    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
+    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
+    Name() string
+    Available() bool
+}
+
+type StreamingBackend interface {
+    Backend
+    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
+    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
+}
+```
+
+Key design: `Backend.Generate` returns `string`, not `iter.Seq[Token]`. `StreamingBackend` adds token callbacks but is still callback-based, not iterator-based.
+
+### Scoring Engine
+
+Concurrent scoring with semaphore-bounded workers. `Engine` fans suites out across goroutines and collects the results.
+
+**Heuristic suite** (9 metrics): refusal detection, length ratio, repetition, coherence, instruction following, format compliance, language match, confidence calibration, response diversity.
+
+**Semantic suite** (4 dimensions): LLM-as-judge scoring across helpfulness, accuracy, harmlessness, and reasoning quality.
+
+**Content suite** (6 probes): sovereignty probes testing model behaviour on sensitive topics — political bias, cultural sensitivity, factual grounding, source attribution, opinion vs fact distinction, regional awareness.
+
+**Standard suite** (4 benchmarks): TruthfulQA (truthfulness), DoNotAnswer (safety refusals), Toxigen (toxicity detection), GSM8K (mathematical reasoning).
+
+**Exact suite** (GSM8K numeric): extracts numeric answers from model output and compares them against ground truth with tolerance.
+
+### 23 Capability Probes
+
+23 probes across 16 categories, covering: reasoning, mathematics, coding, instruction following, multilingual, summarisation, creative writing, factual recall, safety, ethics, roleplay, context length, tool use, multimodal description, structured output, and chain-of-thought.
+
+### InfluxDB Integration
+
+- Endpoint: `10.69.69.165:8181`
+- Database: `training`
+- Protocol: Line protocol writes (hand-rolled, no official client)
+- Purpose: Streaming checkpoint scores during agent evaluation runs
+
+### Data Pipeline
+
+DuckDB for local analytics storage, Parquet for columnar I/O, InfluxDB for time-series streaming. The GGUF converter handles MLX LoRA to GGUF tensor name mapping for model format conversion.
+
+## go-inference Gap
+
+This is the critical finding driving Phase 1.
+
+**go-ml has**: `ml.Backend` interface where `Generate` returns `(string, error)`. Callback-based streaming via `StreamingBackend`.
+
+**go-inference has**: `TextModel` interface where `Generate` returns `iter.Seq[Token]`. Iterator-based streaming (Go 1.23+ range-over-func).
+
+**Gap**: No adapter between the two. `backend_mlx.go` imports go-mlx directly (~253 LOC of manual tokenisation, KV cache, and sampling) instead of using go-inference, which wraps all of that. This means:
+1. MLX backend duplicates logic that go-inference already provides
+2. Other backends (HTTP, Llama) cannot benefit from go-inference's unified interface
+3. Scoring engine is locked to the legacy string-return interface
+
+**Solution**: Write an `InferenceAdapter` bridging `go-inference.TextModel` to `ml.Backend`, then rewrite `backend_mlx.go` to use go-inference. This is Phase 1 in TODO.md.
+
+## Known Issues
+
+- **backend_mlx.go imports go-mlx directly** — Should go through go-inference. ~253 LOC that collapse to ~60 LOC after migration.
+- **agent.go is too large** — 1,070 LOC handling SSH, InfluxDB, scoring orchestration, and result publishing. Decomposition candidate.
+- **Hardcoded infrastructure** — InfluxDB endpoint (`10.69.69.165:8181`) and M3 SSH details are baked into agent.go. Should be configurable.
+- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists for backends.
+- **score.go concurrency untested** — Semaphore-bounded worker pool has no race condition tests.

diff --git a/TODO.md b/TODO.md
index fb794cb..0c2cd89 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,27 +1,46 @@
# TODO.md — go-ml Task Queue

-## Phase 1: Post-Split Hardening
+## Phase 1: go-inference Migration

-- [ ] **Verify tests pass standalone** — Run `go test ./...`. Confirm all 6 test files pass (backend_http, exact, heuristic, judge, probes, score).
-- [ ] **agent.go audit** — 1,070 LOC is the largest file. Review for decomposition opportunities. May benefit from splitting into agent_config.go, agent_execute.go, agent_eval.go.
-- [ ] **Backend interface docs** — Add godoc examples showing how to implement a custom Backend.
+The big one. `backend_mlx.go` needs rewriting to use `go-inference.TextModel` instead of direct go-mlx imports. This collapses ~253 LOC to ~60 LOC.

-## Phase 2: Scoring Improvements
+- [ ] **Rewrite backend_mlx.go** — Replace direct go-mlx calls with go-inference's TextModel. The current implementation manually handles tokenisation, KV cache, sampling, and token decoding. go-inference wraps all of that behind `TextModel.Generate()` returning `iter.Seq[Token]`.
+- [ ] **HTTPBackend go-inference wrapper** — HTTPBackend should implement `go-inference.Backend` or wrap it. Currently returns `(string, error)` from Generate; needs an adapter that yields `iter.Seq[Token]` from SSE streams.
+- [ ] **LlamaBackend go-inference wrapper** — Same treatment as HTTPBackend. llama-server already supports SSE streaming; the adapter reads the stream and yields tokens.
+- [ ] **Bridge ml.Backend to go-inference** — The old `ml.Backend` interface (`Generate` returns `string`, not `iter.Seq[Token]`) needs a bridging adapter.
Write an `InferenceAdapter` that wraps `go-inference.TextModel` and collects tokens into a string for the legacy interface.

-- [ ] **Benchmark scoring suites** — No benchmarks exist. Add: BenchmarkHeuristic, BenchmarkJudge, BenchmarkExact for various input sizes.
-- [ ] **Probe coverage** — Audit probes.go for completeness against OWASP LLM Top 10 and ethics guidelines.
-- [ ] **Scoring pipeline metrics** — Track time-per-suite, pass/fail rates, aggregated scores over time.
+
+## Phase 2: Backend Consolidation

-## Phase 3: Backend Enhancements
+
+`StreamingBackend` and `go-inference.TextModel` overlap. Reconcile them: go-inference is the standard; `ml.Backend` is legacy.

-- [ ] **Backend registry** — Currently backends are created ad-hoc. Add a registry pattern for discovery and configuration.
-- [ ] **Health checks** — Backends should expose health status (model loaded, GPU available, API reachable).
-- [ ] **Retry with backoff** — HTTP backend should retry on transient failures with exponential backoff.
+- [ ] **Audit StreamingBackend usage** — Find all callers of `GenerateStream`/`ChatStream`. Determine which can migrate directly to `iter.Seq[Token]`.
+- [ ] **Migration path** — Keep both interfaces temporarily. Add a `BackendAdapter` that wraps go-inference.TextModel and satisfies both `ml.Backend` and `StreamingBackend`.
+- [ ] **Deprecate StreamingBackend** — Once all callers use go-inference iterators, mark StreamingBackend as deprecated. Remove in a later phase.
+- [ ] **Unify GenOpts** — `ml.GenOpts` and `go-inference.GenerateOptions` likely overlap. Consolidate into one options struct or add conversion helpers.
+
+## Phase 3: Agent Loop Modernisation
+
+`agent.go` (1,070 LOC) is the largest file, mixing SSH checkpoint discovery, InfluxDB streaming, and scoring orchestration. It needs splitting into smaller files.
+
+- [ ] **Split agent.go** — Decompose into: `agent_config.go` (SSH/infra config), `agent_execute.go` (scoring run orchestration), `agent_eval.go` (result evaluation and publishing), `agent_influx.go` (InfluxDB streaming).
+- [ ] **Abstract SSH transport** — M3 homelab SSH may change to Linux. Extract SSH checkpoint discovery into an interface so the transport layer is swappable.
+- [ ] **InfluxDB client modernisation** — Current line protocol writes are hand-rolled. Evaluate using the official InfluxDB Go client library.
+- [ ] **Configurable endpoints** — The hardcoded `10.69.69.165:8181` and M3 SSH details should come from config/environment, not constants.
+
+## Phase 4: Test Coverage
+
+`backend_http_test` exists, but `backend_llama` and `backend_mlx` have no tests. `score.go` concurrency needs race condition tests.
+
+- [ ] **backend_llama_test.go** — Mock the llama-server subprocess. Test: model loading, prompt formatting, streaming, error recovery, process lifecycle.
+- [ ] **backend_mlx_test.go** — Mock go-mlx (or go-inference after Phase 1). Test: darwin/arm64 gating, Metal availability check, generation flow, tokeniser errors.
+- [ ] **score.go race tests** — Run `go test -race ./...`. Add concurrent scoring tests: multiple suites running simultaneously, semaphore boundary conditions, context cancellation mid-score.
+- [ ] **Benchmark suite** — Add `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes. No benchmarks exist currently.

---

-## Workflow
+## Standing: Workflow

1. Virgil in core/go writes tasks here after research
2. This repo's session picks up tasks in phase order
3. Mark `[x]` when done, note commit hash
+4. Phase 1 is the critical path — everything else builds on the go-inference migration