docs: add TODO.md, FINDINGS.md for fleet delegation

Co-Authored-By: Virgil <virgil@lethean.io>
Virgil 2026-02-19 21:33:53 +00:00 committed by Snider
parent 7c8e762a86
commit 88e926cc24
2 changed files with 105 additions and 25 deletions

FINDINGS.md

### Origin
Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only.
### What Was Extracted
- 41 Go files (~7,494 LOC excluding tests)
- 6 test files (backend_http, exact, heuristic, judge, probes, score)
### Key Finding: Heaviest Package
- ml/ was 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).
### Dependencies
- `forge.lthn.ai/core/go` — Framework services, process management, logging
- `github.com/marcboeker/go-duckdb` — Analytics storage
- `github.com/parquet-go/parquet-go` — Columnar data I/O
- `github.com/stretchr/testify` — Test assertions
### Consumers
- `go-ai/test-mlx.go` — Integration test utility
- LEM Lab — Uses MLXBackend for chat inference
## Architecture
At 1,070 LOC, agent.go is the largest file. It orchestrates:
- Multi-model scoring runs
- Remote M3 infrastructure scheduling
- Ethics-aware probe evaluation
- Result consolidation and publishing
This file is a decomposition candidate but functional as-is.
### Backend Interface
```go
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
	Name() string
	Available() bool
}

type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}
```
Key design: `Backend.Generate` returns `string`, not `iter.Seq[Token]`. `StreamingBackend` adds token callbacks but is still callback-based, not iterator-based.
### Scoring Engine
Concurrent scoring with semaphore-bounded workers. `Engine` fans out suites across goroutines, collects results.
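A minimal sketch of that fan-out shape, with illustrative types rather than the Engine's real API:

```go
package main

import (
	"fmt"
	"sync"
)

// result is a stand-in for the engine's per-suite output.
type result struct {
	suite string
	score float64
}

// runSuites fans suites out across goroutines while a buffered-channel
// semaphore bounds how many run at once, then collects all results.
func runSuites(suites []string, maxWorkers int, run func(string) float64) []result {
	sem := make(chan struct{}, maxWorkers) // bounds concurrent suites
	out := make(chan result, len(suites))
	var wg sync.WaitGroup
	for _, s := range suites {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it
			out <- result{suite: s, score: run(s)}
		}(s)
	}
	wg.Wait()
	close(out)
	var results []result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	rs := runSuites([]string{"heuristic", "semantic", "exact"}, 2,
		func(s string) float64 { return float64(len(s)) })
	fmt.Println(len(rs))
}
```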
**Heuristic suite** (9 metrics): refusal detection, length ratio, repetition, coherence, instruction following, format compliance, language match, confidence calibration, response diversity.
**Semantic suite** (4 dimensions): LLM-as-judge scoring across helpfulness, accuracy, harmlessness, and reasoning quality.
**Content suite** (6 probes): sovereignty probes testing model behaviour on sensitive topics — political bias, cultural sensitivity, factual grounding, source attribution, opinion vs fact distinction, regional awareness.
**Standard suite** (4 benchmarks): TruthfulQA (truthfulness), DoNotAnswer (safety refusals), Toxigen (toxicity detection), GSM8K (mathematical reasoning).
**Exact suite** (GSM8K numeric): Extracts numeric answers from model output and compares against ground truth with tolerance.
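A hedged sketch of how that numeric grading can work — the exact suite's real extraction rules may be stricter:

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
	"strings"
)

// Matches integers or decimals, with optional thousands separators and sign.
var numRe = regexp.MustCompile(`-?\d+(?:,\d{3})*(?:\.\d+)?`)

// extractAnswer pulls the last number out of the model's output, on the
// assumption the final number is the answer.
func extractAnswer(output string) (float64, bool) {
	matches := numRe.FindAllString(output, -1)
	if len(matches) == 0 {
		return 0, false
	}
	last := strings.ReplaceAll(matches[len(matches)-1], ",", "")
	v, err := strconv.ParseFloat(last, 64)
	return v, err == nil
}

// gradeExact compares the extracted answer against ground truth within tol.
func gradeExact(output string, truth, tol float64) bool {
	v, ok := extractAnswer(output)
	return ok && math.Abs(v-truth) <= tol
}

func main() {
	fmt.Println(gradeExact("So the total is 1,234 apples.", 1234, 1e-6))
}
```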
### 23 Capability Probes
16 categories covering: reasoning, mathematics, coding, instruction following, multilingual, summarisation, creative writing, factual recall, safety, ethics, roleplay, context length, tool use, multimodal description, structured output, and chain-of-thought.
### InfluxDB Integration
- Endpoint: `10.69.69.165:8181`
- Database: `training`
- Protocol: Line protocol writes (hand-rolled, no official client)
- Purpose: Streaming checkpoint scores during agent evaluation runs
### Data Pipeline
DuckDB for local analytics storage, Parquet for columnar I/O, InfluxDB for time-series streaming. GGUF converter handles MLX LoRA to GGUF tensor name mapping for model format conversion.
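To make the tensor name mapping concrete, here is a purely illustrative sketch — both name patterns are assumptions, not the converter's actual mapping table:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical MLX-LoRA name pattern: model.layers.<n>.self_attn.q_proj.lora_<a|b>
var layerRe = regexp.MustCompile(`^model\.layers\.(\d+)\.self_attn\.q_proj\.lora_([ab])$`)

// mapTensorName rewrites an MLX-style LoRA tensor name into a GGUF-style
// one (blk.<n>.attn_q...). Illustrative only; the real converter covers
// many more projections and architectures.
func mapTensorName(mlx string) (string, bool) {
	if m := layerRe.FindStringSubmatch(mlx); m != nil {
		return fmt.Sprintf("blk.%s.attn_q.weight.lora_%s", m[1], m[2]), true
	}
	return "", false
}

func main() {
	name, ok := mapTensorName("model.layers.0.self_attn.q_proj.lora_a")
	fmt.Println(name, ok)
}
```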
## go-inference Gap
This is the critical finding driving Phase 1.
**go-ml has**: `ml.Backend` interface where `Generate` returns `(string, error)`. Callback-based streaming via `StreamingBackend`.
**go-inference has**: `TextModel` interface where `Generate` returns `iter.Seq[Token]`. Iterator-based streaming (Go 1.23+ range-over-func).
**Gap**: No adapter between the two. `backend_mlx.go` imports go-mlx directly (~253 LOC of manual tokenisation, KV cache, sampling) instead of using go-inference which wraps all of that. This means:
1. MLX backend duplicates logic that go-inference already provides
2. Other backends (HTTP, Llama) cannot benefit from go-inference's unified interface
3. Scoring engine is locked to the legacy string-return interface
**Solution**: Write `InferenceAdapter` bridging `go-inference.TextModel` to `ml.Backend`, then rewrite `backend_mlx.go` to use go-inference. This is Phase 1 in TODO.md.
## Known Issues
- **backend_mlx.go imports go-mlx directly** — Should go through go-inference. ~253 LOC that collapses to ~60 LOC after migration.
- **agent.go is too large** — 1,070 LOC handling SSH, InfluxDB, scoring orchestration, and result publishing. Decomposition candidate.
- **Hardcoded infrastructure** — InfluxDB endpoint (`10.69.69.165:8181`), M3 SSH details baked into agent.go. Should be configurable.
- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists for backends.
- **score.go concurrency untested** — Semaphore-bounded worker pool has no race condition tests.

TODO.md

# TODO.md — go-ml Task Queue
## Phase 1: go-inference Migration
The big one. `backend_mlx.go` needs rewriting to use `go-inference.TextModel` instead of direct go-mlx imports. This collapses ~253 LOC to ~60 LOC.
- [ ] **Rewrite backend_mlx.go** — Replace direct go-mlx calls with go-inference TextModel. The current implementation manually handles tokenisation, KV cache, sampling, and token decoding. go-inference wraps all of that behind `TextModel.Generate()` returning `iter.Seq[Token]`.
- [ ] **HTTPBackend go-inference wrapper** — HTTPBackend should implement `go-inference.Backend` or wrap it. Currently returns `(string, error)` from Generate; needs an adapter that yields `iter.Seq[Token]` from SSE streams.
- [ ] **LlamaBackend go-inference wrapper** — Same treatment as HTTPBackend. llama-server already supports SSE streaming; the adapter reads the stream and yields tokens.
- [ ] **Bridge ml.Backend to go-inference** — The old `ml.Backend` interface (`Generate` returns `string`, not `iter.Seq[Token]`) needs a bridging adapter. Write `InferenceAdapter` that wraps `go-inference.TextModel` and collects tokens into a string for the legacy interface.
## Phase 2: Backend Consolidation
`StreamingBackend` and `go-inference.TextModel` overlap. Reconcile them: go-inference is the standard; `ml.Backend` is legacy.
- [ ] **Audit StreamingBackend usage** — Find all callers of `GenerateStream`/`ChatStream`. Determine which can migrate directly to `iter.Seq[Token]`.
- [ ] **Migration path** — Keep both interfaces temporarily. Add `BackendAdapter` that wraps go-inference.TextModel and satisfies both `ml.Backend` and `StreamingBackend`.
- [ ] **Deprecate StreamingBackend** — Once all callers use go-inference iterators, mark StreamingBackend as deprecated. Remove in a later phase.
- [ ] **Unify GenOpts** — `ml.GenOpts` and `go-inference.GenerateOptions` likely overlap. Consolidate into one options struct or add conversion helpers.
## Phase 3: Agent Loop Modernisation
`agent.go` (1,070 LOC) is the largest file, mixing SSH checkpoint discovery, InfluxDB streaming, and scoring orchestration. It needs splitting into smaller files.
- [ ] **Split agent.go** — Decompose into: `agent_config.go` (SSH/infra config), `agent_execute.go` (scoring run orchestration), `agent_eval.go` (result evaluation and publishing), `agent_influx.go` (InfluxDB streaming).
- [ ] **Abstract SSH transport** — M3 homelab SSH may change to Linux. Extract SSH checkpoint discovery into an interface so the transport layer is swappable.
- [ ] **InfluxDB client modernisation** — Current line protocol writes are hand-rolled. Evaluate using the official InfluxDB Go client library.
- [ ] **Configurable endpoints** — Hardcoded `10.69.69.165:8181` and M3 SSH details should come from config/environment, not constants.
## Phase 4: Test Coverage
`backend_http_test` exists but `backend_llama` and `backend_mlx` have no tests. `score.go` concurrency needs race condition tests.
- [ ] **backend_llama_test.go** — Mock llama-server subprocess. Test: model loading, prompt formatting, streaming, error recovery, process lifecycle.
- [ ] **backend_mlx_test.go** — Mock go-mlx (or go-inference after Phase 1). Test: darwin/arm64 gating, Metal availability check, generation flow, tokeniser errors.
- [ ] **score.go race tests** — Run `go test -race ./...`. Add concurrent scoring tests: multiple suites running simultaneously, semaphore boundary conditions, context cancellation mid-score.
- [ ] **Benchmark suite** — Add `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes. No benchmarks exist currently.
---
## Standing: Workflow
1. Virgil in core/go writes tasks here after research
2. This repo's session picks up tasks in phase order
3. Mark `[x]` when done, note commit hash
4. Phase 1 is the critical path — everything else builds on go-inference migration