docs: add TODO.md, FINDINGS.md for fleet delegation

Co-Authored-By: Virgil <virgil@lethean.io>

parent 7c8e762a86
commit 88e926cc24

2 changed files with 105 additions and 25 deletions

FINDINGS.md (85 changed lines)

@@ -4,16 +4,13 @@
 ### Origin

-Extracted from `forge.lthn.ai/core/go-ai/ml/`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only.
+Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only.

 ### What Was Extracted

 - 41 Go files (~7,494 LOC excluding tests)
 - 6 test files (backend_http, exact, heuristic, judge, probes, score)

 ### Key Finding: Heaviest Package

-ml/ is 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).
+- ml/ was 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).

 ### Dependencies
@@ -21,6 +18,7 @@ ml/ is 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4
+- `forge.lthn.ai/core/go` — Framework services, process management, logging
 - `github.com/marcboeker/go-duckdb` — Analytics storage
 - `github.com/parquet-go/parquet-go` — Columnar data I/O
 - `github.com/stretchr/testify` — Test assertions

 ### Consumers
@@ -28,12 +26,75 @@ ml/ is 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4
 - `go-ai/test-mlx.go` — Integration test utility
 - LEM Lab — Uses MLXBackend for chat inference

-### Architecture Note: agent.go
-
-At 1,070 LOC, agent.go is the largest file. It orchestrates:
-- Multi-model scoring runs
-- Remote M3 infrastructure scheduling
-- Ethics-aware probe evaluation
-- Result consolidation and publishing
-
-This file is a decomposition candidate but functional as-is.
+## Architecture
+
+### Backend Interface
+
+```go
+type Backend interface {
+	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
+	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
+	Name() string
+	Available() bool
+}
+
+type StreamingBackend interface {
+	Backend
+	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
+	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
+}
+```
+
+Key design: `Backend.Generate` returns `string`, not `iter.Seq[Token]`. `StreamingBackend` adds token callbacks but is still callback-based, not iterator-based.

+### Scoring Engine
+
+Concurrent scoring with semaphore-bounded workers. `Engine` fans out suites across goroutines, collects results.
+
+**Heuristic suite** (9 metrics): refusal detection, length ratio, repetition, coherence, instruction following, format compliance, language match, confidence calibration, response diversity.
+
+**Semantic suite** (4 dimensions): LLM-as-judge scoring across helpfulness, accuracy, harmlessness, and reasoning quality.
+
+**Content suite** (6 probes): sovereignty probes testing model behaviour on sensitive topics — political bias, cultural sensitivity, factual grounding, source attribution, opinion vs fact distinction, regional awareness.
+
+**Standard suite** (4 benchmarks): TruthfulQA (truthfulness), DoNotAnswer (safety refusals), Toxigen (toxicity detection), GSM8K (mathematical reasoning).
+
+**Exact suite** (GSM8K numeric): Extracts numeric answers from model output and compares against ground truth with tolerance.
+
+### 23 Capability Probes
+
+16 categories covering: reasoning, mathematics, coding, instruction following, multilingual, summarisation, creative writing, factual recall, safety, ethics, roleplay, context length, tool use, multimodal description, structured output, and chain-of-thought.
+
+### InfluxDB Integration
+
+- Endpoint: `10.69.69.165:8181`
+- Database: `training`
+- Protocol: Line protocol writes (hand-rolled, no official client)
+- Purpose: Streaming checkpoint scores during agent evaluation runs
+
+### Data Pipeline
+
+DuckDB for local analytics storage, Parquet for columnar I/O, InfluxDB for time-series streaming. GGUF converter handles MLX LoRA to GGUF tensor name mapping for model format conversion.
+
+## go-inference Gap
+
+This is the critical finding driving Phase 1.
+
+**go-ml has**: `ml.Backend` interface where `Generate` returns `(string, error)`. Callback-based streaming via `StreamingBackend`.
+
+**go-inference has**: `TextModel` interface where `Generate` returns `iter.Seq[Token]`. Iterator-based streaming (Go 1.23+ range-over-func).
+
+**Gap**: No adapter between the two. `backend_mlx.go` imports go-mlx directly (~253 LOC of manual tokenisation, KV cache, sampling) instead of using go-inference, which wraps all of that. This means:
+
+1. MLX backend duplicates logic that go-inference already provides
+2. Other backends (HTTP, Llama) cannot benefit from go-inference's unified interface
+3. Scoring engine is locked to the legacy string-return interface
+
+**Solution**: Write `InferenceAdapter` bridging `go-inference.TextModel` to `ml.Backend`, then rewrite `backend_mlx.go` to use go-inference. This is Phase 1 in TODO.md.
+
+## Known Issues
+
+- **backend_mlx.go imports go-mlx directly** — Should go through go-inference. ~253 LOC that collapse to ~60 LOC after migration.
+- **agent.go is too large** — 1,070 LOC handling SSH, InfluxDB, scoring orchestration, and result publishing. Decomposition candidate.
+- **Hardcoded infrastructure** — InfluxDB endpoint (`10.69.69.165:8181`) and M3 SSH details baked into agent.go. Should be configurable.
+- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists for backends.
+- **score.go concurrency untested** — Semaphore-bounded worker pool has no race condition tests.

TODO.md (45 changed lines)

@@ -1,27 +1,46 @@
 # TODO.md — go-ml Task Queue

-## Phase 1: Post-Split Hardening
+## Phase 1: go-inference Migration

-- [ ] **Verify tests pass standalone** — Run `go test ./...`. Confirm all 6 test files pass (backend_http, exact, heuristic, judge, probes, score).
-- [ ] **agent.go audit** — 1,070 LOC is the largest file. Review for decomposition opportunities. May benefit from splitting into agent_config.go, agent_execute.go, agent_eval.go.
-- [ ] **Backend interface docs** — Add godoc examples showing how to implement a custom Backend.
-
-## Phase 2: Scoring Improvements
+The big one. `backend_mlx.go` needs rewriting to use `go-inference.TextModel` instead of direct go-mlx imports. This collapses ~253 LOC to ~60 LOC.
+
+- [ ] **Rewrite backend_mlx.go** — Replace direct go-mlx calls with go-inference TextModel. The current implementation manually handles tokenisation, KV cache, sampling, and token decoding. go-inference wraps all of that behind `TextModel.Generate()` returning `iter.Seq[Token]`.
+- [ ] **HTTPBackend go-inference wrapper** — HTTPBackend should implement `go-inference.Backend` or wrap it. Currently returns `(string, error)` from Generate; needs an adapter that yields `iter.Seq[Token]` from SSE streams.
+- [ ] **LlamaBackend go-inference wrapper** — Same treatment as HTTPBackend. llama-server already supports SSE streaming; the adapter reads the stream and yields tokens.
+- [ ] **Bridge ml.Backend to go-inference** — The old `ml.Backend` interface (`Generate` returns `string`, not `iter.Seq[Token]`) needs a bridging adapter. Write `InferenceAdapter` that wraps `go-inference.TextModel` and collects tokens into a string for the legacy interface.

-- [ ] **Benchmark scoring suites** — No benchmarks exist. Add: BenchmarkHeuristic, BenchmarkJudge, BenchmarkExact for various input sizes.
-- [ ] **Probe coverage** — Audit probes.go for completeness against OWASP LLM Top 10 and ethics guidelines.
-- [ ] **Scoring pipeline metrics** — Track time-per-suite, pass/fail rates, aggregated scores over time.
+## Phase 2: Backend Consolidation

-## Phase 3: Backend Enhancements
+`StreamingBackend` vs `go-inference.TextModel` overlap. Reconcile: go-inference is the standard, `ml.Backend` is legacy.

-- [ ] **Backend registry** — Currently backends are created ad-hoc. Add a registry pattern for discovery and configuration.
-- [ ] **Health checks** — Backends should expose health status (model loaded, GPU available, API reachable).
-- [ ] **Retry with backoff** — HTTP backend should retry on transient failures with exponential backoff.
+- [ ] **Audit StreamingBackend usage** — Find all callers of `GenerateStream`/`ChatStream`. Determine which can migrate directly to `iter.Seq[Token]`.
+- [ ] **Migration path** — Keep both interfaces temporarily. Add `BackendAdapter` that wraps go-inference.TextModel and satisfies both `ml.Backend` and `StreamingBackend`.
+- [ ] **Deprecate StreamingBackend** — Once all callers use go-inference iterators, mark StreamingBackend as deprecated. Remove in a later phase.
+- [ ] **Unify GenOpts** — `ml.GenOpts` and `go-inference.GenerateOptions` likely overlap. Consolidate into one options struct or add conversion helpers.
+
+## Phase 3: Agent Loop Modernisation
+
+`agent.go` (1,070 LOC) is the largest file: SSH checkpoint discovery, InfluxDB streaming. Needs splitting into smaller files.
+
+- [ ] **Split agent.go** — Decompose into: `agent_config.go` (SSH/infra config), `agent_execute.go` (scoring run orchestration), `agent_eval.go` (result evaluation and publishing), `agent_influx.go` (InfluxDB streaming).
+- [ ] **Abstract SSH transport** — The M3 homelab SSH host may change to Linux. Extract SSH checkpoint discovery into an interface so the transport layer is swappable.
+- [ ] **InfluxDB client modernisation** — Current line protocol writes are hand-rolled. Evaluate using the official InfluxDB Go client library.
+- [ ] **Configurable endpoints** — Hardcoded `10.69.69.165:8181` and M3 SSH details should come from config/environment, not constants.
+
+## Phase 4: Test Coverage
+
+`backend_http_test` exists, but `backend_llama` and `backend_mlx` have no tests. `score.go` concurrency needs race condition tests.
+
+- [ ] **backend_llama_test.go** — Mock the llama-server subprocess. Test: model loading, prompt formatting, streaming, error recovery, process lifecycle.
+- [ ] **backend_mlx_test.go** — Mock go-mlx (or go-inference after Phase 1). Test: darwin/arm64 gating, Metal availability check, generation flow, tokeniser errors.
+- [ ] **score.go race tests** — Run `go test -race ./...`. Add concurrent scoring tests: multiple suites running simultaneously, semaphore boundary conditions, context cancellation mid-score.
+- [ ] **Benchmark suite** — Add `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes. No benchmarks exist currently.

 ---

-## Workflow
+## Standing: Workflow

 1. Virgil in core/go writes tasks here after research
 2. This repo's session picks up tasks in phase order
 3. Mark `[x]` when done, note commit hash
 4. Phase 1 is the critical path — everything else builds on go-inference migration