# TODO.md — go-ml Task Queue
Dispatched from Virgil in core/go. Pick up tasks in phase order.
## Phase 1: go-inference Migration (CRITICAL PATH)
Everything downstream is blocked on this. The old backend_mlx.go imports go-mlx subpackages that no longer exist after Phase 4 refactoring.
### Step 1.1: Add go-inference dependency
- [ ] Add `forge.lthn.ai/core/go-inference` to go.mod — go.mod already has a `replace` directive pointing to `../go-inference`. Run `go get forge.lthn.ai/core/go-inference`, then `go mod tidy`. Verify the module resolves.
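Assuming sibling checkouts as described above, the relevant go.mod lines would look roughly like this (the go-ml module path and the placeholder version are assumptions):

```
module forge.lthn.ai/core/go-ml

require forge.lthn.ai/core/go-inference v0.0.0

// Local development: resolve against the sibling checkout.
replace forge.lthn.ai/core/go-inference => ../go-inference
```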
### Step 1.2: Write the InferenceAdapter
- [ ] Create `adapter.go` — Bridge between `go-inference.TextModel` (returns `iter.Seq[Token]`) and `ml.Backend` + `ml.StreamingBackend` (returns `string`/callback). Must implement:
  - `Generate()` — collect tokens from the iterator into a string
  - `Chat()` — same, using `TextModel.Chat()`
  - `GenerateStream()` — forward tokens to `TokenCallback`
  - `ChatStream()` — same for chat
  - `Name()` — delegate to `TextModel.ModelType()`
  - `Available()` — always true (model already loaded)
  - `convertOpts(GenOpts) []inference.GenerateOption` — map `GenOpts` fields to functional options

  Key mapping:

  ```
  GenOpts.Temperature → inference.WithTemperature(float32(t))
  GenOpts.MaxTokens   → inference.WithMaxTokens(n)
  GenOpts.Model       → (ignored, model already loaded)
  ```

  Error handling: after the iterator completes, check `model.Err()` to distinguish EOS from errors (OOM, ctx cancelled).
- [x] Test adapter.go — 13 test cases with mock TextModel (all pass). Test cases:
  - Normal generation (collect tokens → string)
  - Streaming (each token hits callback)
  - Callback error stops iteration
  - Context cancellation propagates
  - Empty output (EOS immediately)
  - Model error after partial output
### Step 1.3: Rewrite backend_mlx.go
- [x] Replace backend_mlx.go — Deleted the 253 LOC that manually handled tokenisation, KV cache, sampling, and memory cleanup. Replaced with ~35 LOC:

  ```go
  //go:build darwin && arm64

  package ml

  import (
      "fmt"

      "forge.lthn.ai/core/go-inference"
      _ "forge.lthn.ai/core/go-mlx" // registers "metal" backend
  )

  func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
      m, err := inference.LoadModel(modelPath)
      if err != nil {
          return nil, fmt.Errorf("mlx: %w", err)
      }
      return &InferenceAdapter{model: m, name: "mlx"}, nil
  }
  ```

  The `InferenceAdapter` from Step 1.2 handles all the Generate/Chat/Stream logic.
- [ ] Preserve memory controls — The old `MLXBackend` set cache/memory limits (16GB/24GB). Now delegated to go-mlx internally. Callers can still use `mlx.SetCacheLimit()`/`mlx.SetMemoryLimit()` directly. Options for the future:
  - Accept memory limits in `NewMLXBackend` params
  - Or set them in the `InferenceAdapter` wrapper
  - go-mlx exposes `SetCacheLimit()`/`SetMemoryLimit()` at package level
- [ ] Test backend_mlx.go — Verify the new backend can:
  - Load a model via the go-inference registry
  - Generate text (smoke test, requires model on disk)
  - Stream tokens via callback
  - Handle the Metal availability check (build tag gating)
### Step 1.4: HTTPBackend and LlamaBackend wrappers
- [x] HTTPBackend go-inference wrapper — `backend_http_textmodel.go`: `HTTPTextModel` wraps `HTTPBackend` to implement `inference.TextModel`. Generate/Chat yield the entire response as a single Token. Classify returns an unsupported error. BatchGenerate processes prompts sequentially. 17 tests pass.
- [x] LlamaBackend go-inference wrapper — `backend_http_textmodel.go`: `LlamaTextModel` embeds `HTTPTextModel`, overrides `ModelType()` → "llama" and `Close()` → `llama.Stop()`. 2 tests pass.
### Step 1.5: Verify downstream consumers
- [x] Service.Generate() still works — `service.go` calls `Backend.Generate()`. InferenceAdapter satisfies ml.Backend; HTTPBackend/LlamaBackend still implement ml.Backend directly. No changes needed.
- [x] Judge still works — `judge.go` calls `Backend.Generate()` via `judgeChat()`. Same Backend contract, works as before. No changes needed.
- [x] go-ai tools_ml.go — Uses `ml.Service` directly. The `ml.Backend` interface is preserved, so no code changes are needed in go-ai.
## Phase 2: Backend Consolidation
After Phase 1, both `ml.Backend` (string-based) and `inference.TextModel` (iterator-based) coexist. This phase reconciles them.
### Audit Results (Virgil, 20 Feb 2026)
StreamingBackend callers — only 2 files, both in host-uk/cli:

- `cmd/ml/cmd_serve.go` lines 146, 201, 319: type-asserts `backend.(ml.StreamingBackend)` for SSE streaming at `/v1/completions` and `/v1/chat/completions`
- `cmd/ml/cmd_chat.go`: direct `ChatStream()` call for interactive terminal token echo
All other consumers (service.go, judge.go, agent.go, expand.go, go-ai tools_ml.go) use Backend.Generate() — NOT streaming.
Backend implementations:

- `InferenceAdapter` → implements Backend + StreamingBackend (via go-inference `iter.Seq`)
- `HTTPBackend` → implements Backend only (no streaming)
- `LlamaBackend` → implements Backend only (no streaming)
### Step 2.1: Unify Message types
- [ ] Type alias ml.Message → inference.Message — In `inference.go`, replace the `Message` struct with:

  ```go
  type Message = inference.Message
  ```

  This is backward-compatible — all existing callers keep working. Remove the `convertMessages()` helper from `adapter.go` since the types are now identical. Verify with `go build ./...` and `go test ./...`.
### Step 2.2: Unify GenOpts
- [ ] Add inference fields to GenOpts — Extend `ml.GenOpts` to include the extra fields from `inference.GenerateConfig`:

  ```go
  type GenOpts struct {
      Temperature   float64
      MaxTokens     int
      Model         string  // override model for this request
      TopK          int     // NEW: from inference.GenerateConfig
      TopP          float64 // NEW: from inference.GenerateConfig (float64 to match Temperature)
      RepeatPenalty float64 // NEW: from inference.GenerateConfig
  }
  ```

  Update `convertOpts()` in adapter.go to map the new fields. Existing callers that only set Temperature/MaxTokens/Model continue working unchanged.
### Step 2.3: Deprecate StreamingBackend
- [ ] Mark StreamingBackend as deprecated — Add a deprecation comment:

  ```go
  // Deprecated: StreamingBackend is retained for backward compatibility.
  // New code should use inference.TextModel with iter.Seq[Token] directly.
  // See InferenceAdapter for the bridge pattern.
  type StreamingBackend interface { ... }
  ```

  Do NOT remove it yet — `host-uk/cli` cmd_serve.go and cmd_chat.go still depend on it. Those migrations are out of scope for go-ml (they live in a different repo).
### Step 2.4: Document migration path
- [ ] Update CLAUDE.md — Add a "Backend Architecture" section documenting:
  - `inference.TextModel` (iterator-based) is the preferred API for new code
  - `ml.Backend` (string-based) is the compatibility layer, still supported
  - `StreamingBackend` is deprecated; use `iter.Seq[Token]` directly
  - `InferenceAdapter` bridges TextModel → Backend/StreamingBackend
  - `HTTPTextModel`/`LlamaTextModel` bridge Backend → TextModel (reverse direction)
## Phase 3: Agent Loop Modernisation
agent.go (1,070 LOC) is the largest file. Decompose it.
- [ ] Split agent.go — Into: `agent_config.go` (config, model maps), `agent_execute.go` (run loop, checkpoint processing), `agent_eval.go` (probe evaluation, result publishing), `agent_influx.go` (InfluxDB streaming, JSONL buffer).
- [ ] Abstract SSH transport — Extract SSH checkpoint discovery into an interface. The current M3 homelab SSH host may change to Linux (go-rocm).
- [ ] Configurable endpoints — `10.69.69.165:8181` and the M3 SSH details are hardcoded. Move them to config/environment.
- [ ] InfluxDB client — Hand-rolled line protocol. Evaluate the official InfluxDB Go client.
## Phase 4: Test Coverage
- [ ] backend_llama_test.go — Mock the llama-server subprocess. Test: model loading, health checks, process lifecycle.
- [ ] backend_mlx_test.go — After the Phase 1 rewrite, test with a mock go-inference TextModel.
- [ ] score.go race tests — `go test -race ./...`. Concurrent scoring, semaphore boundaries, context cancellation.
- [ ] Benchmark suite — `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes.
## Workflow
- Virgil in core/go writes tasks here after research
- This repo's session picks up tasks in phase order
- Mark `[x]` when done, note the commit hash
- New discoveries → add tasks, note them in FINDINGS.md
- Push to forge after each completed step: `git push forge main`