All 4 steps done in commit 747e703:
- Step 2.1: Message type alias (inference.go, adapter.go)
- Step 2.2: GenOpts extension with TopK/TopP/RepeatPenalty
- Step 2.3: StreamingBackend deprecation
- Step 2.4: CLAUDE.md Backend Architecture section
Co-Authored-By: Virgil <virgil@lethean.io>
179 lines
8.6 KiB
Markdown
179 lines
8.6 KiB
Markdown
# TODO.md — go-ml Task Queue
|
|
|
|
Dispatched from Virgil in core/go. Pick up tasks in phase order.
|
|
|
|
---
|
|
|
|
## Phase 1: go-inference Migration (CRITICAL PATH)
|
|
|
|
Everything downstream is blocked on this. The old `backend_mlx.go` imports go-mlx subpackages that no longer exist after Phase 4 refactoring.
|
|
|
|
### Step 1.1: Add go-inference dependency
|
|
|
|
- [x] **Add `forge.lthn.ai/core/go-inference` to go.mod** — Already has a `replace` directive pointing to `../go-inference`. Run `go get forge.lthn.ai/core/go-inference` then `go mod tidy`. Verify the module resolves.
|
|
|
|
### Step 1.2: Write the InferenceAdapter
|
|
|
|
- [x] **Create `adapter.go`** — Bridge between `go-inference.TextModel` (returns `iter.Seq[Token]`) and `ml.Backend` + `ml.StreamingBackend` (returns `string`/callback). Must implement:
|
|
- `Generate()` — collect tokens from iterator into string
|
|
- `Chat()` — same, using `TextModel.Chat()`
|
|
- `GenerateStream()` — forward tokens to `TokenCallback`
|
|
- `ChatStream()` — same for chat
|
|
- `Name()` — delegate to `TextModel.ModelType()`
|
|
- `Available()` — always true (model already loaded)
|
|
- `convertOpts(GenOpts) []inference.GenerateOption` — map `GenOpts` fields to functional options
|
|
|
|
**Key mapping**:
|
|
```
|
|
GenOpts.Temperature → inference.WithTemperature(float32(t))
|
|
GenOpts.MaxTokens → inference.WithMaxTokens(n)
|
|
GenOpts.Model → (ignored, model already loaded)
|
|
```
|
|
|
|
**Error handling**: After the iterator completes, check `model.Err()` to distinguish EOS from errors (OOM, ctx cancelled).
|
|
|
|
- [x] **Test adapter.go** — 13 test cases with mock TextModel (all pass). Test cases:
|
|
- Normal generation (collect tokens → string)
|
|
- Streaming (each token hits callback)
|
|
- Callback error stops iteration
|
|
- Context cancellation propagates
|
|
- Empty output (EOS immediately)
|
|
- Model error after partial output
|
|
|
|
### Step 1.3: Rewrite backend_mlx.go
|
|
|
|
- [x] **Replace backend_mlx.go** — Deleted the 253 LOC that manually handle tokenisation, KV cache, sampling, and memory cleanup. Replaced with ~35 LOC:
|
|
```go
|
|
//go:build darwin && arm64
|
|
|
|
package ml
|
|
|
|
import (
|
|
"forge.lthn.ai/core/go-inference"
|
|
_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend
|
|
)
|
|
|
|
func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
|
|
m, err := inference.LoadModel(modelPath)
|
|
if err != nil {
|
|
return nil, fmt.Errorf("mlx: %w", err)
|
|
}
|
|
return &InferenceAdapter{model: m, name: "mlx"}, nil
|
|
}
|
|
```
|
|
The `InferenceAdapter` from Step 1.2 handles all the Generate/Chat/Stream logic.
|
|
|
|
- [ ] **Preserve memory controls** — The old `MLXBackend` set cache/memory limits (16GB/24GB). Now delegated to go-mlx internally. Callers can still use `mlx.SetCacheLimit()`/`mlx.SetMemoryLimit()` directly. Options for future:
|
|
- Accept memory limits in `NewMLXBackend` params
|
|
- Or set them in `InferenceAdapter` wrapper
|
|
- go-mlx exposes `SetCacheLimit()` / `SetMemoryLimit()` at package level
|
|
|
|
- [ ] **Test backend_mlx.go** — Verify the new backend can:
|
|
- Load a model via go-inference registry
|
|
- Generate text (smoke test, requires model on disk)
|
|
- Stream tokens via callback
|
|
- Handle Metal availability check (build tag gating)
|
|
|
|
### Step 1.4: HTTPBackend and LlamaBackend wrappers
|
|
|
|
- [x] **HTTPBackend go-inference wrapper** — `backend_http_textmodel.go`: `HTTPTextModel` wraps `HTTPBackend` to implement `inference.TextModel`. Generate/Chat yield entire response as single Token. Classify returns unsupported error. BatchGenerate processes prompts sequentially. 17 tests pass.
|
|
|
|
- [x] **LlamaBackend go-inference wrapper** — `backend_http_textmodel.go`: `LlamaTextModel` embeds `HTTPTextModel`, overrides `ModelType()` -> "llama" and `Close()` -> `llama.Stop()`. 2 tests pass.
|
|
|
|
### Step 1.5: Verify downstream consumers
|
|
|
|
- [x] **Service.Generate() still works** — `service.go` calls `Backend.Generate()`. InferenceAdapter satisfies ml.Backend. HTTPBackend/LlamaBackend still implement ml.Backend directly. No changes needed.
|
|
- [x] **Judge still works** — `judge.go` calls `Backend.Generate()` via `judgeChat()`. Same Backend contract, works as before. No changes needed.
|
|
- [x] **go-ai tools_ml.go** — Uses `ml.Service` directly. `ml.Backend` interface is preserved, no code changes needed in go-ai.
|
|
|
|
---
|
|
|
|
## Phase 2: Backend Consolidation
|
|
|
|
After Phase 1, both `ml.Backend` (string) and `inference.TextModel` (iterator) coexist. Reconcile.
|
|
|
|
### Audit Results (Virgil, 20 Feb 2026)
|
|
|
|
**StreamingBackend callers** — Only 2 files in `host-uk/cli`:
|
|
- `cmd/ml/cmd_serve.go` lines 146,201,319: Type-asserts `backend.(ml.StreamingBackend)` for SSE streaming at `/v1/completions` and `/v1/chat/completions`
|
|
- `cmd/ml/cmd_chat.go`: Direct `ChatStream()` call for interactive terminal token echo
|
|
|
|
All other consumers (service.go, judge.go, agent.go, expand.go, go-ai tools_ml.go) use `Backend.Generate()` — NOT streaming.
|
|
|
|
**Backend implementations**:
|
|
- `InferenceAdapter` → implements Backend + StreamingBackend (via go-inference iter.Seq)
|
|
- `HTTPBackend` → implements Backend only (no streaming)
|
|
- `LlamaBackend` → implements Backend only (no streaming)
|
|
|
|
### Step 2.1: Unify Message types
|
|
|
|
- [x] **Type alias ml.Message → inference.Message** — In `inference.go`, replace the `Message` struct with:
|
|
```go
|
|
type Message = inference.Message
|
|
```
|
|
This is backward-compatible — all existing callers keep working. Remove the `convertMessages()` helper from `adapter.go` since types are now identical. Verify with `go build ./...` and `go test ./...`.
|
|
|
|
### Step 2.2: Unify GenOpts
|
|
|
|
- [x] **Add inference fields to GenOpts** — Extend `ml.GenOpts` to include the extra fields from `inference.GenerateConfig`:
|
|
```go
|
|
type GenOpts struct {
|
|
Temperature float64
|
|
MaxTokens int
|
|
Model string // override model for this request
|
|
TopK int // NEW: from inference.GenerateConfig
|
|
TopP float64 // NEW: from inference.GenerateConfig (float64 to match Temperature)
|
|
RepeatPenalty float64 // NEW: from inference.GenerateConfig
|
|
}
|
|
```
|
|
Update `convertOpts()` in adapter.go to map the new fields. Existing callers that only set Temperature/MaxTokens/Model continue working unchanged.
|
|
|
|
### Step 2.3: Deprecate StreamingBackend
|
|
|
|
- [x] **Mark StreamingBackend as deprecated** — Add deprecation comment:
|
|
```go
|
|
// Deprecated: StreamingBackend is retained for backward compatibility.
|
|
// New code should use inference.TextModel with iter.Seq[Token] directly.
|
|
// See InferenceAdapter for the bridge pattern.
|
|
type StreamingBackend interface { ... }
|
|
```
|
|
Do NOT remove yet — `host-uk/cli` cmd_serve.go and cmd_chat.go still depend on it. Those migrations are out of scope for go-ml (they live in a different repo).
|
|
|
|
### Step 2.4: Document migration path
|
|
|
|
- [x] **Update CLAUDE.md** — Add "Backend Architecture" section documenting:
|
|
- `inference.TextModel` (iterator-based) is the preferred API for new code
|
|
- `ml.Backend` (string-based) is the compatibility layer, still supported
|
|
- `StreamingBackend` is deprecated, use `iter.Seq[Token]` directly
|
|
- `InferenceAdapter` bridges TextModel → Backend/StreamingBackend
|
|
- `HTTPTextModel`/`LlamaTextModel` bridges Backend → TextModel (reverse direction)
|
|
|
|
---
|
|
|
|
## Phase 3: Agent Loop Modernisation
|
|
|
|
`agent.go` (1,070 LOC) is the largest file. Decompose.
|
|
|
|
- [ ] **Split agent.go** — Into: `agent_config.go` (config, model maps), `agent_execute.go` (run loop, checkpoint processing), `agent_eval.go` (probe evaluation, result publishing), `agent_influx.go` (InfluxDB streaming, JSONL buffer).
|
|
- [ ] **Abstract SSH transport** — Extract SSH checkpoint discovery into interface. Current M3 homelab SSH may change to Linux (go-rocm).
|
|
- [ ] **Configurable endpoints** — `10.69.69.165:8181` and M3 SSH details hardcoded. Move to config/environment.
|
|
- [ ] **InfluxDB client** — Hand-rolled line protocol. Evaluate official InfluxDB Go client.
|
|
|
|
---
|
|
|
|
## Phase 4: Test Coverage
|
|
|
|
- [ ] **backend_llama_test.go** — Mock llama-server subprocess. Test: model loading, health checks, process lifecycle.
|
|
- [ ] **backend_mlx_test.go** — After Phase 1 rewrite, test with mock go-inference TextModel.
|
|
- [ ] **score.go race tests** — `go test -race ./...`. Concurrent scoring, semaphore boundaries, context cancellation.
|
|
- [ ] **Benchmark suite** — `BenchmarkHeuristic`, `BenchmarkJudge`, `BenchmarkExact` for various input sizes.
|
|
|
|
---
|
|
|
|
## Workflow
|
|
|
|
1. Virgil in core/go writes tasks here after research
|
|
2. This repo's session picks up tasks in phase order
|
|
3. Mark `[x]` when done, note commit hash
|
|
4. New discoveries → add tasks, note in FINDINGS.md
|
|
5. Push to forge after each completed step: `git push forge main`
|