# go-ml Architecture
## Overview
`forge.lthn.ai/core/go-ml` is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from `go-ai` on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.
The package provides three distinct subsystems:
1. **Pluggable inference backends** — a common `Backend` interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
2. **Multi-suite scoring engine** — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
3. **Agent orchestrator** — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.
---
## Dependency Graph
```
forge.lthn.ai/core/go-ml
├── forge.lthn.ai/core/go-inference   (shared TextModel/Token interfaces)
│   └── (no further Core deps)
├── forge.lthn.ai/core/go-mlx         (Metal GPU inference, darwin/arm64 only)
│   └── forge.lthn.ai/core/go-inference
├── forge.lthn.ai/core/go             (ServiceRuntime, process, log)
├── github.com/marcboeker/go-duckdb   (analytics storage)
└── github.com/parquet-go/parquet-go  (columnar data I/O)
```
### Role of each dependency
| Module | Purpose |
|--------|---------|
| `go-inference` | Zero-dependency shared interfaces. Defines `TextModel`, `Token`, `Backend`, `GenerateConfig`. Compiles on all platforms. |
| `go-mlx` | Native Metal GPU inference for Apple Silicon. Registers the `"metal"` backend via its `init()` function. Active only on `darwin && arm64`. |
| `go` | Core framework. Provides `ServiceRuntime`, lifecycle hooks (`OnStartup`/`OnShutdown`), process management, and structured logging. |
| `go-duckdb` | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
| `parquet-go` | Columnar Parquet I/O for bulk dataset export and import. |
---
## Backend Architecture
Two interface families coexist within go-ml, connected by a set of adapters.
### The `ml.Backend` interface (compatibility layer)
```go
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
	Name() string
	Available() bool
}
```
`Backend` returns complete strings. It is the primary interface consumed by `service.go`, `judge.go`, `agent_eval.go`, and `expand.go`. All three concrete backend types — `HTTPBackend`, `LlamaBackend`, and `InferenceAdapter` — satisfy this interface.
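As a minimal sketch of what satisfying this interface involves, the example below defines a hypothetical `EchoBackend` test double. `Message` and `GenOpts` here are trimmed stand-ins for the real `ml` types (the `Role` and `Content` field names are assumptions), and `EchoBackend` is not part of go-ml.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Message and GenOpts are trimmed stand-ins for ml.Message and ml.GenOpts.
// The Role/Content field names are assumptions made for this sketch.
type Message struct {
	Role    string
	Content string
}

type GenOpts struct {
	Temperature float64
	MaxTokens   int
	Model       string
}

// Backend repeats the interface from this section.
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
	Name() string
	Available() bool
}

// EchoBackend is a hypothetical test double that echoes its input.
type EchoBackend struct{}

// Compile-time check that EchoBackend satisfies Backend.
var _ Backend = EchoBackend{}

func (EchoBackend) Generate(ctx context.Context, prompt string, _ GenOpts) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err // honour cancellation before doing any work
	}
	return "echo: " + prompt, nil
}

func (b EchoBackend) Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error) {
	// Reply to the most recent user message.
	for i := len(messages) - 1; i >= 0; i-- {
		if messages[i].Role == "user" {
			return b.Generate(ctx, messages[i].Content, opts)
		}
	}
	return "", errors.New("echo: no user message")
}

func (EchoBackend) Name() string    { return "echo" }
func (EchoBackend) Available() bool { return true }

func main() {
	var b Backend = EchoBackend{}
	msgs := []Message{{Role: "system", Content: "be brief"}, {Role: "user", Content: "hello"}}
	out, err := b.Chat(context.Background(), msgs, GenOpts{MaxTokens: 16})
	fmt.Println(b.Name(), out, err)
}
```

A double like this is enough to exercise code that only depends on `Backend`, such as a judge or prompt expander, without a running model server.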
### The `inference.TextModel` interface (preferred for new code)
Defined in `go-inference`, this interface returns `iter.Seq[inference.Token]` — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use `TextModel`.
### `ml.StreamingBackend` (deprecated)
```go
// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}
```
Only two files in `host-uk/cli` call `StreamingBackend` methods. It is retained for backward compatibility; no new code should use it.
### Type unification
`ml.Message` is a type alias for `inference.Message`:
```go
type Message = inference.Message
```
The two types are identical at compile time. No conversion is needed when passing messages between the `ml` and `inference` packages.
`ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model selection:
```go
type GenOpts struct {
	Temperature   float64
	MaxTokens     int
	Model         string // per-request model override; ignored by GPU backends
	TopK          int
	TopP          float64
	RepeatPenalty float64
}
```
---
## Backend Implementations
### HTTPBackend (`backend_http.go`)
Speaks the OpenAI-compatible `/v1/chat/completions` API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).
- Implements `ml.Backend` only (no streaming — returns complete response strings).
- Retries up to 3 times with exponential backoff on 5xx and connection errors.
- 300-second HTTP client timeout suitable for long-running inference.
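The retry behaviour described above can be sketched roughly as follows. This is an illustration, not the code in `backend_http.go`: the `retry` helper, the `errRetryable` sentinel, the base delay, and the reading of "up to 3 times" as three attempts in total are all assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errRetryable marks failures worth retrying (5xx responses, connection errors).
var errRetryable = errors.New("retryable")

// retry calls fn up to attempts times, doubling the delay after each
// retryable failure. Non-retryable errors are returned immediately.
func retry(ctx context.Context, attempts int, base time.Duration, fn func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !errors.Is(err, errRetryable) {
			return err
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

// runFlaky simulates a server that fails the first `failures` calls and
// reports how many calls were made in total.
func runFlaky(failures int) (int, error) {
	calls := 0
	err := retry(context.Background(), 3, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return fmt.Errorf("status 503: %w", errRetryable)
		}
		return nil
	})
	return calls, err
}

func main() {
	fmt.Println(runFlaky(2)) // succeeds on the third attempt
	fmt.Println(runFlaky(5)) // gives up after three attempts
}
```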
### LlamaBackend (`backend_llama.go`)
Manages a `llama-server` subprocess and delegates HTTP calls to an embedded `HTTPBackend`.
- Implements `ml.Backend`.
- `Start()` launches the subprocess and polls the `/health` endpoint for up to 30 seconds.
- `Stop()` kills the managed process via the Core `process.Service`.
- Supports optional LoRA adapter loading via `--lora`.
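The health-polling step in `Start()` might look something like the sketch below. The `waitHealthy`, `httpCheck`, and `pollUntil` names, the polling interval, and the example URL are assumptions for illustration; only the `/health` endpoint and the 30-second budget come from the description above.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitHealthy calls check every interval until it reports true or the
// timeout elapses. The first check runs immediately.
func waitHealthy(ctx context.Context, timeout, interval time.Duration, check func(context.Context) bool) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if check(ctx) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("not healthy after %s: %w", timeout, ctx.Err())
		case <-ticker.C:
		}
	}
}

// httpCheck returns a check that treats 200 OK from url as healthy.
func httpCheck(url string) func(context.Context) bool {
	return func(ctx context.Context) bool {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return false
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return false // e.g. connection refused while the server is starting
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK
	}
}

// pollUntil simulates a server that becomes healthy on the n-th check and
// reports how many checks ran.
func pollUntil(n, timeoutMillis int) (int, error) {
	checks := 0
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	err := waitHealthy(context.Background(), timeout, time.Millisecond, func(context.Context) bool {
		checks++
		return checks >= n
	})
	return checks, err
}

func main() {
	fmt.Println(pollUntil(3, 1000)) // healthy on the third check
	// Against a real server (the URL is an assumption):
	//   waitHealthy(ctx, 30*time.Second, 500*time.Millisecond,
	//       httpCheck("http://127.0.0.1:8080/health"))
}
```

Passing the check as a function keeps the polling loop testable without a subprocess or a network listener.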
### InferenceAdapter (`adapter.go`)
Bridges a `go-inference.TextModel` (iterator-based) into the `ml.Backend` and `ml.StreamingBackend` interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.
```
inference.TextModel (iter.Seq[Token])
    └─── InferenceAdapter ───► ml.Backend (string)
                          ───► ml.StreamingBackend (TokenCallback)
```
Key behaviours:
- `Generate` and `Chat` collect all tokens into a `strings.Builder` and return the concatenated string. After the iterator is exhausted, `model.Err()` is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
- `GenerateStream` and `ChatStream` forward each token's text to the provided `TokenCallback`. If the callback returns an error, iteration stops.
- `Available()` always returns `true` — the model is already loaded when the adapter is constructed.
- `Close()` delegates to `TextModel.Close()`, releasing GPU memory.
- `InspectAttention()` delegates to the underlying `TextModel` via type assertion to `inference.AttentionInspector`. Returns an error if the backend doesn't support attention inspection. This enables LEM's Q/K Bone Orientation analysis through the adapter without consumers needing to unwrap the underlying model.
### MLX Backend (`backend_mlx.go`, darwin/arm64 only)
```go
//go:build darwin && arm64

func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath, loadOpts...)
	// ...
	return NewInferenceAdapter(m, "mlx"), nil
}
```
The blank import `_ "forge.lthn.ai/core/go-mlx"` triggers go-mlx's `init()`, which registers the `"metal"` backend with go-inference's backend registry. Subsequent calls to `inference.LoadModel()` automatically use Metal GPU acceleration on Apple Silicon.
`modelPath` may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's `internal/metal/` package.
### Reverse adapters (`backend_http_textmodel.go`)
Two types wrap `ml` backends as `inference.TextModel`, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. `go-ai`, `go-i18n`).
| Type | Wraps | Notes |
|------|-------|-------|
| `HTTPTextModel` | `*HTTPBackend` | Yields the full HTTP response as a single `Token`. Classify returns an unsupported error. BatchGenerate processes sequentially. |
| `LlamaTextModel` | `*LlamaBackend` | Embeds `HTTPTextModel`; overrides `ModelType()` to return `"llama"` and `Close()` to call `llama.Stop()`. |
### Adapter map (all directions)
```
ml.Backend (string) <──── InferenceAdapter ──── inference.TextModel (iter.Seq[Token])
                          (adapter.go)
ml.HTTPBackend  ──── HTTPTextModel  ────► inference.TextModel
ml.LlamaBackend ──── LlamaTextModel ────► inference.TextModel
                     (backend_http_textmodel.go)
```
---
## Service Layer (`service.go`)
`Service` integrates go-ml into the Core framework lifecycle:
```go
core.New(
	framework.WithName("ml", ml.NewService(ml.Options{
		OllamaURL:   "http://localhost:11434",
		JudgeURL:    "http://localhost:11434",
		JudgeModel:  "qwen3:8b",
		Concurrency: 4,
		Suites:      "all",
	})),
)
```
`OnStartup` registers the Ollama backend and initialises the `Judge` and scoring `Engine` if a judge URL is configured. Backends can also be registered at runtime via `RegisterBackend(name, backend)`.
---
## Scoring Engine
### Engine (`score.go`)
`Engine.ScoreAll()` evaluates a slice of `Response` values across all configured suites concurrently.
```
ScoreAll(responses []Response) map[string][]PromptScore
├── Heuristic (inline, no goroutine)
└── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)
```
The worker pool is bounded by a semaphore channel of capacity `concurrency`. `sync.WaitGroup` coordinates completion. Results are written to pre-allocated score slots via pointer to avoid allocations during fan-out.
Suites are selected at engine construction time via a comma-separated string or `"all"`.
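A semaphore-bounded fan-out of this kind can be sketched as follows. The types are reduced stand-ins, the `scoreAll` and `maxInFlight` names are hypothetical, and the result is flattened to a slice rather than the `map[string][]PromptScore` the real `ScoreAll` returns.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Response and PromptScore are reduced stand-ins for the ml types.
type Response struct{ ID, Text string }

type PromptScore struct {
	ID    string
	Score float64
}

// scoreAll runs score for every response with at most `concurrency` calls in
// flight. Each goroutine writes to its own pre-allocated slot, so no mutex
// is needed around the results slice.
func scoreAll(responses []Response, concurrency int, score func(Response) float64) []PromptScore {
	if concurrency < 1 {
		concurrency = 1 // an unbuffered semaphore would block forever
	}
	results := make([]PromptScore, len(responses))
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, r := range responses {
		wg.Add(1)
		sem <- struct{}{} // blocks while all worker slots are busy
		go func(slot *PromptScore, r Response) {
			defer wg.Done()
			defer func() { <-sem }()
			*slot = PromptScore{ID: r.ID, Score: score(r)}
		}(&results[i], r)
	}
	wg.Wait()
	return results
}

// maxInFlight measures the peak number of concurrent score calls.
func maxInFlight(n, concurrency int) int64 {
	var cur, peak atomic.Int64
	scoreAll(make([]Response, n), concurrency, func(Response) float64 {
		c := cur.Add(1)
		for {
			p := peak.Load()
			if c <= p || peak.CompareAndSwap(p, c) {
				break
			}
		}
		time.Sleep(time.Millisecond) // stand-in for a judge round trip
		cur.Add(-1)
		return 0
	})
	return peak.Load()
}

func main() {
	rs := []Response{{ID: "a", Text: "xx"}, {ID: "b", Text: "yyy"}}
	fmt.Println(scoreAll(rs, 2, func(r Response) float64 { return float64(len(r.Text)) }))
	fmt.Println("peak in flight:", maxInFlight(20, 4))
}
```

Acquiring the semaphore before starting the goroutine also bounds the number of goroutines alive at once, not only the number doing work.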
### Heuristic scoring (`heuristic.go`)
Analyses a response using pre-compiled regular expressions. No LLM is needed.
Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score:
```
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
- ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
```
**Positive signals**
| Sub-score | What it measures |
|-----------|-----------------|
| `EngagementDepth` | Structural markers (headings, bold), ethical vocabulary, technical depth, word count |
| `CreativeForm` | Poetry structure (short lines), narrative openings, metaphor density |
| `EmotionalRegister` | Emotional vocabulary (feel, grief, compassion, etc.) |
| `FirstPerson` | Sentences beginning with "I" or containing first-person agency verbs |
**Negative signals**
| Sub-score | What it measures |
|-----------|-----------------|
| `ComplianceMarkers` | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") |
| `FormulaicPreamble` | Opener templates ("Sure, let's...", "Great question") |
| `Degeneration` | Sentence repetition ratio (looping/stuck output) |
| `EmptyBroken` | Empty, error-prefixed, or pad-token-polluted responses |
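For reference, the LEK formula translates directly into code. The struct below is a sketch covering only the eight sub-scores that appear in the formula; the `float64` field types and the `LEK` method name are assumptions rather than the definitions in `heuristic.go`.

```go
package main

import "fmt"

// HeuristicScores holds the sub-scores named in the LEK formula.
// Field types are an assumption made for this sketch.
type HeuristicScores struct {
	EngagementDepth   float64
	CreativeForm      float64
	EmotionalRegister float64
	FirstPerson       float64
	ComplianceMarkers float64
	FormulaicPreamble float64
	Degeneration      float64
	EmptyBroken       float64
}

// LEK applies the weights from the formula: positive signals add to the
// score, negative signals subtract from it.
func (h HeuristicScores) LEK() float64 {
	return h.EngagementDepth*2 + h.CreativeForm*3 + h.EmotionalRegister*2 + h.FirstPerson*1.5 -
		h.ComplianceMarkers*5 - h.FormulaicPreamble*3 - h.Degeneration*4 - h.EmptyBroken*20
}

func main() {
	h := HeuristicScores{
		EngagementDepth:   3,
		CreativeForm:      1,
		EmotionalRegister: 2,
		FirstPerson:       2,
		ComplianceMarkers: 1,
	}
	// 3*2 + 1*3 + 2*2 + 2*1.5 - 1*5 = 11
	fmt.Println(h.LEK())

	// A single EmptyBroken hit outweighs all of the positive signals above.
	h.EmptyBroken = 1
	fmt.Println(h.LEK())
}
```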
### Judge (`judge.go`)
`Judge` uses any `Backend` as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response.
```go
judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b"))
scores, err := judge.ScoreSemantic(ctx, prompt, response)
```
JSON extraction (`extractJSON`) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences.
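One way to cover all three cases with a single routine is to scan for the first balanced `{...}` span that parses as JSON. The sketch below takes that approach; it is not the `extractJSON` in `judge.go`, whose strategy is not described here, and the `matchBrace` helper and the `sovereignty` key are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extractJSON returns the first balanced {...} span in s that is valid JSON,
// or "" if there is none. Surrounding prose and code fences are ignored
// because scanning starts at the first '{'.
func extractJSON(s string) string {
	start := strings.IndexByte(s, '{')
	for start >= 0 {
		if end := matchBrace(s, start); end > start && json.Valid([]byte(s[start:end+1])) {
			return s[start : end+1]
		}
		next := strings.IndexByte(s[start+1:], '{')
		if next < 0 {
			return ""
		}
		start += next + 1
	}
	return ""
}

// matchBrace returns the index of the '}' that closes the '{' at s[open],
// or -1. Braces inside JSON string literals are ignored.
func matchBrace(s string, open int) int {
	depth, inString, escaped := 0, false, false
	for i := open; i < len(s); i++ {
		c := s[i]
		switch {
		case escaped:
			escaped = false
		case inString && c == '\\':
			escaped = true
		case c == '"':
			inString = !inString
		case inString:
			// ordinary character inside a string literal
		case c == '{':
			depth++
		case c == '}':
			depth--
			if depth == 0 {
				return i
			}
		}
	}
	return -1
}

func main() {
	fence := strings.Repeat("`", 3) // a markdown code fence
	inputs := []string{
		`{"sovereignty": 7}`,
		`Here are the scores: {"sovereignty": 7} as requested.`,
		fence + "json\n{\"sovereignty\": 7}\n" + fence,
		"no object here",
	}
	for _, in := range inputs {
		fmt.Printf("%q\n", extractJSON(in))
	}
}
```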
Six scoring methods are available:
| Method | Suite | Dimensions |
|--------|-------|-----------|
| `ScoreSemantic` | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept |
| `ScoreContent` | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister |
| `ScoreCapability` | (agent) | Reasoning, Correctness, Clarity |
| `ScoreTruthfulQA` | standard | Truthfulness, Informativeness |
| `ScoreDoNotAnswer` | standard | Safety, Nuance |
| `ScoreToxigen` | standard | Kindness, Awareness |
### Exact match (`exact.go`)
`scoreGSM8K` extracts numeric answers from free-text responses using pattern matching. Returns `*StandardScores` with `Correct`, `Extracted`, and `Expected` fields. No LLM required.
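A rough sketch of this kind of numeric extraction is shown below. The two patterns (a GSM8K-style `#### <number>` marker, falling back to the last number in the text), the field types of `StandardScores`, and the function names are assumptions; the actual patterns in `exact.go` may differ.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
	"strings"
)

// StandardScores mirrors the three fields named above; the types are assumed.
type StandardScores struct {
	Correct   bool
	Extracted string
	Expected  string
}

var (
	// "#### 42" is the answer marker used in GSM8K reference solutions.
	hashAnswer = regexp.MustCompile(`####\s*(-?\d[\d,]*(?:\.\d+)?)`)
	// Any integer or decimal, optionally with thousands separators.
	anyNumber = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)
)

// extractNumber prefers an explicit "####" answer and otherwise falls back
// to the last number mentioned in the response.
func extractNumber(response string) string {
	if m := hashAnswer.FindStringSubmatch(response); m != nil {
		return strings.ReplaceAll(m[1], ",", "")
	}
	all := anyNumber.FindAllString(response, -1)
	if len(all) == 0 {
		return ""
	}
	return strings.ReplaceAll(all[len(all)-1], ",", "")
}

// scoreGSM8KSketch compares the extracted and expected answers numerically,
// so "7" and "7.0" count as equal.
func scoreGSM8KSketch(response, expected string) *StandardScores {
	got := extractNumber(response)
	s := &StandardScores{Extracted: got, Expected: expected}
	g, errGot := strconv.ParseFloat(got, 64)
	e, errExp := strconv.ParseFloat(strings.ReplaceAll(expected, ",", ""), 64)
	s.Correct = errGot == nil && errExp == nil && math.Abs(g-e) < 1e-6
	return s
}

func main() {
	fmt.Printf("%+v\n", *scoreGSM8KSketch("3 + 4 = 7.\n#### 7", "7"))
	fmt.Printf("%+v\n", *scoreGSM8KSketch("She pays $1,234.50 in total.", "1234.5"))
	fmt.Printf("%+v\n", *scoreGSM8KSketch("I am not sure.", "12"))
}
```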
### Capability probes (`probes.go`)
23 binary pass/fail tests across five categories. Each probe is a `Prompt` string paired with a `Check func(response string) bool`. No judge model is required: all checks use string matching or regex on the raw response.
| Category | Probes | Examples |
|----------|--------|---------|
| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | `347×29`, circle area, Fibonacci |
| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions |
| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug |
| Word problems (2) | word | speed/distance, sibling counting |
`StripThinkBlocks()` removes `<think>...</think>` sections from DeepSeek R1 responses before checking.
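To make the probe shape concrete, here is a sketch of one arithmetic probe together with a think-block stripper. The `ID` and `Category` fields, the lowercase `stripThinkBlocks` helper, and the exact prompt wording are assumptions; only the `Prompt`/`Check` pairing and the `347×29` example come from the text above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Probe pairs a prompt with a pass/fail check on the raw response.
// ID and Category are hypothetical bookkeeping fields.
type Probe struct {
	ID       string
	Category string
	Prompt   string
	Check    func(response string) bool
}

// (?s) lets '.' match newlines; '.*?' keeps each match to a single block.
var thinkBlock = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripThinkBlocks removes reasoning sections so that checks only see the
// final answer.
func stripThinkBlocks(s string) string {
	return strings.TrimSpace(thinkBlock.ReplaceAllString(s, ""))
}

// multiplyProbe passes when the final answer contains 10063 (347 × 29),
// with or without a thousands separator.
var multiplyProbe = Probe{
	ID:       "math_multiply",
	Category: "math",
	Prompt:   "What is 347 × 29? Give only the number.",
	Check: func(response string) bool {
		answer := strings.ReplaceAll(stripThinkBlocks(response), ",", "")
		return strings.Contains(answer, "10063")
	},
}

func main() {
	good := "<think>347*30 = 10410, minus 347...</think>The answer is 10,063."
	bad := "<think>maybe 10063?</think>I am not sure."
	fmt.Println(multiplyProbe.Check(good), multiplyProbe.Check(bad))
}
```

Stripping the think block first matters: in the second response the correct number appears only in the model's scratch work, so the probe fails.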
---
## Agent Orchestrator
The agent subsystem (`agent_*.go`) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3").
### Files
| File | LOC | Responsibility |
|------|-----|---------------|
| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` |
| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `FindUnscored`, `ProcessOne` |
| `agent_eval.go` | 397 | MLX-native and conversion evaluation paths, capability and content probe runners |
| `agent_influx.go` | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay |
| `agent_ssh.go` | 102 | `RemoteTransport` interface, `SSHTransport` implementation, utility helpers |
### Workflow
```
RunAgentLoop
├── ReplayInfluxBuffer   (flush any buffered writes from previous failures)
├── DiscoverCheckpoints  (SSH ls on M3 adapter directories)
├── GetScoredLabels      (InfluxDB query for already-scored (run_id, label) pairs)
├── FindUnscored         (set difference, sorted by dirname + iteration)
└── ProcessOne           (for each unscored checkpoint)
    ├── isMLXNative? YES → processMLXNative      (serve directly via mlx_lm.server)
    │                NO  → processWithConversion (MLX→GGUF, then llama-server)
    ├── RunCapabilityProbes    (23 binary probes)
    ├── RunContentProbes       (sovereignty probes)
    ├── ScoreCapabilityAndPush (judge + InfluxDB)
    └── ScoreContentAndPush    (judge + InfluxDB)
```
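The `FindUnscored` step in the workflow above is a set difference followed by a sort. A minimal sketch, assuming a reduced `Checkpoint` struct and a hypothetical `scoredKey` type for the `(run_id, label)` pairs:

```go
package main

import (
	"fmt"
	"sort"
)

// Checkpoint is a reduced stand-in for the type in agent_config.go.
type Checkpoint struct {
	RunID     string
	Label     string
	Dirname   string
	Iteration int
}

// scoredKey identifies a checkpoint that already has results stored.
type scoredKey struct{ RunID, Label string }

// findUnscored returns the checkpoints whose (RunID, Label) pair is not in
// scored, ordered by directory name and then by iteration.
func findUnscored(all []Checkpoint, scored map[scoredKey]bool) []Checkpoint {
	var out []Checkpoint
	for _, c := range all {
		if !scored[scoredKey{c.RunID, c.Label}] {
			out = append(out, c)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Dirname != out[j].Dirname {
			return out[i].Dirname < out[j].Dirname
		}
		return out[i].Iteration < out[j].Iteration
	})
	return out
}

func main() {
	// Run IDs, labels, and directory names are made up for the example.
	all := []Checkpoint{
		{RunID: "r2", Label: "B@200", Dirname: "adapters-b", Iteration: 200},
		{RunID: "r1", Label: "A@300", Dirname: "adapters-a", Iteration: 300},
		{RunID: "r1", Label: "A@100", Dirname: "adapters-a", Iteration: 100},
		{RunID: "r1", Label: "A@200", Dirname: "adapters-a", Iteration: 200},
	}
	scored := map[scoredKey]bool{{RunID: "r1", Label: "A@200"}: true}
	for _, c := range findUnscored(all, scored) {
		fmt.Println(c.Dirname, c.Iteration)
	}
}
```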
### RemoteTransport
`RemoteTransport` abstracts SSH/SCP so that tests can supply an in-memory fake:
```go
type RemoteTransport interface {
	Run(ctx context.Context, cmd string) (string, error)
	CopyFrom(ctx context.Context, remote, local string) error
	CopyTo(ctx context.Context, local, remote string) error
}
```
`SSHTransport` implements this interface using the system `ssh` and `scp` binaries with a configurable port and timeout. `AgentConfig.Transport` is lazily initialised: if nil, an `SSHTransport` is constructed from `M3Host`, `M3User`, and `M3SSHKey`.
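An in-memory fake for tests might look like the sketch below. The `fakeTransport` here is illustrative and not necessarily the one used in `agent_test.go`; the canned directory names and the `listAdapterDirs` and `demo` helpers are made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// RemoteTransport repeats the interface from this section.
type RemoteTransport interface {
	Run(ctx context.Context, cmd string) (string, error)
	CopyFrom(ctx context.Context, remote, local string) error
	CopyTo(ctx context.Context, local, remote string) error
}

// fakeTransport is a hypothetical in-memory RemoteTransport for tests.
type fakeTransport struct {
	outputs map[string]string // canned stdout keyed by exact command
	log     []string          // every call, in order
}

// Compile-time check that the fake satisfies the interface.
var _ RemoteTransport = (*fakeTransport)(nil)

func (f *fakeTransport) Run(_ context.Context, cmd string) (string, error) {
	f.log = append(f.log, "run "+cmd)
	out, ok := f.outputs[cmd]
	if !ok {
		return "", fmt.Errorf("fake: unexpected command %q", cmd)
	}
	return out, nil
}

func (f *fakeTransport) CopyFrom(_ context.Context, remote, local string) error {
	f.log = append(f.log, "copy-from "+remote+" "+local)
	return nil
}

func (f *fakeTransport) CopyTo(_ context.Context, local, remote string) error {
	f.log = append(f.log, "copy-to "+local+" "+remote)
	return nil
}

// listAdapterDirs stands in for code under test: it sees only the interface.
func listAdapterDirs(ctx context.Context, t RemoteTransport) ([]string, error) {
	out, err := t.Run(ctx, "ls -d adapters-*")
	if err != nil {
		return nil, err
	}
	return strings.Fields(out), nil
}

// demo wires the fake into listAdapterDirs and returns the result and log.
func demo() ([]string, []string, error) {
	fake := &fakeTransport{outputs: map[string]string{
		"ls -d adapters-*": "adapters-1b\nadapters-27b\n",
	}}
	dirs, err := listAdapterDirs(context.Background(), fake)
	return dirs, fake.log, err
}

func main() {
	dirs, log, err := demo()
	fmt.Println(dirs, log, err)
}
```

Recording every call in `log` lets a test assert on the exact commands issued, which is often more useful than asserting on the parsed result alone.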
### Checkpoint discovery
`DiscoverCheckpoints` runs `ls -d adapters-*` on the remote host, then for each adapter directory checks for subdirectories matching `gemma-3-*` (supporting nested directory layouts). It then lists `*_adapters.safetensors` files and extracts the iteration number from the filename.
`AdapterMeta` maps a directory name to a `(model_tag, label_prefix, run_id_stem)` triple using prefix matching against `ModelFamilies`.
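Extracting the iteration number from a checkpoint filename could be done along these lines. The sketch assumes filenames carry a zero-padded iteration prefix such as `0000200_adapters.safetensors`; that layout, the example paths, and the `iterationOf` and `iterations` helpers are assumptions, not taken from `agent_execute.go`.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"sort"
	"strconv"
)

// adapterFile matches "<digits>_adapters.safetensors" and captures the digits.
var adapterFile = regexp.MustCompile(`^(\d+)_adapters\.safetensors$`)

// iterationOf extracts the iteration number from a remote checkpoint path.
// path (not path/filepath) is used because remote paths are POSIX-style.
func iterationOf(remotePath string) (int, bool) {
	m := adapterFile.FindStringSubmatch(path.Base(remotePath))
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1]) // leading zeros are accepted
	if err != nil {
		return 0, false
	}
	return n, true
}

// iterations returns the sorted iteration numbers found in a file listing,
// skipping entries that are not numbered checkpoints.
func iterations(paths []string) []int {
	var out []int
	for _, p := range paths {
		if n, ok := iterationOf(p); ok {
			out = append(out, n)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	listing := []string{
		"adapters-1b/gemma-3-1b/0000400_adapters.safetensors",
		"adapters-1b/gemma-3-1b/0000200_adapters.safetensors",
		"adapters-1b/gemma-3-1b/adapters.safetensors", // final weights, no iteration
		"adapters-1b/gemma-3-1b/adapter_config.json",
	}
	fmt.Println(iterations(listing))
}
```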
### Persistence
Results are written to two stores simultaneously:
- **InfluxDB** — line protocol over HTTP. Five measurements: `capability_score`, `capability_judge`, `content_score`, `probe_score`, `training_loss`.
- **DuckDB** — local analytical database. Two tables: `checkpoint_scores`, `probe_results`.
If InfluxDB is unreachable, results are buffered to `influx_buffer.jsonl` (JSONL, one entry per line). `ReplayInfluxBuffer` is called at the start of each loop iteration to flush the buffer.
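The buffer-and-replay behaviour can be sketched with an in-memory buffer standing in for `influx_buffer.jsonl`. The `bufferedWriter` type, the one-field JSON entry format, and the tag names in the sample line are assumptions; the real buffer schema is not described here.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

var errUnreachable = errors.New("influxdb unreachable")

// entry is one buffered write. The schema is an assumption for this sketch.
type entry struct {
	Line string `json:"line"` // a single line-protocol record
}

// bufferedWriter pushes lines to InfluxDB and falls back to a JSONL buffer.
type bufferedWriter struct {
	push   func(line string) error // e.g. an HTTP POST of line protocol
	buffer bytes.Buffer            // stands in for influx_buffer.jsonl
}

// Write tries a direct push and buffers the line if the push fails.
func (w *bufferedWriter) Write(line string) error {
	if err := w.push(line); err == nil {
		return nil
	}
	b, err := json.Marshal(entry{Line: line})
	if err != nil {
		return err
	}
	w.buffer.Write(append(b, '\n'))
	return nil
}

// Replay retries every buffered entry, keeps the ones that still fail, and
// returns how many were flushed.
func (w *bufferedWriter) Replay() (int, error) {
	pending := w.buffer.String()
	w.buffer.Reset()
	flushed := 0
	sc := bufio.NewScanner(strings.NewReader(pending))
	for sc.Scan() {
		var e entry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // drop corrupt lines rather than block the loop
		}
		if err := w.push(e.Line); err != nil {
			w.buffer.Write(sc.Bytes())
			w.buffer.WriteByte('\n')
			continue
		}
		flushed++
	}
	return flushed, sc.Err()
}

func main() {
	down := true
	w := &bufferedWriter{push: func(line string) error {
		if down {
			return errUnreachable
		}
		fmt.Println("pushed:", line)
		return nil
	}}
	// Tag and field names are illustrative.
	_ = w.Write("capability_score,run_id=r1,label=A@200 accuracy=0.87")
	fmt.Println("buffered bytes:", w.buffer.Len())
	down = false
	fmt.Println(w.Replay())
}
```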
---
## Data Pipeline
| File | Purpose |
|------|---------|
| `ingest.go` | Load JSONL response files into `[]Response` slices |
| `db.go` | DuckDB schema creation, insert, and query helpers |
| `influx.go` | InfluxDB HTTP client (line protocol write, SQL query) |
| `gguf.go` | GGUF file format parsing (magic, version, metadata, tensor inventory) |
| `worker.go` | LEM API worker for distributed inference job dispatch |
| `expand.go` | Prompt expansion using a backend |
| `normalize.go` | Response normalisation utilities |
| `parquet.go` | Parquet dataset export |
---
## Test Coverage
| File | Tests | What is covered |
|------|-------|----------------|
| `adapter_test.go` | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors |
| `backend_http_test.go` | — | HTTPBackend: generate, chat, retries, status codes |
| `backend_http_textmodel_test.go` | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, classify, batch |
| `backend_llama_test.go` | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants |
| `backend_mlx_test.go` | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion |
| `heuristic_test.go` | — | All nine heuristic sub-scores and LEK formula |
| `judge_test.go` | — | JSON extraction variants, ScoreSemantic, ScoreContent |
| `exact_test.go` | — | Numeric extraction patterns |
| `probes_test.go` | — | All 23 capability probe Check functions |
| `score_test.go` | — | Engine suite selection, ScoreAll grouping |
| `score_race_test.go` | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes |
| `agent_test.go` | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport |
| `benchmark_test.go` | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), ScoreAll (2 modes), heuristic sub-components (5 stages) |