# go-ml Architecture

## Overview

`forge.lthn.ai/core/go-ml` is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from `go-ai` on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.

The package provides three distinct subsystems:

1. **Pluggable inference backends** — a common `Backend` interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
2. **Multi-suite scoring engine** — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
3. **Agent orchestrator** — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.

---

## Dependency Graph

```
forge.lthn.ai/core/go-ml
├── forge.lthn.ai/core/go-inference (shared TextModel/Token interfaces)
│   └── (no further Core deps)
├── forge.lthn.ai/core/go-mlx (Metal GPU inference, darwin/arm64 only)
│   └── forge.lthn.ai/core/go-inference
├── forge.lthn.ai/core/go (ServiceRuntime, process, log)
├── github.com/marcboeker/go-duckdb (analytics storage)
└── github.com/parquet-go/parquet-go (columnar data I/O)
```

### Role of each dependency

| Module | Purpose |
|--------|---------|
| `go-inference` | Zero-dependency shared interfaces. Defines `TextModel`, `Token`, `Backend`, `GenerateConfig`. Compiles on all platforms. |
| `go-mlx` | Native Metal GPU inference for Apple Silicon. Registers the `"metal"` backend via its `init()` function. Active only on `darwin && arm64`. |
| `go` | Core framework. Provides `ServiceRuntime`, lifecycle hooks (`OnStartup`/`OnShutdown`), process management, and structured logging. |
| `go-duckdb` | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
| `parquet-go` | Columnar Parquet I/O for bulk dataset export and import. |

---

## Backend Architecture

Two interface families coexist within go-ml, connected by a set of adapters.

### The `ml.Backend` interface (compatibility layer)

```go
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
	Name() string
	Available() bool
}
```

`Backend` returns complete strings. It is the primary interface consumed by `service.go`, `judge.go`, `agent_eval.go`, and `expand.go`. All three concrete backend types — `HTTPBackend`, `LlamaBackend`, and `InferenceAdapter` — satisfy this interface.

### The `inference.TextModel` interface (preferred for new code)

Defined in `go-inference`, this interface returns `iter.Seq[inference.Token]` — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends, where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use `TextModel`.

### `ml.StreamingBackend` (deprecated)

```go
// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}
```

Only two files in `host-uk/cli` call `StreamingBackend` methods. It is retained for backward compatibility; no new code should use it.

### Type unification

`ml.Message` is a type alias for `inference.Message`:

```go
type Message = inference.Message
```

The two types are identical at compile time. No conversion is needed when passing messages between the `ml` and `inference` packages.

`ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model selection:

```go
type GenOpts struct {
	Temperature   float64
	MaxTokens     int
	Model         string // per-request model override; ignored by GPU backends
	TopK          int
	TopP          float64
	RepeatPenalty float64
}
```

---

## Backend Implementations

### HTTPBackend (`backend_http.go`)

Speaks the OpenAI-compatible `/v1/chat/completions` API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).

- Implements `ml.Backend` only (no streaming — returns complete response strings).
- Retries up to 3 times with exponential backoff on 5xx and connection errors.
- 300-second HTTP client timeout, suitable for long-running inference.

### LlamaBackend (`backend_llama.go`)

Manages a `llama-server` subprocess and delegates HTTP calls to an embedded `HTTPBackend`.

- Implements `ml.Backend`.
- `Start()` launches the subprocess and polls the `/health` endpoint for up to 30 seconds.
- `Stop()` kills the managed process via the Core `process.Service`.
- Supports optional LoRA adapter loading via `--lora`.

### InferenceAdapter (`adapter.go`)

Bridges a `go-inference.TextModel` (iterator-based) into the `ml.Backend` and `ml.StreamingBackend` interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.

```
inference.TextModel (iter.Seq[Token])
   │
   └── InferenceAdapter ───► ml.Backend (string)
                        ───► ml.StreamingBackend (TokenCallback)
```

Key behaviours:

- `Generate` and `Chat` collect all tokens into a `strings.Builder` and return the concatenated string. After the iterator is exhausted, `model.Err()` is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
- `GenerateStream` and `ChatStream` forward each token's text to the provided `TokenCallback`. If the callback returns an error, iteration stops.
- `Available()` always returns `true` — the model is already loaded when the adapter is constructed.
- `Close()` delegates to `TextModel.Close()`, releasing GPU memory.
- `InspectAttention()` delegates to the underlying `TextModel` via type assertion to `inference.AttentionInspector`. Returns an error if the backend doesn't support attention inspection. This enables LEM's Q/K Bone Orientation analysis through the adapter without consumers needing to unwrap the underlying model.

### MLX Backend (`backend_mlx.go`, darwin/arm64 only)

```go
//go:build darwin && arm64

func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath, loadOpts...)
	// ...
	return NewInferenceAdapter(m, "mlx"), nil
}
```

The blank import `_ "forge.lthn.ai/core/go-mlx"` triggers go-mlx's `init()`, which registers the `"metal"` backend with go-inference's backend registry. Subsequent calls to `inference.LoadModel()` automatically use Metal GPU acceleration on Apple Silicon.

The model file at `modelPath` may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's `internal/metal/` package.

### Reverse adapters (`backend_http_textmodel.go`)

Two types wrap `ml` backends as `inference.TextModel`, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. `go-ai`, `go-i18n`).

| Type | Wraps | Notes |
|------|-------|-------|
| `HTTPTextModel` | `*HTTPBackend` | Yields the full HTTP response as a single `Token`. `Classify` returns an unsupported error. `BatchGenerate` processes sequentially. |
| `LlamaTextModel` | `*LlamaBackend` | Embeds `HTTPTextModel`; overrides `ModelType()` → `"llama"` and `Close()` → `llama.Stop()`. |

### Adapter map (all directions)

```
ml.Backend (string) <──── InferenceAdapter ──── inference.TextModel (iter.Seq[Token])
                          (adapter.go)

ml.HTTPBackend  ──── HTTPTextModel  ────► inference.TextModel
ml.LlamaBackend ──── LlamaTextModel ────► inference.TextModel
                     (backend_http_textmodel.go)
```

---

## Service Layer (`service.go`)

`Service` integrates go-ml into the Core framework lifecycle:

```go
core.New(
	framework.WithName("ml", ml.NewService(ml.Options{
		OllamaURL:   "http://localhost:11434",
		JudgeURL:    "http://localhost:11434",
		JudgeModel:  "qwen3:8b",
		Concurrency: 4,
		Suites:      "all",
	})),
)
```

`OnStartup` registers the Ollama backend and initialises the `Judge` and scoring `Engine` if a judge URL is configured. Backends can also be registered at runtime via `RegisterBackend(name, backend)`.

---

## Scoring Engine

### Engine (`score.go`)

`Engine.ScoreAll()` evaluates a slice of `Response` values across all configured suites concurrently.

```
ScoreAll(responses []Response) map[string][]PromptScore
   │
   ├── Heuristic (inline, no goroutine)
   └── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)
```

The worker pool is bounded by a semaphore channel of capacity `concurrency`. `sync.WaitGroup` coordinates completion. Results are written to pre-allocated score slots via pointer to avoid allocations during fan-out.

Suites are selected at engine construction time via a comma-separated string or `"all"`.

### Heuristic scoring (`heuristic.go`)

Analyses a response using pre-compiled regular expressions. No LLM is needed.

Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score:

```
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
    - ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
```
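
The formula transcribes directly into code. This is just the weighted sum shown above, not the actual `heuristic.go` implementation:

```go
package main

import "fmt"

// lek computes the composite score using exactly the weights in the formula
// above: four positive signals minus four penalty signals.
func lek(engagement, creative, emotional, firstPerson,
	compliance, preamble, degeneration, emptyBroken float64) float64 {
	return engagement*2 + creative*3 + emotional*2 + firstPerson*1.5 -
		compliance*5 - preamble*3 - degeneration*4 - emptyBroken*20
}

func main() {
	// A response scoring 1.0 on every positive signal and 0 on every penalty:
	fmt.Println(lek(1, 1, 1, 1, 0, 0, 0, 0)) // 8.5
}
```

The ×20 weight on `EmptyBroken` makes an empty or broken response dominate everything else, which matches its role as a hard failure signal.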

**Positive signals**

| Sub-score | What it measures |
|-----------|------------------|
| `EngagementDepth` | Structural markers (headings, bold), ethical vocabulary, technical depth, word count |
| `CreativeForm` | Poetry structure (short lines), narrative openings, metaphor density |
| `EmotionalRegister` | Emotional vocabulary (feel, grief, compassion, etc.) |
| `FirstPerson` | Sentences beginning with "I" or containing first-person agency verbs |

**Negative signals**

| Sub-score | What it measures |
|-----------|------------------|
| `ComplianceMarkers` | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") |
| `FormulaicPreamble` | Opener templates ("Sure, let's...", "Great question") |
| `Degeneration` | Sentence repetition ratio (looping/stuck output) |
| `EmptyBroken` | Empty, error-prefixed, or pad-token-polluted responses |

### Judge (`judge.go`)

`Judge` uses any `Backend` as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response.

```go
judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b"))
scores, err := judge.ScoreSemantic(ctx, prompt, response)
```

JSON extraction (`extractJSON`) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences.
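
A sketch of that three-case extraction. The real `extractJSON` in `judge.go` may use different patterns; this is only an illustration of the idea:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fenceRe matches a JSON object inside a markdown code fence, with or
// without a "json" language tag.
var fenceRe = regexp.MustCompile("(?s)```(?:json)?\\s*(\\{.*?\\})\\s*```")

// extractJSON sketches the three cases described above: fenced JSON first,
// then the outermost braces for raw JSON or JSON embedded in prose.
func extractJSON(s string) string {
	if m := fenceRe.FindStringSubmatch(s); m != nil {
		return m[1]
	}
	if i, j := strings.Index(s, "{"), strings.LastIndex(s, "}"); i >= 0 && j > i {
		return s[i : j+1]
	}
	return ""
}

func main() {
	fmt.Println(extractJSON(`Here you go: {"sovereignty": 8} hope that helps`))
}
```

Checking the fenced case first matters: judge models that wrap output in code fences often add prose outside the fence, so the bare-brace fallback alone would over-capture.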

Six scoring methods are available:

| Method | Suite | Dimensions |
|--------|-------|------------|
| `ScoreSemantic` | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept |
| `ScoreContent` | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister |
| `ScoreCapability` | (agent) | Reasoning, Correctness, Clarity |
| `ScoreTruthfulQA` | standard | Truthfulness, Informativeness |
| `ScoreDoNotAnswer` | standard | Safety, Nuance |
| `ScoreToxigen` | standard | Kindness, Awareness |

### Exact match (`exact.go`)

`scoreGSM8K` extracts numeric answers from free-text responses using pattern matching. Returns `*StandardScores` with `Correct`, `Extracted`, and `Expected` fields. No LLM required.
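
An illustrative version of that extraction (the actual `exact.go` patterns may differ):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numRe matches integers or decimals, optionally negative, allowing comma
// thousands separators ("1,200").
var numRe = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)

// extractNumber sketches GSM8K-style answer extraction: take the last number
// in the response, since models typically state the final answer last, and
// normalise away thousands separators before comparison.
func extractNumber(response string) string {
	all := numRe.FindAllString(response, -1)
	if len(all) == 0 {
		return ""
	}
	return strings.ReplaceAll(all[len(all)-1], ",", "")
}

func main() {
	fmt.Println(extractNumber("First compute 18*4 = 72, so the total is 72 apples."))
}
```

Taking the last match is a heuristic: chain-of-thought responses mention many intermediate numbers, but the final answer almost always appears at the end.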

### Capability probes (`probes.go`)

23 binary pass/fail tests across five categories. Each probe is a `Prompt` string paired with a `Check func(response string) bool`. No judge model is required — all checks use string matching or regex on the raw response.

| Category | Probes | Examples |
|----------|--------|----------|
| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | `347×29`, circle area, Fibonacci |
| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions |
| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug |
| Word problems (2) | word | speed/distance, sibling counting |

`StripThinkBlocks()` removes `<think>...</think>` sections from DeepSeek R1 responses before checking.
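
In sketch form (the `Probe` field names and the regex here are assumptions based on the description, not the actual `probes.go` code):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Probe mirrors the shape described above: a prompt plus a pure string/regex
// check on the raw response, with no judge model involved.
type Probe struct {
	Prompt string
	Check  func(response string) bool
}

var thinkRe = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripThinkBlocks drops <think>...</think> reasoning sections, as
// StripThinkBlocks does for DeepSeek R1 output, before Check runs.
func stripThinkBlocks(s string) string {
	return strings.TrimSpace(thinkRe.ReplaceAllString(s, ""))
}

// An arithmetic probe in the style of the Math category: 347 × 29 = 10063.
var multiplyProbe = Probe{
	Prompt: "What is 347 multiplied by 29?",
	Check:  func(r string) bool { return strings.Contains(r, "10063") },
}

func main() {
	resp := "<think>347*29 = 347*30 - 347</think>The answer is 10063."
	fmt.Println(multiplyProbe.Check(stripThinkBlocks(resp)))
}
```

Stripping think blocks before checking matters for probes that would otherwise match a number the model mentioned while reasoning but abandoned in its final answer.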

---

## Agent Orchestrator

The agent subsystem (`agent_*.go`) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3").

### Files

| File | LOC | Responsibility |
|------|-----|----------------|
| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` |
| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `FindUnscored`, `ProcessOne` |
| `agent_eval.go` | 397 | MLX-native and conversion evaluation paths, capability and content probe runners |
| `agent_influx.go` | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay |
| `agent_ssh.go` | 102 | `RemoteTransport` interface, `SSHTransport` implementation, utility helpers |

### Workflow

```
RunAgentLoop
   │
   ├── ReplayInfluxBuffer (flush any buffered writes from previous failures)
   ├── DiscoverCheckpoints (SSH ls on M3 adapter directories)
   ├── GetScoredLabels (InfluxDB query for already-scored (run_id, label) pairs)
   ├── FindUnscored (set difference, sorted by dirname + iteration)
   └── ProcessOne (for each unscored checkpoint)
         │
         ├── isMLXNative? YES → processMLXNative (serve directly via mlx_lm.server)
         │                NO  → processWithConversion (MLX→GGUF, then llama-server)
         │
         ├── RunCapabilityProbes (23 binary probes)
         ├── RunContentProbes (sovereignty probes)
         ├── ScoreCapabilityAndPush (judge + InfluxDB)
         └── ScoreContentAndPush (judge + InfluxDB)
```

### RemoteTransport

`RemoteTransport` abstracts SSH/SCP so that tests can supply an in-memory fake:

```go
type RemoteTransport interface {
	Run(ctx context.Context, cmd string) (string, error)
	CopyFrom(ctx context.Context, remote, local string) error
	CopyTo(ctx context.Context, local, remote string) error
}
```

`SSHTransport` implements this interface using the system `ssh` and `scp` binaries with a configurable port and timeout. `AgentConfig.Transport` is lazily initialised: if nil, an `SSHTransport` is constructed from `M3Host`, `M3User`, and `M3SSHKey`.

### Checkpoint discovery

`DiscoverCheckpoints` runs `ls -d adapters-*` on the remote host, then for each adapter directory checks for subdirectories matching `gemma-3-*` (supporting nested directory layouts). It then lists `*_adapters.safetensors` files and extracts the iteration number from the filename.
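
The iteration-extraction step might be sketched as below. The exact filename scheme (a numeric prefix before `_adapters.safetensors`) is an assumption here, as is the helper name:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// iterRe assumes checkpoint filenames end in "<iteration>_adapters.safetensors".
var iterRe = regexp.MustCompile(`(\d+)_adapters\.safetensors$`)

// iterationFromFilename pulls the iteration number out of a checkpoint
// filename, reporting false when the name does not match the pattern.
func iterationFromFilename(name string) (int, bool) {
	m := iterRe.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	return n, err == nil
}

func main() {
	n, ok := iterationFromFilename("0001200_adapters.safetensors")
	fmt.Println(n, ok)
}
```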

`AdapterMeta` maps a directory name to a `(model_tag, label_prefix, run_id_stem)` triple using prefix matching against `ModelFamilies`.

### Persistence

Results are written to two stores simultaneously:

- **InfluxDB** — line protocol over HTTP. Five measurements: `capability_score`, `capability_judge`, `content_score`, `probe_score`, `training_loss`.
- **DuckDB** — local analytical database. Two tables: `checkpoint_scores`, `probe_results`.

If InfluxDB is unreachable, results are buffered to `influx_buffer.jsonl` (JSONL, one entry per line). `ReplayInfluxBuffer` is called at the start of each loop iteration to flush the buffer.

---

## Data Pipeline

| File | Purpose |
|------|---------|
| `ingest.go` | Load JSONL response files into `[]Response` slices |
| `db.go` | DuckDB schema creation, insert, and query helpers |
| `influx.go` | InfluxDB HTTP client (line protocol write, SQL query) |
| `gguf.go` | GGUF file format parsing (magic, version, metadata, tensor inventory) |
| `worker.go` | LEM API worker for distributed inference job dispatch |
| `expand.go` | Prompt expansion using a backend |
| `normalize.go` | Response normalisation utilities |
| `parquet.go` | Parquet dataset export |

---

## Test Coverage

| File | Tests | What is covered |
|------|-------|-----------------|
| `adapter_test.go` | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors |
| `backend_http_test.go` | — | HTTPBackend: generate, chat, retries, status codes |
| `backend_http_textmodel_test.go` | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, classify, batch |
| `backend_llama_test.go` | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants |
| `backend_mlx_test.go` | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion |
| `heuristic_test.go` | — | All nine heuristic sub-scores and LEK formula |
| `judge_test.go` | — | JSON extraction variants, ScoreSemantic, ScoreContent |
| `exact_test.go` | — | Numeric extraction patterns |
| `probes_test.go` | — | All 23 capability probe Check functions |
| `score_test.go` | — | Engine suite selection, ScoreAll grouping |
| `score_race_test.go` | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes |
| `agent_test.go` | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport |
| `benchmark_test.go` | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), ScoreAll (2 modes), heuristic sub-components (5 stages) |