go-ml/docs/architecture.md
Last commit 0cf35221e6 by Snider — "docs: document InspectAttention pass-through on InferenceAdapter" (Co-Authored-By: Virgil <virgil@lethean.io>), 2026-02-23

# go-ml Architecture
## Overview
`forge.lthn.ai/core/go-ml` is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from `go-ai` on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.
The package provides three distinct subsystems:
1. **Pluggable inference backends** — a common `Backend` interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
2. **Multi-suite scoring engine** — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
3. **Agent orchestrator** — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.
---
## Dependency Graph
```
forge.lthn.ai/core/go-ml
├── forge.lthn.ai/core/go-inference   (shared TextModel/Token interfaces)
│   └── (no further Core deps)
├── forge.lthn.ai/core/go-mlx         (Metal GPU inference, darwin/arm64 only)
│   └── forge.lthn.ai/core/go-inference
├── forge.lthn.ai/core/go             (ServiceRuntime, process, log)
├── github.com/marcboeker/go-duckdb   (analytics storage)
└── github.com/parquet-go/parquet-go  (columnar data I/O)
```
### Role of each dependency
| Module | Purpose |
|--------|---------|
| `go-inference` | Zero-dependency shared interfaces. Defines `TextModel`, `Token`, `Backend`, `GenerateConfig`. Compiles on all platforms. |
| `go-mlx` | Native Metal GPU inference for Apple Silicon. Registers the `"metal"` backend via its `init()` function. Active only on `darwin && arm64`. |
| `go` | Core framework. Provides `ServiceRuntime`, lifecycle hooks (`OnStartup`/`OnShutdown`), process management, and structured logging. |
| `go-duckdb` | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
| `parquet-go` | Columnar Parquet I/O for bulk dataset export and import. |
---
## Backend Architecture
Two interface families coexist within go-ml, connected by a set of adapters.
### The `ml.Backend` interface (compatibility layer)
```go
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
	Name() string
	Available() bool
}
```
`Backend` returns complete strings. It is the primary interface consumed by `service.go`, `judge.go`, `agent_eval.go`, and `expand.go`. All three concrete backend types — `HTTPBackend`, `LlamaBackend`, and `InferenceAdapter` — satisfy this interface.
### The `inference.TextModel` interface (preferred for new code)
Defined in `go-inference`, this interface returns `iter.Seq[inference.Token]` — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use `TextModel`.
### `ml.StreamingBackend` (deprecated)
```go
// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}
```
Only two files in `host-uk/cli` call `StreamingBackend` methods. It is retained for backward compatibility; no new code should use it.
### Type unification
`ml.Message` is a type alias for `inference.Message`:
```go
type Message = inference.Message
```
The two types are identical at compile time. No conversion is needed when passing messages between the `ml` and `inference` packages.
`ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model selection:
```go
type GenOpts struct {
	Temperature   float64
	MaxTokens     int
	Model         string // per-request model override; ignored by GPU backends
	TopK          int
	TopP          float64
	RepeatPenalty float64
}
```
---
## Backend Implementations
### HTTPBackend (`backend_http.go`)
Speaks the OpenAI-compatible `/v1/chat/completions` API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).
- Implements `ml.Backend` only (no streaming — returns complete response strings).
- Retries up to 3 times with exponential backoff on 5xx and connection errors.
- 300-second HTTP client timeout suitable for long-running inference.
### LlamaBackend (`backend_llama.go`)
Manages a `llama-server` subprocess and delegates HTTP calls to an embedded `HTTPBackend`.
- Implements `ml.Backend`.
- `Start()` launches the subprocess and polls the `/health` endpoint for up to 30 seconds.
- `Stop()` kills the managed process via the Core `process.Service`.
- Supports optional LoRA adapter loading via `--lora`.
### InferenceAdapter (`adapter.go`)
Bridges a `go-inference.TextModel` (iterator-based) into the `ml.Backend` and `ml.StreamingBackend` interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.
```
inference.TextModel (iter.Seq[Token])
    └─── InferenceAdapter ───► ml.Backend (string)
                          ───► ml.StreamingBackend (TokenCallback)
```
Key behaviours:
- `Generate` and `Chat` collect all tokens into a `strings.Builder` and return the concatenated string. After the iterator is exhausted, `model.Err()` is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
- `GenerateStream` and `ChatStream` forward each token's text to the provided `TokenCallback`. If the callback returns an error, iteration stops.
- `Available()` always returns `true` — the model is already loaded when the adapter is constructed.
- `Close()` delegates to `TextModel.Close()`, releasing GPU memory.
- `InspectAttention()` delegates to the underlying `TextModel` via type assertion to `inference.AttentionInspector`. Returns an error if the backend doesn't support attention inspection. This enables LEM's Q/K Bone Orientation analysis through the adapter without consumers needing to unwrap the underlying model.
### MLX Backend (`backend_mlx.go`, darwin/arm64 only)
```go
//go:build darwin && arm64

func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath, loadOpts...)
	if err != nil {
		return nil, err
	}
	return NewInferenceAdapter(m, "mlx"), nil
}
```
The blank import `_ "forge.lthn.ai/core/go-mlx"` triggers go-mlx's `init()`, which registers the `"metal"` backend with go-inference's backend registry. Subsequent calls to `inference.LoadModel()` automatically use Metal GPU acceleration on Apple Silicon.
The model file at `modelPath` may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's `internal/metal/` package.
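The blank-import trick relies on Go's guarantee that `init()` runs at import time. A self-contained sketch of the register-via-`init()` pattern — the registry shape here is illustrative, not go-inference's actual API:

```go
package main

import "fmt"

// registry mimics a backend registry keyed by name. In go-inference the
// real registry is richer; this sketch only shows the registration flow.
var registry = map[string]func(path string) string{}

// Register adds a named loader, as a backend package would.
func Register(name string, load func(string) string) { registry[name] = load }

// init runs at import time — the same mechanism by which the blank
// import _ "forge.lthn.ai/core/go-mlx" makes the "metal" backend
// available before any application code executes.
func init() {
	Register("metal", func(path string) string { return "metal model at " + path })
}

func main() {
	load := registry["metal"]
	fmt.Println(load("/models/gemma"))
}
```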
### Reverse adapters (`backend_http_textmodel.go`)
Two types wrap `ml` backends as `inference.TextModel`, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. `go-ai`, `go-i18n`).
| Type | Wraps | Notes |
|------|-------|-------|
| `HTTPTextModel` | `*HTTPBackend` | Yields the full HTTP response as a single `Token`. Classify returns an unsupported error. BatchGenerate processes sequentially. |
| `LlamaTextModel` | `*LlamaBackend` | Embeds `HTTPTextModel`; overrides `ModelType()` to return `"llama"` and `Close()` to call `llama.Stop()`. |
### Adapter map (all directions)
```
ml.Backend (string) <──── InferenceAdapter ──── inference.TextModel (iter.Seq[Token])
                          (adapter.go)

ml.HTTPBackend  ──── HTTPTextModel  ────► inference.TextModel
ml.LlamaBackend ──── LlamaTextModel ────► inference.TextModel
                     (backend_http_textmodel.go)
```
---
## Service Layer (`service.go`)
`Service` integrates go-ml into the Core framework lifecycle:
```go
core.New(
	framework.WithName("ml", ml.NewService(ml.Options{
		OllamaURL:   "http://localhost:11434",
		JudgeURL:    "http://localhost:11434",
		JudgeModel:  "qwen3:8b",
		Concurrency: 4,
		Suites:      "all",
	})),
)
```
`OnStartup` registers the Ollama backend and initialises the `Judge` and scoring `Engine` if a judge URL is configured. Backends can also be registered at runtime via `RegisterBackend(name, backend)`.
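Runtime registration implies a concurrency-safe backend map. A sketch of what `RegisterBackend(name, backend)` suggests — the field names and lookup method are assumptions, not the real `Service` internals:

```go
package main

import (
	"fmt"
	"sync"
)

// Backend is reduced to its Name method for this sketch.
type Backend interface{ Name() string }

type named string

func (n named) Name() string { return string(n) }

// Service sketches the runtime-registration path: an RWMutex-guarded
// map so backends can be registered after OnStartup without races.
type Service struct {
	mu       sync.RWMutex
	backends map[string]Backend
}

func (s *Service) RegisterBackend(name string, b Backend) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.backends == nil {
		s.backends = map[string]Backend{}
	}
	s.backends[name] = b
}

func (s *Service) Backend(name string) (Backend, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	b, ok := s.backends[name]
	return b, ok
}

func main() {
	var s Service
	s.RegisterBackend("ollama", named("ollama"))
	b, ok := s.Backend("ollama")
	fmt.Println(b.Name(), ok)
}
```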
---
## Scoring Engine
### Engine (`score.go`)
`Engine.ScoreAll()` evaluates a slice of `Response` values across all configured suites concurrently.
```
ScoreAll(responses []Response) map[string][]PromptScore
├── Heuristic (inline, no goroutine)
└── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)
```
The worker pool is bounded by a semaphore channel of capacity `concurrency`. `sync.WaitGroup` coordinates completion. Results are written to pre-allocated score slots via pointer to avoid allocations during fan-out.
Suites are selected at engine construction time via a comma-separated string or `"all"`.
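The fan-out described above — semaphore channel, `sync.WaitGroup`, pre-allocated result slots — can be sketched end to end. The `score` function and flat result slice are simplifications of the real per-suite `PromptScore` plumbing:

```go
package main

import (
	"fmt"
	"sync"
)

// scoreAll sketches the Engine.ScoreAll fan-out: a semaphore channel
// bounds in-flight workers, a WaitGroup joins them, and each goroutine
// writes only to its own pre-allocated slot, so no lock is needed on
// the results.
func scoreAll(responses []string, concurrency int, score func(string) float64) []float64 {
	scores := make([]float64, len(responses)) // pre-allocated slots
	sem := make(chan struct{}, concurrency)   // capacity == concurrency
	var wg sync.WaitGroup
	for i, r := range responses {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(i int, r string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			scores[i] = score(r)
		}(i, r)
	}
	wg.Wait()
	return scores
}

func main() {
	got := scoreAll([]string{"a", "bb", "ccc"}, 2, func(s string) float64 {
		return float64(len(s))
	})
	fmt.Println(got) // prints "[1 2 3]"
}
```

Writing into disjoint indices of a pre-sized slice is race-free by construction, which is what the race tests in `score_race_test.go` exercise.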
### Heuristic scoring (`heuristic.go`)
Analyses a response using pre-compiled regular expressions. No LLM is needed.
Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score:
```
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
- ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
```
**Positive signals**
| Sub-score | What it measures |
|-----------|-----------------|
| `EngagementDepth` | Structural markers (headings, bold), ethical vocabulary, technical depth, word count |
| `CreativeForm` | Poetry structure (short lines), narrative openings, metaphor density |
| `EmotionalRegister` | Emotional vocabulary (feel, grief, compassion, etc.) |
| `FirstPerson` | Sentences beginning with "I" or containing first-person agency verbs |
**Negative signals**
| Sub-score | What it measures |
|-----------|-----------------|
| `ComplianceMarkers` | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") |
| `FormulaicPreamble` | Opener templates ("Sure, let's...", "Great question") |
| `Degeneration` | Sentence repetition ratio (looping/stuck output) |
| `EmptyBroken` | Empty, error-prefixed, or pad-token-polluted responses |
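The LEK formula translates directly into code. A sketch applying the weights shown above — the input values in `main` are arbitrary examples, and sub-score ranges are whatever `heuristic.go` produces:

```go
package main

import "fmt"

// lek applies the composite weights from the LEK formula to the eight
// weighted sub-scores computed by the heuristic regexes.
func lek(engagementDepth, creativeForm, emotionalRegister, firstPerson,
	complianceMarkers, formulaicPreamble, degeneration, emptyBroken float64) float64 {
	return engagementDepth*2 + creativeForm*3 + emotionalRegister*2 + firstPerson*1.5 -
		complianceMarkers*5 - formulaicPreamble*3 - degeneration*4 - emptyBroken*20
}

func main() {
	// An engaged, creative response with no negative signals:
	fmt.Println(lek(3, 1, 2, 2, 0, 0, 0, 0)) // 6 + 3 + 4 + 3 = 16
	// An empty/broken response is penalised hardest (weight 20):
	fmt.Println(lek(0, 0, 0, 0, 0, 0, 0, 1)) // -20
}
```

Note the weighting order: `EmptyBroken` dominates (×20), so a pad-token-polluted response scores badly regardless of its other signals.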
### Judge (`judge.go`)
`Judge` uses any `Backend` as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response.
```go
judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b"))
scores, err := judge.ScoreSemantic(ctx, prompt, response)
```
JSON extraction (`extractJSON`) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences.
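An illustrative version of the three cases — raw JSON, JSON embedded in prose, and JSON inside a markdown fence. The real `extractJSON` may be implemented differently; this sketch only demonstrates the behaviour described:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extractJSON sketches judge.go's extraction: strip a markdown code
// fence if present, then parse from the first '{' to the last '}'.
func extractJSON(s string) (map[string]any, error) {
	// Case: JSON inside a ```json ... ``` fence.
	if i := strings.Index(s, "```"); i >= 0 {
		s = s[i+3:]
		s = strings.TrimPrefix(s, "json")
		if j := strings.Index(s, "```"); j >= 0 {
			s = s[:j]
		}
	}
	// Cases: raw JSON, or JSON surrounded by prose.
	start, end := strings.Index(s, "{"), strings.LastIndex(s, "}")
	if start < 0 || end <= start {
		return nil, fmt.Errorf("no JSON object found")
	}
	var m map[string]any
	if err := json.Unmarshal([]byte(s[start:end+1]), &m); err != nil {
		return nil, err
	}
	return m, nil
}

func main() {
	m, err := extractJSON("Here are the scores:\n```json\n{\"Sovereignty\": 8}\n```")
	fmt.Println(m["Sovereignty"], err)
}
```

Taking the outermost `{...}` span keeps nested objects intact while discarding any prose the judge model wraps around its answer.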
Six scoring methods are available:
| Method | Suite | Dimensions |
|--------|-------|-----------|
| `ScoreSemantic` | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept |
| `ScoreContent` | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister |
| `ScoreCapability` | (agent) | Reasoning, Correctness, Clarity |
| `ScoreTruthfulQA` | standard | Truthfulness, Informativeness |
| `ScoreDoNotAnswer` | standard | Safety, Nuance |
| `ScoreToxigen` | standard | Kindness, Awareness |
### Exact match (`exact.go`)
`scoreGSM8K` extracts numeric answers from free-text responses using pattern matching. Returns `*StandardScores` with `Correct`, `Extracted`, and `Expected` fields. No LLM required.
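A sketch of the extraction idea: take the last number in the response, since models typically state the final answer last. The real `scoreGSM8K` pattern set may be richer (answer markers, boxed values, etc.); this shows only the core regex approach:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numRe matches integers or decimals, optionally comma-grouped.
var numRe = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)

// extractAnswer returns the last number in the response with comma
// grouping stripped, and whether any number was found at all.
func extractAnswer(response string) (string, bool) {
	matches := numRe.FindAllString(response, -1)
	if len(matches) == 0 {
		return "", false
	}
	return strings.ReplaceAll(matches[len(matches)-1], ",", ""), true
}

func main() {
	got, ok := extractAnswer("First 48 / 2 = 24. So the answer is 1,224.")
	fmt.Println(got, ok) // prints "1224 true"
}
```

The extracted string would then be compared against `Expected` to populate the `Correct` field of `StandardScores`.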
### Capability probes (`probes.go`)
23 binary pass/fail tests across five categories. Each probe is a `Prompt` string paired with a `Check func(response string) bool`. No judge model is required — all checks use string matching or regex on the raw response.
| Category | Probes | Examples |
|----------|--------|---------|
| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | `347×29`, circle area, Fibonacci |
| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions |
| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug |
| Word problems (2) | word | speed/distance, sibling counting |
`StripThinkBlocks()` removes `<think>...</think>` sections from DeepSeek R1 responses before checking.
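A sketch of that stripping step — a single regex with the `(?s)` flag so `.` spans newlines, since R1 reasoning blocks are usually multi-line:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// thinkRe removes <think>...</think> blocks; (?s) lets '.' match
// newlines, and the non-greedy .*? stops at the first closing tag.
var thinkRe = regexp.MustCompile(`(?s)<think>.*?</think>`)

// StripThinkBlocks sketches the pre-check cleanup applied to
// DeepSeek R1 responses before probe Check functions run.
func StripThinkBlocks(s string) string {
	return strings.TrimSpace(thinkRe.ReplaceAllString(s, ""))
}

func main() {
	out := StripThinkBlocks("<think>\nLet me compute 347*29...\n</think>\nThe answer is 10063.")
	fmt.Println(out) // prints "The answer is 10063."
}
```

Without this step, probe regexes would match numbers and keywords inside the model's scratch reasoning rather than its final answer.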
---
## Agent Orchestrator
The agent subsystem (`agent_*.go`) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3").
### Files
| File | LOC | Responsibility |
|------|-----|---------------|
| `agent_config.go` | 97 | `AgentConfig`, `Checkpoint`, `BaseModelMap`, `ModelFamilies`, `AdapterMeta()` |
| `agent_execute.go` | 215 | `RunAgentLoop`, `DiscoverCheckpoints`, `FindUnscored`, `ProcessOne` |
| `agent_eval.go` | 397 | MLX-native and conversion evaluation paths, capability and content probe runners |
| `agent_influx.go` | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay |
| `agent_ssh.go` | 102 | `RemoteTransport` interface, `SSHTransport` implementation, utility helpers |
### Workflow
```
RunAgentLoop
├── ReplayInfluxBuffer   (flush any buffered writes from previous failures)
├── DiscoverCheckpoints  (SSH ls on M3 adapter directories)
├── GetScoredLabels      (InfluxDB query for already-scored (run_id, label) pairs)
├── FindUnscored         (set difference, sorted by dirname + iteration)
└── ProcessOne           (for each unscored checkpoint)
    ├── isMLXNative? YES → processMLXNative       (serve directly via mlx_lm.server)
    │             NO  → processWithConversion  (MLX→GGUF, then llama-server)
    ├── RunCapabilityProbes    (23 binary probes)
    ├── RunContentProbes       (sovereignty probes)
    ├── ScoreCapabilityAndPush (judge + InfluxDB)
    └── ScoreContentAndPush    (judge + InfluxDB)
```
### RemoteTransport
`RemoteTransport` abstracts SSH/SCP so that tests can supply an in-memory fake:
```go
type RemoteTransport interface {
	Run(ctx context.Context, cmd string) (string, error)
	CopyFrom(ctx context.Context, remote, local string) error
	CopyTo(ctx context.Context, local, remote string) error
}
```
`SSHTransport` implements this interface using the system `ssh` and `scp` binaries with a configurable port and timeout. `AgentConfig.Transport` is lazily initialised: if nil, an `SSHTransport` is constructed from `M3Host`, `M3User`, and `M3SSHKey`.
### Checkpoint discovery
`DiscoverCheckpoints` runs `ls -d adapters-*` on the remote host, then for each adapter directory checks for subdirectories matching `gemma-3-*` (supporting nested directory layouts). It then lists `*_adapters.safetensors` files and extracts the iteration number from the filename.
`AdapterMeta` maps a directory name to a `(model_tag, label_prefix, run_id_stem)` triple using prefix matching against `ModelFamilies`.
### Persistence
Results are written to two stores simultaneously:
- **InfluxDB** — line protocol over HTTP. Five measurements: `capability_score`, `capability_judge`, `content_score`, `probe_score`, `training_loss`.
- **DuckDB** — local analytical database. Two tables: `checkpoint_scores`, `probe_results`.
If InfluxDB is unreachable, results are buffered to `influx_buffer.jsonl` (JSONL, one entry per line). `ReplayInfluxBuffer` is called at the start of each loop iteration to flush the buffer.
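The line-protocol writes can be sketched as a small encoder. Tag/field escaping is omitted and the measurement, tag, and field names in `main` are illustrative, though the measurement name matches the list above:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortedKeys returns map keys in sorted order so the encoded line is
// deterministic (InfluxDB also prefers sorted tag keys for performance).
func sortedKeys(m map[string]string) []string {
	ks := make([]string, 0, len(m))
	for k := range m {
		ks = append(ks, k)
	}
	sort.Strings(ks)
	return ks
}

// encodeLine sketches InfluxDB line protocol:
// measurement,tag1=v1,tag2=v2 field1=1.0,field2=2.0 <timestamp>
func encodeLine(name string, tags map[string]string, fields map[string]float64, ts int64) string {
	var b strings.Builder
	b.WriteString(name)
	for _, k := range sortedKeys(tags) {
		fmt.Fprintf(&b, ",%s=%s", k, tags[k])
	}
	b.WriteByte(' ')
	fks := make([]string, 0, len(fields))
	for k := range fields {
		fks = append(fks, k)
	}
	sort.Strings(fks)
	for i, k := range fks {
		if i > 0 {
			b.WriteByte(',')
		}
		fmt.Fprintf(&b, "%s=%g", k, fields[k])
	}
	fmt.Fprintf(&b, " %d", ts)
	return b.String()
}

func main() {
	line := encodeLine("capability_score",
		map[string]string{"run_id": "gemma-3-4b", "label": "iter-600"},
		map[string]float64{"pass_rate": 0.87},
		1700000000000000000)
	fmt.Println(line)
}
```

When the HTTP push fails, lines like this (or their JSON equivalents) are what lands in `influx_buffer.jsonl` for `ReplayInfluxBuffer` to retry.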
---
## Data Pipeline
| File | Purpose |
|------|---------|
| `ingest.go` | Load JSONL response files into `[]Response` slices |
| `db.go` | DuckDB schema creation, insert, and query helpers |
| `influx.go` | InfluxDB HTTP client (line protocol write, SQL query) |
| `gguf.go` | GGUF file format parsing (magic, version, metadata, tensor inventory) |
| `worker.go` | LEM API worker for distributed inference job dispatch |
| `expand.go` | Prompt expansion using a backend |
| `normalize.go` | Response normalisation utilities |
| `parquet.go` | Parquet dataset export |
---
## Test Coverage
| File | Tests | What is covered |
|------|-------|----------------|
| `adapter_test.go` | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors |
| `backend_http_test.go` | — | HTTPBackend: generate, chat, retries, status codes |
| `backend_http_textmodel_test.go` | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, classify, batch |
| `backend_llama_test.go` | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants |
| `backend_mlx_test.go` | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion |
| `heuristic_test.go` | — | All nine heuristic sub-scores and LEK formula |
| `judge_test.go` | — | JSON extraction variants, ScoreSemantic, ScoreContent |
| `exact_test.go` | — | Numeric extraction patterns |
| `probes_test.go` | — | All 23 capability probe Check functions |
| `score_test.go` | — | Engine suite selection, ScoreAll grouping |
| `score_race_test.go` | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes |
| `agent_test.go` | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport |
| `benchmark_test.go` | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), ScoreAll (2 modes), heuristic sub-components (5 stages) |