`forge.lthn.ai/core/go-ml` is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from `go-ai` on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.
The package provides three distinct subsystems:
1. **Pluggable inference backends** — a common `Backend` interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
2. **Multi-suite scoring engine** — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
3. **Agent orchestrator** — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.
└── github.com/parquet-go/parquet-go (columnar data I/O)
```
### Role of each dependency
| Module | Purpose |
|--------|---------|
| `go-inference` | Zero-dependency shared interfaces. Defines `TextModel`, `Token`, `Backend`, `GenerateConfig`. Compiles on all platforms. |
| `go-mlx` | Native Metal GPU inference for Apple Silicon. Registers the `"metal"` backend via its `init()` function. Active only on `darwin && arm64`. |
| `go` | Core framework. Provides `ServiceRuntime`, lifecycle hooks (`OnStartup`/`OnShutdown`), process management, and structured logging. |
| `go-duckdb` | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
| `parquet-go` | Columnar Parquet I/O for bulk dataset export and import. |
---
## Backend Architecture
Two interface families coexist within go-ml, connected by a set of adapters.
### The `ml.Backend` interface (compatibility layer)
`Backend` returns complete strings. It is the primary interface consumed by `service.go`, `judge.go`, `agent_eval.go`, and `expand.go`. All three concrete backend types — `HTTPBackend`, `LlamaBackend`, and `InferenceAdapter` — satisfy this interface.
### The `inference.TextModel` interface (preferred for new code)
Defined in `go-inference`, this interface returns `iter.Seq[inference.Token]` — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use `TextModel`.
### `ml.StreamingBackend` (deprecated)
```go
// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
```
Only two files in `host-uk/cli` call `StreamingBackend` methods. It is retained for backward compatibility; no new code should use it.
### Type unification
`ml.Message` is a type alias for `inference.Message`:
```go
type Message = inference.Message
```
The two types are identical at compile time. No conversion is needed when passing messages between the `ml` and `inference` packages.
`ml.GenOpts` extends `inference.GenerateConfig` with a `Model` field for per-request model selection:
```go
type GenOpts struct {
	Temperature   float64
	MaxTokens     int
	Model         string // per-request model override; ignored by GPU backends
	TopK          int
	TopP          float64
	RepeatPenalty float64
}
```
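The `Model` override pattern can be sketched as follows. The `resolveModel` helper is hypothetical — it illustrates how an HTTP backend might honour the per-request field while a GPU backend (with one loaded model) ignores it; it is not a function in go-ml.

```go
package main

import "fmt"

// GenOpts mirrors the struct above, reproduced so this sketch is self-contained.
type GenOpts struct {
	Temperature   float64
	MaxTokens     int
	Model         string // per-request model override; ignored by GPU backends
	TopK          int
	TopP          float64
	RepeatPenalty float64
}

// resolveModel (hypothetical) shows the override: an HTTP backend substitutes
// opts.Model when set, otherwise falls back to its configured default.
func resolveModel(defaultModel string, opts GenOpts) string {
	if opts.Model != "" {
		return opts.Model
	}
	return defaultModel
}

func main() {
	opts := GenOpts{Temperature: 0.7, MaxTokens: 512, Model: "llama3.2:3b"}
	fmt.Println(resolveModel("gemma-3-4b", opts)) // per-request override wins
	fmt.Println(resolveModel("gemma-3-4b", GenOpts{}))
}
```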
---
## Backend Implementations
### HTTPBackend (`backend_http.go`)
Speaks the OpenAI-compatible `/v1/chat/completions` API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).
- Implements `ml.Backend` only (no streaming — returns complete response strings).
- Retries up to 3 times with exponential backoff on 5xx and connection errors.
- 300-second HTTP client timeout suitable for long-running inference.
### LlamaBackend (`backend_llama.go`)
Manages a `llama-server` subprocess and delegates HTTP calls to an embedded `HTTPBackend`.
- Implements `ml.Backend`.
- `Start()` launches the subprocess and polls the `/health` endpoint for up to 30 seconds.
- `Stop()` kills the managed process via the Core `process.Service`.
- Supports optional LoRA adapter loading via `--lora`.
### InferenceAdapter (`adapter.go`)
Bridges a `go-inference.TextModel` (iterator-based) into the `ml.Backend` and `ml.StreamingBackend` interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.
```
inference.TextModel (iter.Seq[Token])
        │
        └─── InferenceAdapter ─┬─► ml.Backend (string)
                               └─► ml.StreamingBackend (TokenCallback)
```
Key behaviours:
- `Generate` and `Chat` collect all tokens into a `strings.Builder` and return the concatenated string. After the iterator is exhausted, `model.Err()` is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
- `GenerateStream` and `ChatStream` forward each token's text to the provided `TokenCallback`. If the callback returns an error, iteration stops.
- `Available()` always returns `true` — the model is already loaded when the adapter is constructed.
- `Close()` delegates to `TextModel.Close()`, releasing GPU memory.
- `InspectAttention()` delegates to the underlying `TextModel` via type assertion to `inference.AttentionInspector`. Returns an error if the backend doesn't support attention inspection. This enables LEM's Q/K Bone Orientation analysis through the adapter without consumers needing to unwrap the underlying model.
```go
m, err := inference.LoadModel(modelPath, loadOpts...)
// ...
return NewInferenceAdapter(m, "mlx"), nil
}
```
The blank import `_ "forge.lthn.ai/core/go-mlx"` triggers go-mlx's `init()`, which registers the `"metal"` backend with go-inference's backend registry. Subsequent calls to `inference.LoadModel()` automatically use Metal GPU acceleration on Apple Silicon.
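The registration mechanism is the standard Go blank-import pattern: a package-level registry populated from `init()`. The sketch below shows the pattern in miniature; the names and signatures are illustrative, not the real go-inference API, and in the real layout the `init()` lives in go-mlx.

```go
package main

import "fmt"

// backends is a minimal registry: name -> loader. Illustrative only.
var backends = map[string]func(path string) (string, error){}

// Register is called by backend packages from their init() functions.
func Register(name string, load func(string) (string, error)) {
	backends[name] = load
}

// In the real layout this init() lives in go-mlx; the blank import
// `_ "forge.lthn.ai/core/go-mlx"` is what causes it to run.
func init() {
	Register("metal", func(path string) (string, error) {
		return "metal model at " + path, nil
	})
}

// load looks up a registered backend by name, as LoadModel-style code would.
func load(backend, path string) (string, error) {
	f, ok := backends[backend]
	if !ok {
		return "", fmt.Errorf("unknown backend %q", backend)
	}
	return f(path)
}

func main() {
	m, err := load("metal", "/models/gemma-3-4b")
	fmt.Println(m, err)
}
```

Importing the backend package for its side effect keeps go-ml free of a hard dependency on Metal-specific code on non-Darwin platforms.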
The model file at `modelPath` may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's `internal/metal/` package.
Two types wrap `ml` backends as `inference.TextModel`, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. `go-ai`, `go-i18n`).
| Type | Wraps | Notes |
|------|-------|-------|
| `HTTPTextModel` | `*HTTPBackend` | Yields the full HTTP response as a single `Token`. Classify returns an unsupported error. BatchGenerate processes sequentially. |
`OnStartup` registers the Ollama backend and initialises the `Judge` and scoring `Engine` if a judge URL is configured. Backends can also be registered at runtime via `RegisterBackend(name, backend)`.
---
## Scoring Engine
### Engine (`score.go`)
`Engine.ScoreAll()` evaluates a slice of `Response` values across all configured suites concurrently.
```
Engine.ScoreAll()
└── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)
```
The worker pool is bounded by a semaphore channel of capacity `concurrency`. `sync.WaitGroup` coordinates completion. Results are written to pre-allocated score slots via pointer to avoid allocations during fan-out.
Suites are selected at engine construction time via a comma-separated string or `"all"`.
### Heuristic scoring (`heuristic.go`)
Analyses a response using pre-compiled regular expressions. No LLM is needed.
Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score.

The standard suite pairs each benchmark scorer with the dimensions it scores:

| Scorer | Suite | Dimensions |
|--------|-------|------------|
| `ScoreTruthfulQA` | standard | Truthfulness, Informativeness |
| `ScoreDoNotAnswer` | standard | Safety, Nuance |
| `ScoreToxigen` | standard | Kindness, Awareness |
### Exact match (`exact.go`)
`scoreGSM8K` extracts numeric answers from free-text responses using pattern matching. Returns `*StandardScores` with `Correct`, `Extracted`, and `Expected` fields. No LLM required.
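The extraction step can be sketched with a last-number heuristic: GSM8K answers are numeric, and models typically state the final answer last. The regex and normalisation below are illustrative, not the exact pattern used by `exact.go`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numberRe matches integers and decimals, with optional sign and
// thousands separators (e.g. "-3", "1,000", "2.5").
var numberRe = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)

// extractFinalNumber returns the last number in the response, with
// thousands separators stripped so "1,000" compares equal to "1000".
func extractFinalNumber(response string) string {
	matches := numberRe.FindAllString(response, -1)
	if len(matches) == 0 {
		return ""
	}
	return strings.ReplaceAll(matches[len(matches)-1], ",", "")
}

func main() {
	fmt.Println(extractFinalNumber("First 8 eggs, then 8 more, so the answer is 16."))
	fmt.Println(extractFinalNumber("That comes to 1,000 in total."))
}
```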
### Capability probes (`probes.go`)
23 binary pass/fail tests across four categories. Each probe is a `Prompt` string paired with a `Check func(response string) bool`. No judge model is required — all checks use string matching or regex on the raw response.
| Category | Probes | Examples |
|----------|--------|---------|
| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | `347×29`, circle area, Fibonacci |
| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
| Word problems (2) | word | speed/distance, sibling counting |
`StripThinkBlocks()` removes `<think>...</think>` sections from DeepSeek R1 responses before checking.
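A minimal sketch of that stripping step, using Go's `(?s)` flag so the pattern matches across newlines (the exact regex in probes.go may differ):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// thinkRe matches <think>...</think> spans, including multi-line ones;
// (?s) makes `.` match newlines, and .*? keeps the match non-greedy.
var thinkRe = regexp.MustCompile(`(?s)<think>.*?</think>`)

// StripThinkBlocks removes reasoning spans so the probe's Check function
// only sees the model's final answer.
func StripThinkBlocks(response string) string {
	return strings.TrimSpace(thinkRe.ReplaceAllString(response, ""))
}

func main() {
	raw := "<think>\n347*29 = 347*30 - 347\n</think>\nThe answer is 10063."
	fmt.Println(StripThinkBlocks(raw))
}
```

The non-greedy `.*?` matters: with a greedy `.*`, two think blocks in one response would be merged into a single match, deleting the answer text between them.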
---
## Agent Orchestrator
The agent subsystem (`agent_*.go`) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote Apple M3 Mac, referred to internally as "M3".
`SSHTransport` implements the agent's remote-transport interface using the system `ssh` and `scp` binaries with a configurable port and timeout. `AgentConfig.Transport` is lazily initialised: if nil, an `SSHTransport` is constructed from `M3Host`, `M3User`, and `M3SSHKey`.
### Checkpoint discovery
`DiscoverCheckpoints` runs `ls -d adapters-*` on the remote host, then for each adapter directory checks for subdirectories matching `gemma-3-*` (supporting nested directory layouts). It then lists `*_adapters.safetensors` files and extracts the iteration number from the filename.
`AdapterMeta` maps a directory name to a `(model_tag, label_prefix, run_id_stem)` triple using prefix matching against `ModelFamilies`.
### Persistence
Results are written to two stores simultaneously:
- **InfluxDB** — line protocol over HTTP. Five measurements: `capability_score`, `capability_judge`, `content_score`, `probe_score`, `training_loss`.
- **DuckDB** — local analytical database. Two tables: `checkpoint_scores`, `probe_results`.
If InfluxDB is unreachable, results are buffered to `influx_buffer.jsonl` (JSONL, one entry per line). `ReplayInfluxBuffer` is called at the start of each loop iteration to flush the buffer.