go-ml Architecture
Overview
forge.lthn.ai/core/go-ml is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from go-ai on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.
The package provides three distinct subsystems:
- Pluggable inference backends — a common Backend interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
- Multi-suite scoring engine — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
- Agent orchestrator — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.
Dependency Graph
forge.lthn.ai/core/go-ml
├── forge.lthn.ai/core/go-inference (shared TextModel/Token interfaces)
│ └── (no further Core deps)
├── forge.lthn.ai/core/go-mlx (Metal GPU inference, darwin/arm64 only)
│ └── forge.lthn.ai/core/go-inference
├── forge.lthn.ai/core/go (ServiceRuntime, process, log)
├── github.com/marcboeker/go-duckdb (analytics storage)
└── github.com/parquet-go/parquet-go (columnar data I/O)
Role of each dependency
| Module | Purpose |
|---|---|
| go-inference | Zero-dependency shared interfaces. Defines TextModel, Token, Backend, GenerateConfig. Compiles on all platforms. |
| go-mlx | Native Metal GPU inference for Apple Silicon. Registers the "metal" backend via its init() function. Active only on darwin && arm64. |
| go | Core framework. Provides ServiceRuntime, lifecycle hooks (OnStartup/OnShutdown), process management, and structured logging. |
| go-duckdb | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
| parquet-go | Columnar Parquet I/O for bulk dataset export and import. |
Backend Architecture
Two interface families coexist within go-ml, connected by a set of adapters.
The ml.Backend interface (compatibility layer)
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
	Name() string
	Available() bool
}
Backend returns complete strings. It is the primary interface consumed by service.go, judge.go, agent_eval.go, and expand.go. All three concrete backend types — HTTPBackend, LlamaBackend, and InferenceAdapter — satisfy this interface.
The inference.TextModel interface (preferred for new code)
Defined in go-inference, this interface returns iter.Seq[inference.Token] — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use TextModel.
ml.StreamingBackend (deprecated)
// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}
Only two files in host-uk/cli call StreamingBackend methods. It is retained for backward compatibility; no new code should use it.
Type unification
ml.Message is a type alias for inference.Message:
type Message = inference.Message
The two types are identical at compile time. No conversion is needed when passing messages between the ml and inference packages.
ml.GenOpts extends inference.GenerateConfig with a Model field for per-request model selection:
type GenOpts struct {
	Temperature   float64
	MaxTokens     int
	Model         string // per-request model override; ignored by GPU backends
	TopK          int
	TopP          float64
	RepeatPenalty float64
}
Backend Implementations
HTTPBackend (backend_http.go)
Speaks the OpenAI-compatible /v1/chat/completions API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).
- Implements ml.Backend only (no streaming — returns complete response strings).
- Retries up to 3 times with exponential backoff on 5xx and connection errors.
- Uses a 300-second HTTP client timeout, suitable for long-running inference.
LlamaBackend (backend_llama.go)
Manages a llama-server subprocess and delegates HTTP calls to an embedded HTTPBackend.
- Implements ml.Backend.
- Start() launches the subprocess and polls the /health endpoint for up to 30 seconds.
- Stop() kills the managed process via the Core process.Service.
- Supports optional LoRA adapter loading via --lora.
InferenceAdapter (adapter.go)
Bridges a go-inference.TextModel (iterator-based) into the ml.Backend and ml.StreamingBackend interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.
inference.TextModel (iter.Seq[Token])
        │
        └─── InferenceAdapter ───► ml.Backend (string)
                                ───► ml.StreamingBackend (TokenCallback)
Key behaviours:
- Generate and Chat collect all tokens into a strings.Builder and return the concatenated string. After the iterator is exhausted, model.Err() is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
- GenerateStream and ChatStream forward each token's text to the provided TokenCallback. If the callback returns an error, iteration stops.
- Available() always returns true — the model is already loaded when the adapter is constructed.
- Close() delegates to TextModel.Close(), releasing GPU memory.
- InspectAttention() delegates to the underlying TextModel via type assertion to inference.AttentionInspector, returning an error if the backend doesn't support attention inspection. This enables LEM's Q/K Bone Orientation analysis through the adapter without consumers needing to unwrap the underlying model.
MLX Backend (backend_mlx.go, darwin/arm64 only)
//go:build darwin && arm64
func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath, loadOpts...)
	// ...
	return NewInferenceAdapter(m, "mlx"), nil
}
The blank import _ "forge.lthn.ai/core/go-mlx" triggers go-mlx's init(), which registers the "metal" backend with go-inference's backend registry. Subsequent calls to inference.LoadModel() automatically use Metal GPU acceleration on Apple Silicon.
The modelPath argument may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's internal/metal/ package.
Reverse adapters (backend_http_textmodel.go)
Two types wrap ml backends as inference.TextModel, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. go-ai, go-i18n).
| Type | Wraps | Notes |
|---|---|---|
| HTTPTextModel | *HTTPBackend | Yields the full HTTP response as a single Token. Classify returns an unsupported error. BatchGenerate processes sequentially. |
| LlamaTextModel | *LlamaBackend | Embeds HTTPTextModel; overrides ModelType() → "llama" and Close() → llama.Stop(). |
Adapter map (all directions)
ml.Backend (string) <──── InferenceAdapter ──── inference.TextModel (iter.Seq[Token])
                          (adapter.go)

ml.HTTPBackend  ──── HTTPTextModel  ────► inference.TextModel
ml.LlamaBackend ──── LlamaTextModel ────► inference.TextModel
                     (backend_http_textmodel.go)
Service Layer (service.go)
Service integrates go-ml into the Core framework lifecycle:
core.New(
	framework.WithName("ml", ml.NewService(ml.Options{
		OllamaURL:   "http://localhost:11434",
		JudgeURL:    "http://localhost:11434",
		JudgeModel:  "qwen3:8b",
		Concurrency: 4,
		Suites:      "all",
	})),
)
OnStartup registers the Ollama backend and initialises the Judge and scoring Engine if a judge URL is configured. Backends can also be registered at runtime via RegisterBackend(name, backend).
Scoring Engine
Engine (score.go)
Engine.ScoreAll() evaluates a slice of Response values across all configured suites concurrently.
ScoreAll(responses []Response) map[string][]PromptScore
    │
    ├── Heuristic (inline, no goroutine)
    └── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)
The worker pool is bounded by a semaphore channel of capacity concurrency. sync.WaitGroup coordinates completion. Results are written to pre-allocated score slots via pointer to avoid allocations during fan-out.
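The semaphore-plus-WaitGroup pattern with pre-allocated result slots can be sketched as follows. This is a generic illustration of the technique, not the Engine's actual code; the score function is a placeholder:

```go
package main

import (
	"fmt"
	"sync"
)

// scoreAll fans work out over a semaphore-bounded goroutine pool,
// writing each result into a pre-allocated slot so no locking or
// appending is needed during fan-out — the pattern described above.
func scoreAll(items []string, concurrency int, score func(string) float64) []float64 {
	results := make([]float64, len(items)) // pre-allocated slots, one per item
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, it := range items {
		wg.Add(1)
		go func(i int, it string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot in the pool
			defer func() { <-sem }() // release it when done
			results[i] = score(it)   // each goroutine owns exactly one index
		}(i, it)
	}
	wg.Wait()
	return results
}

func main() {
	out := scoreAll([]string{"a", "bb", "ccc"}, 2, func(s string) float64 {
		return float64(len(s)) // placeholder scoring function
	})
	fmt.Println(out)
}
```

Writing through distinct indices means no goroutine ever touches another's slot, which is why the fan-out is race-free without a mutex.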
Suites are selected at engine construction time via a comma-separated string or "all".
Heuristic scoring (heuristic.go)
Analyses a response using pre-compiled regular expressions. No LLM is needed.
Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score:
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
- ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
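The composite formula above translates directly into a weighted sum. The struct and function below are a sketch using the document's weights; field names follow the sub-score names, and the real heuristic.go may clamp or normalise further:

```go
package main

import "fmt"

// SubScores holds the heuristic sub-scores that appear in the LEK
// composite formula (struct shape is illustrative).
type SubScores struct {
	EngagementDepth, CreativeForm, EmotionalRegister, FirstPerson   float64
	ComplianceMarkers, FormulaicPreamble, Degeneration, EmptyBroken float64
}

// LEK applies the weights from the formula above: positive signals
// reward engaged, creative output; negative signals penalise RLHF
// boilerplate, looping, and broken responses.
func LEK(s SubScores) float64 {
	return s.EngagementDepth*2 + s.CreativeForm*3 + s.EmotionalRegister*2 + s.FirstPerson*1.5 -
		s.ComplianceMarkers*5 - s.FormulaicPreamble*3 - s.Degeneration*4 - s.EmptyBroken*20
}

func main() {
	// A response with some engagement but one compliance marker nets out negative.
	fmt.Println(LEK(SubScores{EngagementDepth: 1, ComplianceMarkers: 1}))
}
```

Note the EmptyBroken weight of 20: a single empty or pad-polluted response overwhelms every positive signal, which is the intended behaviour for clearly broken output.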
Positive signals
| Sub-score | What it measures |
|---|---|
| EngagementDepth | Structural markers (headings, bold), ethical vocabulary, technical depth, word count |
| CreativeForm | Poetry structure (short lines), narrative openings, metaphor density |
| EmotionalRegister | Emotional vocabulary (feel, grief, compassion, etc.) |
| FirstPerson | Sentences beginning with "I" or containing first-person agency verbs |
Negative signals
| Sub-score | What it measures |
|---|---|
| ComplianceMarkers | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") |
| FormulaicPreamble | Opener templates ("Sure, let's...", "Great question") |
| Degeneration | Sentence repetition ratio (looping/stuck output) |
| EmptyBroken | Empty, error-prefixed, or pad-token-polluted responses |
Judge (judge.go)
Judge uses any Backend as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response.
judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b"))
scores, err := judge.ScoreSemantic(ctx, prompt, response)
JSON extraction (extractJSON) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences.
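A simplified version of that extraction logic looks like the following. This is a sketch of the behaviour described above (fence stripping, then taking the outermost brace span), not the actual extractJSON in judge.go:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extractJSON pulls a JSON object out of a judge response, handling
// raw JSON, JSON embedded in prose, and markdown code fences.
// Simplified sketch; the real implementation may handle more cases.
func extractJSON(s string) (map[string]any, error) {
	// Strip a ```json ... ``` fence if present.
	if i := strings.Index(s, "```"); i >= 0 {
		s = strings.TrimPrefix(s[i+3:], "json")
		if j := strings.Index(s, "```"); j >= 0 {
			s = s[:j]
		}
	}
	// Take the outermost {...} span, ignoring surrounding prose.
	start, end := strings.Index(s, "{"), strings.LastIndex(s, "}")
	if start < 0 || end <= start {
		return nil, fmt.Errorf("no JSON object found")
	}
	var out map[string]any
	if err := json.Unmarshal([]byte(s[start:end+1]), &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	m, err := extractJSON("Here are the scores: {\"Sovereignty\": 8}")
	fmt.Println(m, err)
}
```

Taking the outermost brace pair is deliberately forgiving: judge models frequently wrap their JSON in commentary, and a strict parse of the full response would reject most otherwise-valid outputs.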
Six scoring methods are available:
| Method | Suite | Dimensions |
|---|---|---|
| ScoreSemantic | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept |
| ScoreContent | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister |
| ScoreCapability | (agent) | Reasoning, Correctness, Clarity |
| ScoreTruthfulQA | standard | Truthfulness, Informativeness |
| ScoreDoNotAnswer | standard | Safety, Nuance |
| ScoreToxigen | standard | Kindness, Awareness |
Exact match (exact.go)
scoreGSM8K extracts numeric answers from free-text responses using pattern matching. Returns *StandardScores with Correct, Extracted, and Expected fields. No LLM required.
Capability probes (probes.go)
23 binary pass/fail tests across five categories. Each probe is a Prompt string paired with a Check func(response string) bool. No judge model is required — all checks use string matching or regex on the raw response.
| Category | Probes | Examples |
|---|---|---|
| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | 347×29, circle area, Fibonacci |
| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions |
| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug |
| Word problems (2) | word | speed/distance, sibling counting |
StripThinkBlocks() removes <think>...</think> sections from DeepSeek R1 responses before checking.
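Removing think blocks is a single regex pass. A minimal sketch of that behaviour, assuming well-formed paired tags (the real StripThinkBlocks may also handle unclosed blocks):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// thinkRE matches <think>...</think> spans, including newlines
// inside the block ((?s) makes . match newlines).
var thinkRE = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripThink removes reasoning blocks so probe checks see only the
// final answer — the behaviour described above for DeepSeek R1 output.
func stripThink(s string) string {
	return strings.TrimSpace(thinkRE.ReplaceAllString(s, ""))
}

func main() {
	fmt.Println(stripThink("<think>carry the 1...\nso 10063</think>The answer is 10063."))
}
```

The non-greedy `.*?` matters: with a greedy match, a response containing two think blocks would lose everything between the first open tag and the last close tag.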
Agent Orchestrator
The agent subsystem (agent_*.go) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3").
Files
| File | LOC | Responsibility |
|---|---|---|
| agent_config.go | 97 | AgentConfig, Checkpoint, BaseModelMap, ModelFamilies, AdapterMeta() |
| agent_execute.go | 215 | RunAgentLoop, DiscoverCheckpoints, FindUnscored, ProcessOne |
| agent_eval.go | 397 | MLX-native and conversion evaluation paths, capability and content probe runners |
| agent_influx.go | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay |
| agent_ssh.go | 102 | RemoteTransport interface, SSHTransport implementation, utility helpers |
Workflow
RunAgentLoop
    │
    ├── ReplayInfluxBuffer  (flush any buffered writes from previous failures)
    ├── DiscoverCheckpoints (SSH ls on M3 adapter directories)
    ├── GetScoredLabels     (InfluxDB query for already-scored (run_id, label) pairs)
    ├── FindUnscored        (set difference, sorted by dirname + iteration)
    └── ProcessOne          (for each unscored checkpoint)
            │
            ├── isMLXNative? YES → processMLXNative      (serve directly via mlx_lm.server)
            │                NO  → processWithConversion (MLX→GGUF, then llama-server)
            │
            ├── RunCapabilityProbes    (23 binary probes)
            ├── RunContentProbes       (sovereignty probes)
            ├── ScoreCapabilityAndPush (judge + InfluxDB)
            └── ScoreContentAndPush    (judge + InfluxDB)
RemoteTransport
RemoteTransport abstracts SSH/SCP so that tests can supply an in-memory fake:
type RemoteTransport interface {
	Run(ctx context.Context, cmd string) (string, error)
	CopyFrom(ctx context.Context, remote, local string) error
	CopyTo(ctx context.Context, local, remote string) error
}
SSHTransport implements this interface using the system ssh and scp binaries with a configurable port and timeout. AgentConfig.Transport is lazily initialised: if nil, an SSHTransport is constructed from M3Host, M3User, and M3SSHKey.
Checkpoint discovery
DiscoverCheckpoints runs ls -d adapters-* on the remote host, then for each adapter directory checks for subdirectories matching gemma-3-* (supporting nested directory layouts). It then lists *_adapters.safetensors files and extracts the iteration number from the filename.
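Extracting the iteration number from a checkpoint filename is a small regex job. The exact filename scheme below (zero-padded iteration before `_adapters.safetensors`) is an assumption based on the description above:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// iterRE captures the iteration digits from names like
// "0000600_adapters.safetensors" (assumed naming scheme).
var iterRE = regexp.MustCompile(`(\d+)_adapters\.safetensors$`)

// iterationOf returns the iteration number encoded in a checkpoint
// filename, or false if the name does not match the pattern.
func iterationOf(name string) (int, bool) {
	m := iterRE.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	return n, err == nil
}

func main() {
	n, ok := iterationOf("0000600_adapters.safetensors")
	fmt.Println(n, ok)
}
```

Parsing to int (rather than comparing the zero-padded strings) gives the natural numeric ordering that FindUnscored needs when sorting by iteration.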
AdapterMeta maps a directory name to a (model_tag, label_prefix, run_id_stem) triple using prefix matching against ModelFamilies.
Persistence
Results are written to two stores simultaneously:
- InfluxDB — line protocol over HTTP. Five measurements: capability_score, capability_judge, content_score, probe_score, training_loss.
- DuckDB — local analytical database. Two tables: checkpoint_scores, probe_results.
If InfluxDB is unreachable, results are buffered to influx_buffer.jsonl (JSONL, one entry per line). ReplayInfluxBuffer is called at the start of each loop iteration to flush the buffer.
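The replay step amounts to decoding one JSON entry per line and re-pushing each. The sketch below illustrates that loop under assumed shapes (Entry is hypothetical, and file handling plus truncation-on-success are omitted):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Entry is a hypothetical buffered write; the real buffer format
// in influx_buffer.jsonl may carry more fields.
type Entry struct {
	Line string `json:"line"` // InfluxDB line-protocol payload
}

// replay decodes one JSON entry per line and hands each to push,
// mirroring the ReplayInfluxBuffer behaviour described above.
// It stops at the first decode or push error so nothing is dropped.
func replay(buf string, push func(Entry) error) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(buf))
	n := 0
	for sc.Scan() {
		var e Entry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			return n, err
		}
		if err := push(e); err != nil {
			return n, err
		}
		n++
	}
	return n, sc.Err()
}

func main() {
	buf := `{"line":"capability_score v=1"}` + "\n" + `{"line":"probe_score v=0"}`
	n, err := replay(buf, func(Entry) error { return nil })
	fmt.Println(n, err)
}
```

Stopping at the first push failure is the conservative choice: entries that could not be delivered remain in the buffer for the next loop iteration rather than being silently lost.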
Data Pipeline
| File | Purpose |
|---|---|
| ingest.go | Load JSONL response files into []Response slices |
| db.go | DuckDB schema creation, insert, and query helpers |
| influx.go | InfluxDB HTTP client (line protocol write, SQL query) |
| gguf.go | GGUF file format parsing (magic, version, metadata, tensor inventory) |
| worker.go | LEM API worker for distributed inference job dispatch |
| expand.go | Prompt expansion using a backend |
| normalize.go | Response normalisation utilities |
| parquet.go | Parquet dataset export |
Test Coverage
| File | Tests | What is covered |
|---|---|---|
| adapter_test.go | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors |
| backend_http_test.go | — | HTTPBackend: generate, chat, retries, status codes |
| backend_http_textmodel_test.go | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, classify, batch |
| backend_llama_test.go | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants |
| backend_mlx_test.go | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion |
| heuristic_test.go | — | All nine heuristic sub-scores and LEK formula |
| judge_test.go | — | JSON extraction variants, ScoreSemantic, ScoreContent |
| exact_test.go | — | Numeric extraction patterns |
| probes_test.go | — | All 23 capability probe Check functions |
| score_test.go | — | Engine suite selection, ScoreAll grouping |
| score_race_test.go | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes |
| agent_test.go | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport |
| benchmark_test.go | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), ScoreAll (2 modes), heuristic sub-components (5 stages) |