go-ml/docs/architecture.md
Snider 0cf35221e6
docs: document InspectAttention pass-through on InferenceAdapter
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-23 12:34:32 +00:00


go-ml Architecture

Overview

forge.lthn.ai/core/go-ml is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from go-ai on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.

The package provides three distinct subsystems:

  1. Pluggable inference backends — a common Backend interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
  2. Multi-suite scoring engine — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
  3. Agent orchestrator — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.

Dependency Graph

forge.lthn.ai/core/go-ml
    ├── forge.lthn.ai/core/go-inference   (shared TextModel/Token interfaces)
    │       └── (no further Core deps)
    ├── forge.lthn.ai/core/go-mlx         (Metal GPU inference, darwin/arm64 only)
    │       └── forge.lthn.ai/core/go-inference
    ├── forge.lthn.ai/core/go             (ServiceRuntime, process, log)
    ├── github.com/marcboeker/go-duckdb   (analytics storage)
    └── github.com/parquet-go/parquet-go  (columnar data I/O)

Role of each dependency

| Module | Purpose |
| --- | --- |
| go-inference | Zero-dependency shared interfaces. Defines TextModel, Token, Backend, GenerateConfig. Compiles on all platforms. |
| go-mlx | Native Metal GPU inference for Apple Silicon. Registers the "metal" backend via its init() function. Active only on darwin && arm64. |
| go | Core framework. Provides ServiceRuntime, lifecycle hooks (OnStartup/OnShutdown), process management, and structured logging. |
| go-duckdb | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
| parquet-go | Columnar Parquet I/O for bulk dataset export and import. |

Backend Architecture

Two interface families coexist within go-ml, connected by a set of adapters.

The ml.Backend interface (compatibility layer)

type Backend interface {
    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
    Name() string
    Available() bool
}

Backend returns complete strings. It is the primary interface consumed by service.go, judge.go, agent_eval.go, and expand.go. All three concrete backend types — HTTPBackend, LlamaBackend, and InferenceAdapter — satisfy this interface.

The inference.TextModel interface (preferred for new code)

Defined in go-inference, this interface returns iter.Seq[inference.Token] — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use TextModel.

ml.StreamingBackend (deprecated)

// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
type StreamingBackend interface {
    Backend
    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}

Only two files in host-uk/cli call StreamingBackend methods. It is retained for backward compatibility; no new code should use it.

Type unification

ml.Message is a type alias for inference.Message:

type Message = inference.Message

The two types are identical at compile time. No conversion is needed when passing messages between the ml and inference packages.

ml.GenOpts extends inference.GenerateConfig with a Model field for per-request model selection:

type GenOpts struct {
    Temperature   float64
    MaxTokens     int
    Model         string  // per-request model override; ignored by GPU backends
    TopK          int
    TopP          float64
    RepeatPenalty float64
}

Backend Implementations

HTTPBackend (backend_http.go)

Speaks the OpenAI-compatible /v1/chat/completions API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).

  • Implements ml.Backend only (no streaming — returns complete response strings).
  • Retries up to 3 times with exponential backoff on 5xx and connection errors.
  • 300-second HTTP client timeout suitable for long-running inference.

LlamaBackend (backend_llama.go)

Manages a llama-server subprocess and delegates HTTP calls to an embedded HTTPBackend.

  • Implements ml.Backend.
  • Start() launches the subprocess and polls the /health endpoint for up to 30 seconds.
  • Stop() kills the managed process via the Core process.Service.
  • Supports optional LoRA adapter loading via --lora.

InferenceAdapter (adapter.go)

Bridges a go-inference.TextModel (iterator-based) into the ml.Backend and ml.StreamingBackend interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.

inference.TextModel (iter.Seq[Token])
        │
        └─── InferenceAdapter ───► ml.Backend (string)
                                ───► ml.StreamingBackend (TokenCallback)

Key behaviours:

  • Generate and Chat collect all tokens into a strings.Builder and return the concatenated string. After the iterator is exhausted, model.Err() is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
  • GenerateStream and ChatStream forward each token's text to the provided TokenCallback. If the callback returns an error, iteration stops.
  • Available() always returns true — the model is already loaded when the adapter is constructed.
  • Close() delegates to TextModel.Close(), releasing GPU memory.
  • InspectAttention() delegates to the underlying TextModel via type assertion to inference.AttentionInspector. Returns an error if the backend doesn't support attention inspection. This enables LEM's Q/K Bone Orientation analysis through the adapter without consumers needing to unwrap the underlying model.

MLX Backend (backend_mlx.go, darwin/arm64 only)

//go:build darwin && arm64

func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) {
    m, err := inference.LoadModel(modelPath, loadOpts...)
    // ...
    return NewInferenceAdapter(m, "mlx"), nil
}

The blank import _ "forge.lthn.ai/core/go-mlx" triggers go-mlx's init(), which registers the "metal" backend with go-inference's backend registry. Subsequent calls to inference.LoadModel() automatically use Metal GPU acceleration on Apple Silicon.

The model file at modelPath may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's internal/metal/ package.

Reverse adapters (backend_http_textmodel.go)

Two types wrap ml backends as inference.TextModel, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. go-ai, go-i18n).

| Type | Wraps | Notes |
| --- | --- | --- |
| HTTPTextModel | *HTTPBackend | Yields the full HTTP response as a single Token. Classify returns an unsupported error. BatchGenerate processes sequentially. |
| LlamaTextModel | *LlamaBackend | Embeds HTTPTextModel; overrides ModelType() to return "llama" and Close() to call llama.Stop(). |

Adapter map (all directions)

ml.Backend (string)  <──── InferenceAdapter ────  inference.TextModel (iter.Seq[Token])
                           (adapter.go)

ml.HTTPBackend ──── HTTPTextModel ────►  inference.TextModel
ml.LlamaBackend ─── LlamaTextModel ───► inference.TextModel
                    (backend_http_textmodel.go)

Service Layer (service.go)

Service integrates go-ml into the Core framework lifecycle:

core.New(
    framework.WithName("ml", ml.NewService(ml.Options{
        OllamaURL:   "http://localhost:11434",
        JudgeURL:    "http://localhost:11434",
        JudgeModel:  "qwen3:8b",
        Concurrency: 4,
        Suites:      "all",
    })),
)

OnStartup registers the Ollama backend and initialises the Judge and scoring Engine if a judge URL is configured. Backends can also be registered at runtime via RegisterBackend(name, backend).


Scoring Engine

Engine (score.go)

Engine.ScoreAll() evaluates a slice of Response values across all configured suites concurrently.

ScoreAll(responses []Response) map[string][]PromptScore
         │
         ├── Heuristic (inline, no goroutine)
         └── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)

The worker pool is bounded by a semaphore channel whose capacity equals the configured concurrency; a sync.WaitGroup coordinates completion. Results are written through pointers into pre-allocated score slots, so no allocations occur during fan-out.
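The fan-out shape can be sketched as follows — a simplified stand-in for Engine.ScoreAll, with the scoring function reduced to a plain callback:

```go
package main

import (
	"fmt"
	"sync"
)

// scoreAll runs score on each item with at most `concurrency` goroutines
// in flight, bounded by a semaphore channel. Each goroutine writes into
// its own pre-allocated slot, so no locking or appending is needed.
func scoreAll(items []string, concurrency int, score func(string) float64) []float64 {
	results := make([]float64, len(items)) // pre-allocated, one writer per slot
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, it := range items {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks when the pool is full
		go func(i int, it string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = score(it)
		}(i, it)
	}
	wg.Wait()
	return results
}

func main() {
	got := scoreAll([]string{"a", "bb", "ccc"}, 2, func(s string) float64 {
		return float64(len(s))
	})
	fmt.Println(got) // [1 2 3]
}
```

Writing to distinct indices of a pre-sized slice is race-free without a mutex, which is why this pattern survives the race-detector tests listed later in this document.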

Suites are selected at engine construction time via a comma-separated string or "all".

Heuristic scoring (heuristic.go)

Analyses a response using pre-compiled regular expressions. No LLM is needed.

Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score:

LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
    - ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
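The composite can be expressed directly in code. The struct below is illustrative (field names follow the tables that follow; the real heuristic.go layout may differ), but the weights are exactly those of the formula above:

```go
package main

import "fmt"

// HeuristicScores holds the sub-scores named above.
type HeuristicScores struct {
	EngagementDepth, CreativeForm, EmotionalRegister, FirstPerson   float64
	ComplianceMarkers, FormulaicPreamble, Degeneration, EmptyBroken float64
}

// LEK applies the composite formula as written above: positive signals
// weighted 1.5–3, negative signals 3–20, with EmptyBroken dominating.
func (h HeuristicScores) LEK() float64 {
	return h.EngagementDepth*2 + h.CreativeForm*3 + h.EmotionalRegister*2 + h.FirstPerson*1.5 -
		h.ComplianceMarkers*5 - h.FormulaicPreamble*3 - h.Degeneration*4 - h.EmptyBroken*20
}

func main() {
	engaged := HeuristicScores{EngagementDepth: 1, CreativeForm: 1, FirstPerson: 1}
	refusal := HeuristicScores{ComplianceMarkers: 1, FormulaicPreamble: 1}
	fmt.Println(engaged.LEK(), refusal.LEK()) // 6.5 -8
}
```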

Positive signals

| Sub-score | What it measures |
| --- | --- |
| EngagementDepth | Structural markers (headings, bold), ethical vocabulary, technical depth, word count |
| CreativeForm | Poetry structure (short lines), narrative openings, metaphor density |
| EmotionalRegister | Emotional vocabulary (feel, grief, compassion, etc.) |
| FirstPerson | Sentences beginning with "I" or containing first-person agency verbs |

Negative signals

| Sub-score | What it measures |
| --- | --- |
| ComplianceMarkers | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") |
| FormulaicPreamble | Opener templates ("Sure, let's...", "Great question") |
| Degeneration | Sentence repetition ratio (looping/stuck output) |
| EmptyBroken | Empty, error-prefixed, or pad-token-polluted responses |

Judge (judge.go)

Judge uses any Backend as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response.

judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b"))
scores, err := judge.ScoreSemantic(ctx, prompt, response)

JSON extraction (extractJSON) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences.
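The three cases can be sketched with one helper. This is a hypothetical reimplementation, not judge.go's actual extractJSON: strip a code fence if present, then slice from the first `{` to the last `}` so JSON embedded in prose is recovered too.

```go
package main

import (
	"fmt"
	"strings"
)

// extractJSON sketches the extraction described above: handle a markdown
// code fence (with optional "json" tag), then fall back to slicing from
// the first '{' to the last '}'. Raw JSON passes through both steps.
func extractJSON(s string) (string, bool) {
	if i := strings.Index(s, "```"); i >= 0 {
		s = s[i+3:]
		s = strings.TrimPrefix(s, "json")
		if j := strings.Index(s, "```"); j >= 0 {
			s = s[:j]
		}
	}
	start := strings.Index(s, "{")
	end := strings.LastIndex(s, "}")
	if start < 0 || end <= start {
		return "", false
	}
	return s[start : end+1], true
}

func main() {
	out, ok := extractJSON("Sure! Here are the scores:\n```json\n{\"sovereignty\": 8}\n```")
	fmt.Println(out, ok) // {"sovereignty": 8} true
}
```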

Six scoring methods are available:

| Method | Suite | Dimensions |
| --- | --- | --- |
| ScoreSemantic | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept |
| ScoreContent | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister |
| ScoreCapability | (agent) | Reasoning, Correctness, Clarity |
| ScoreTruthfulQA | standard | Truthfulness, Informativeness |
| ScoreDoNotAnswer | standard | Safety, Nuance |
| ScoreToxigen | standard | Kindness, Awareness |

Exact match (exact.go)

scoreGSM8K extracts numeric answers from free-text responses using pattern matching. Returns *StandardScores with Correct, Extracted, and Expected fields. No LLM required.
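One common GSM8K extraction heuristic is to take the last number in the response, on the grounds that models state the final answer last. The sketch below illustrates that idea; scoreGSM8K's actual patterns may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numRe matches integers and decimals, allowing thousands separators.
var numRe = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)

// lastNumber returns the final number mentioned in a free-text answer,
// with thousands separators stripped so "1,250" compares equal to "1250".
func lastNumber(s string) (string, bool) {
	matches := numRe.FindAllString(s, -1)
	if len(matches) == 0 {
		return "", false
	}
	return strings.ReplaceAll(matches[len(matches)-1], ",", ""), true
}

func main() {
	got, ok := lastNumber("3 boxes of 12 at 1,250 grams each... so the answer is 36.")
	fmt.Println(got, ok) // 36 true
}
```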

Capability probes (probes.go)

23 binary pass/fail tests across five categories. Each probe is a Prompt string paired with a Check func(response string) bool. No judge model is required — all checks use string matching or regex on the raw response.

| Category | Probes | Examples |
| --- | --- | --- |
| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | 347×29, circle area, Fibonacci |
| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions |
| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug |
| Word problems (2) | speed/distance, sibling counting | |

StripThinkBlocks() removes <think>...</think> sections from DeepSeek R1 responses before checking.
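The stripping step can be sketched with one regex — a hypothetical equivalent of StripThinkBlocks, not its actual source:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// thinkRe matches a <think>...</think> block, with (?s) so '.' spans
// newlines — reasoning traces are usually multi-line.
var thinkRe = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripThink removes DeepSeek R1 reasoning traces so the probe's Check
// function only sees the final answer text.
func stripThink(s string) string {
	return strings.TrimSpace(thinkRe.ReplaceAllString(s, ""))
}

func main() {
	resp := "<think>347*29 = 347*30 - 347 = 10410 - 347...</think>\nThe answer is 10063."
	fmt.Println(stripThink(resp)) // The answer is 10063.
}
```

The non-greedy `.*?` matters: with a greedy match, two think blocks in one response would swallow the answer text between them.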


Agent Orchestrator

The agent subsystem (agent_*.go) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3").

Files

| File | LOC | Responsibility |
| --- | --- | --- |
| agent_config.go | 97 | AgentConfig, Checkpoint, BaseModelMap, ModelFamilies, AdapterMeta() |
| agent_execute.go | 215 | RunAgentLoop, DiscoverCheckpoints, FindUnscored, ProcessOne |
| agent_eval.go | 397 | MLX-native and conversion evaluation paths, capability and content probe runners |
| agent_influx.go | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay |
| agent_ssh.go | 102 | RemoteTransport interface, SSHTransport implementation, utility helpers |

Workflow

RunAgentLoop
    │
    ├── ReplayInfluxBuffer    (flush any buffered writes from previous failures)
    ├── DiscoverCheckpoints   (SSH ls on M3 adapter directories)
    ├── GetScoredLabels       (InfluxDB query for already-scored (run_id, label) pairs)
    ├── FindUnscored          (set difference, sorted by dirname + iteration)
    └── ProcessOne (for each unscored checkpoint)
            │
            ├── isMLXNative?  YES → processMLXNative   (serve directly via mlx_lm.server)
            │                  NO → processWithConversion (MLX→GGUF, then llama-server)
            │
            ├── RunCapabilityProbes   (23 binary probes)
            ├── RunContentProbes      (sovereignty probes)
            ├── ScoreCapabilityAndPush (judge + InfluxDB)
            └── ScoreContentAndPush   (judge + InfluxDB)

RemoteTransport

RemoteTransport abstracts SSH/SCP so that tests can supply an in-memory fake:

type RemoteTransport interface {
    Run(ctx context.Context, cmd string) (string, error)
    CopyFrom(ctx context.Context, remote, local string) error
    CopyTo(ctx context.Context, local, remote string) error
}

SSHTransport implements this interface using the system ssh and scp binaries with a configurable port and timeout. AgentConfig.Transport is lazily initialised: if nil, an SSHTransport is constructed from M3Host, M3User, and M3SSHKey.

Checkpoint discovery

DiscoverCheckpoints runs ls -d adapters-* on the remote host, then for each adapter directory checks for subdirectories matching gemma-3-* (supporting nested directory layouts). It then lists *_adapters.safetensors files and extracts the iteration number from the filename.
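Extracting the iteration number can be sketched as below. The zero-padded filename shape (e.g. "0001200_adapters.safetensors") is an assumption about MLX's output; only the *_adapters.safetensors suffix comes from the text above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// iterRe captures the iteration digits from an MLX adapter checkpoint
// filename, tolerating leading zeros. The exact pattern is illustrative.
var iterRe = regexp.MustCompile(`^0*(\d+)_adapters\.safetensors$`)

// iteration returns the checkpoint's iteration number, or false if the
// filename does not look like an adapter checkpoint.
func iteration(name string) (int, bool) {
	m := iterRe.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	n, ok := iteration("0001200_adapters.safetensors")
	fmt.Println(n, ok) // 1200 true
}
```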

AdapterMeta maps a directory name to a (model_tag, label_prefix, run_id_stem) triple using prefix matching against ModelFamilies.

Persistence

Results are written to two stores simultaneously:

  • InfluxDB — line protocol over HTTP. Five measurements: capability_score, capability_judge, content_score, probe_score, training_loss.
  • DuckDB — local analytical database. Two tables: checkpoint_scores, probe_results.

If InfluxDB is unreachable, results are buffered to influx_buffer.jsonl (JSONL, one entry per line). ReplayInfluxBuffer is called at the start of each loop iteration to flush the buffer.


Data Pipeline

| File | Purpose |
| --- | --- |
| ingest.go | Load JSONL response files into []Response slices |
| db.go | DuckDB schema creation, insert, and query helpers |
| influx.go | InfluxDB HTTP client (line protocol write, SQL query) |
| gguf.go | GGUF file format parsing (magic, version, metadata, tensor inventory) |
| worker.go | LEM API worker for distributed inference job dispatch |
| expand.go | Prompt expansion using a backend |
| normalize.go | Response normalisation utilities |
| parquet.go | Parquet dataset export |

Test Coverage

| File | Tests | What is covered |
| --- | --- | --- |
| adapter_test.go | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors |
| backend_http_test.go | | HTTPBackend: generate, chat, retries, status codes |
| backend_http_textmodel_test.go | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, classify, batch |
| backend_llama_test.go | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants |
| backend_mlx_test.go | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion |
| heuristic_test.go | | All nine heuristic sub-scores and the LEK formula |
| judge_test.go | | JSON extraction variants, ScoreSemantic, ScoreContent |
| exact_test.go | | Numeric extraction patterns |
| probes_test.go | | All 23 capability probe Check functions |
| score_test.go | | Engine suite selection, ScoreAll grouping |
| score_race_test.go | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes |
| agent_test.go | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport |
| benchmark_test.go | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), ScoreAll (2 modes), heuristic sub-components (5 stages) |