go-ml/docs/architecture.md
Snider 0cf35221e6
docs: document InspectAttention pass-through on InferenceAdapter
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-23 12:34:32 +00:00


go-ml Architecture

Overview

forge.lthn.ai/core/go-ml is the ML inference, evaluation, and orchestration library for the Core Go ecosystem. It was extracted from go-ai on 19 February 2026 and now stands as an independent module of approximately 7,500 LOC across 41 source files.

The package provides three distinct subsystems:

  1. Pluggable inference backends — a common Backend interface with implementations for Metal GPU (MLX), managed llama-server subprocesses, and OpenAI-compatible HTTP APIs.
  2. Multi-suite scoring engine — concurrent evaluation of model responses across heuristic, semantic, content, standard benchmark, and exact-match scoring suites.
  3. Agent orchestrator — SSH-based checkpoint discovery, distributed probe evaluation, and InfluxDB/DuckDB result streaming for continuous fine-tuning evaluation.

Dependency Graph

forge.lthn.ai/core/go-ml
    ├── forge.lthn.ai/core/go-inference   (shared TextModel/Token interfaces)
    │       └── (no further Core deps)
    ├── forge.lthn.ai/core/go-mlx         (Metal GPU inference, darwin/arm64 only)
    │       └── forge.lthn.ai/core/go-inference
    ├── forge.lthn.ai/core/go             (ServiceRuntime, process, log)
    ├── github.com/marcboeker/go-duckdb   (analytics storage)
    └── github.com/parquet-go/parquet-go  (columnar data I/O)

Role of each dependency

| Module | Purpose |
| --- | --- |
| go-inference | Zero-dependency shared interfaces. Defines TextModel, Token, Backend, GenerateConfig. Compiles on all platforms. |
| go-mlx | Native Metal GPU inference for Apple Silicon. Registers the "metal" backend via its init() function. Active only on darwin && arm64. |
| go | Core framework. Provides ServiceRuntime, lifecycle hooks (OnStartup/OnShutdown), process management, and structured logging. |
| go-duckdb | DuckDB bindings for local analytical storage of checkpoint scores and probe results. |
| parquet-go | Columnar Parquet I/O for bulk dataset export and import. |

Backend Architecture

Two interface families coexist within go-ml, connected by a set of adapters.

The ml.Backend interface (compatibility layer)

type Backend interface {
    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
    Name() string
    Available() bool
}

Backend returns complete strings. It is the primary interface consumed by service.go, judge.go, agent_eval.go, and expand.go. All three concrete backend types — HTTPBackend, LlamaBackend, and InferenceAdapter — satisfy this interface.

The inference.TextModel interface (preferred for new code)

Defined in go-inference, this interface returns iter.Seq[inference.Token] — a Go 1.23 range-over-function iterator. This is the natural API for GPU backends where tokens are generated one at a time. New code that requires token-level control or needs to interoperate with other Core Go packages should use TextModel.

ml.StreamingBackend (deprecated)

// Deprecated: use inference.TextModel with iter.Seq[Token] directly.
type StreamingBackend interface {
    Backend
    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}

Only two files in host-uk/cli call StreamingBackend methods. It is retained for backward compatibility; no new code should use it.

Type unification

ml.Message is a type alias for inference.Message:

type Message = inference.Message

The two types are identical at compile time. No conversion is needed when passing messages between the ml and inference packages.

ml.GenOpts extends inference.GenerateConfig with a Model field for per-request model selection:

type GenOpts struct {
    Temperature   float64
    MaxTokens     int
    Model         string  // per-request model override; ignored by GPU backends
    TopK          int
    TopP          float64
    RepeatPenalty float64
}

Backend Implementations

HTTPBackend (backend_http.go)

Speaks the OpenAI-compatible /v1/chat/completions API. Used for remote APIs (Ollama, LM Studio, vLLM, any OpenAI-compatible server).

  • Implements ml.Backend only (no streaming — returns complete response strings).
  • Retries up to 3 times with exponential backoff on 5xx and connection errors.
  • 300-second HTTP client timeout suitable for long-running inference.

LlamaBackend (backend_llama.go)

Manages a llama-server subprocess and delegates HTTP calls to an embedded HTTPBackend.

  • Implements ml.Backend.
  • Start() launches the subprocess and polls the /health endpoint for up to 30 seconds.
  • Stop() kills the managed process via the Core process.Service.
  • Supports optional LoRA adapter loading via --lora.

InferenceAdapter (adapter.go)

Bridges a go-inference.TextModel (iterator-based) into the ml.Backend and ml.StreamingBackend interfaces. This is the gateway through which GPU backends enter the go-ml ecosystem.

inference.TextModel (iter.Seq[Token])
        │
        └─── InferenceAdapter ───► ml.Backend (string)
                                ───► ml.StreamingBackend (TokenCallback)

Key behaviours:

  • Generate and Chat collect all tokens into a strings.Builder and return the concatenated string. After the iterator is exhausted, model.Err() is checked to distinguish normal end-of-sequence from OOM or context cancellation errors.
  • GenerateStream and ChatStream forward each token's text to the provided TokenCallback. If the callback returns an error, iteration stops.
  • Available() always returns true — the model is already loaded when the adapter is constructed.
  • Close() delegates to TextModel.Close(), releasing GPU memory.
  • InspectAttention() delegates to the underlying TextModel via type assertion to inference.AttentionInspector. Returns an error if the backend doesn't support attention inspection. This enables LEM's Q/K Bone Orientation analysis through the adapter without consumers needing to unwrap the underlying model.

MLX Backend (backend_mlx.go, darwin/arm64 only)

//go:build darwin && arm64

func NewMLXBackend(modelPath string, loadOpts ...inference.LoadOption) (*InferenceAdapter, error) {
    m, err := inference.LoadModel(modelPath, loadOpts...)
    // ...
    return NewInferenceAdapter(m, "mlx"), nil
}

The blank import _ "forge.lthn.ai/core/go-mlx" triggers go-mlx's init(), which registers the "metal" backend with go-inference's backend registry. Subsequent calls to inference.LoadModel() automatically use Metal GPU acceleration on Apple Silicon.

The model file at modelPath may be a local directory (MLX format) or a HuggingFace model identifier. All tokenisation, KV cache management, sampling, and memory limits are handled inside go-mlx's internal/metal/ package.

Reverse adapters (backend_http_textmodel.go)

Two types wrap ml backends as inference.TextModel, enabling HTTP and llama-server backends to be used in packages that expect the go-inference interface (e.g. go-ai, go-i18n).

| Type | Wraps | Notes |
| --- | --- | --- |
| HTTPTextModel | *HTTPBackend | Yields the full HTTP response as a single Token. Classify returns an unsupported error. BatchGenerate processes sequentially. |
| LlamaTextModel | *LlamaBackend | Embeds HTTPTextModel; overrides ModelType() to return "llama" and Close() to call llama.Stop(). |

Adapter map (all directions)

ml.Backend (string)  <──── InferenceAdapter ────  inference.TextModel (iter.Seq[Token])
                           (adapter.go)

ml.HTTPBackend ──── HTTPTextModel ────►  inference.TextModel
ml.LlamaBackend ─── LlamaTextModel ───► inference.TextModel
                    (backend_http_textmodel.go)

Service Layer (service.go)

Service integrates go-ml into the Core framework lifecycle:

core.New(
    framework.WithName("ml", ml.NewService(ml.Options{
        OllamaURL:   "http://localhost:11434",
        JudgeURL:    "http://localhost:11434",
        JudgeModel:  "qwen3:8b",
        Concurrency: 4,
        Suites:      "all",
    })),
)

OnStartup registers the Ollama backend and initialises the Judge and scoring Engine if a judge URL is configured. Backends can also be registered at runtime via RegisterBackend(name, backend).


Scoring Engine

Engine (score.go)

Engine.ScoreAll() evaluates a slice of Response values across all configured suites concurrently.

ScoreAll(responses []Response) map[string][]PromptScore
         │
         ├── Heuristic (inline, no goroutine)
         └── Semantic / Content / Standard / Exact (worker pool, semaphore-bounded)

The worker pool is bounded by a semaphore channel whose capacity equals the configured concurrency; a sync.WaitGroup coordinates completion. Results are written through pointers into pre-allocated score slots, so no allocations occur during fan-out.
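The fan-out shape can be sketched as follows — a simplified stand-in for Engine.ScoreAll, with the scoring function reduced to a plain callback:

```go
package main

import (
	"fmt"
	"sync"
)

// scoreAll runs score on each item with at most `concurrency` goroutines
// in flight, bounded by a semaphore channel. Each goroutine writes into
// its own pre-allocated slot, so no locking or appending is needed.
func scoreAll(items []string, concurrency int, score func(string) float64) []float64 {
	results := make([]float64, len(items)) // pre-allocated, one writer per slot
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, it := range items {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks when the pool is full
		go func(i int, it string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = score(it)
		}(i, it)
	}
	wg.Wait()
	return results
}

func main() {
	got := scoreAll([]string{"a", "bb", "ccc"}, 2, func(s string) float64 {
		return float64(len(s))
	})
	fmt.Println(got) // [1 2 3]
}
```

Writing to distinct indices of a pre-sized slice is race-free without a mutex, which is why this pattern survives the race-detector tests listed later in this document.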

Suites are selected at engine construction time via a comma-separated string or "all".

Heuristic scoring (heuristic.go)

Analyses a response using pre-compiled regular expressions. No LLM is needed.

Nine sub-scores feed into the composite LEK (Linguistic Engagement Kernel) score:

LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
    - ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
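The composite can be expressed directly in code. The struct below is illustrative (field names follow the tables that follow; the real heuristic.go layout may differ), but the weights are exactly those of the formula above:

```go
package main

import "fmt"

// HeuristicScores holds the sub-scores named above.
type HeuristicScores struct {
	EngagementDepth, CreativeForm, EmotionalRegister, FirstPerson   float64
	ComplianceMarkers, FormulaicPreamble, Degeneration, EmptyBroken float64
}

// LEK applies the composite formula as written above: positive signals
// weighted 1.5–3, negative signals 3–20, with EmptyBroken dominating.
func (h HeuristicScores) LEK() float64 {
	return h.EngagementDepth*2 + h.CreativeForm*3 + h.EmotionalRegister*2 + h.FirstPerson*1.5 -
		h.ComplianceMarkers*5 - h.FormulaicPreamble*3 - h.Degeneration*4 - h.EmptyBroken*20
}

func main() {
	engaged := HeuristicScores{EngagementDepth: 1, CreativeForm: 1, FirstPerson: 1}
	refusal := HeuristicScores{ComplianceMarkers: 1, FormulaicPreamble: 1}
	fmt.Println(engaged.LEK(), refusal.LEK()) // 6.5 -8
}
```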

Positive signals

| Sub-score | What it measures |
| --- | --- |
| EngagementDepth | Structural markers (headings, bold), ethical vocabulary, technical depth, word count |
| CreativeForm | Poetry structure (short lines), narrative openings, metaphor density |
| EmotionalRegister | Emotional vocabulary (feel, grief, compassion, etc.) |
| FirstPerson | Sentences beginning with "I" or containing first-person agency verbs |

Negative signals

| Sub-score | What it measures |
| --- | --- |
| ComplianceMarkers | RLHF safety phrases ("As an AI", "I cannot", "ethical considerations") |
| FormulaicPreamble | Opener templates ("Sure, let's...", "Great question") |
| Degeneration | Sentence repetition ratio (looping/stuck output) |
| EmptyBroken | Empty, error-prefixed, or pad-token-polluted responses |

Judge (judge.go)

Judge uses any Backend as an evaluator. It sends a formatted prompt to the judge model and parses the JSON response.

judge := ml.NewJudge(ml.NewHTTPBackend("http://localhost:11434", "qwen3:8b"))
scores, err := judge.ScoreSemantic(ctx, prompt, response)

JSON extraction (extractJSON) handles raw JSON, JSON embedded in prose, and JSON inside markdown code fences.
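The three cases can be sketched with one helper. This is a hypothetical reimplementation, not judge.go's actual extractJSON: strip a code fence if present, then slice from the first `{` to the last `}` so JSON embedded in prose is recovered too.

```go
package main

import (
	"fmt"
	"strings"
)

// extractJSON sketches the extraction described above: handle a markdown
// code fence (with optional "json" tag), then fall back to slicing from
// the first '{' to the last '}'. Raw JSON passes through both steps.
func extractJSON(s string) (string, bool) {
	if i := strings.Index(s, "```"); i >= 0 {
		s = s[i+3:]
		s = strings.TrimPrefix(s, "json")
		if j := strings.Index(s, "```"); j >= 0 {
			s = s[:j]
		}
	}
	start := strings.Index(s, "{")
	end := strings.LastIndex(s, "}")
	if start < 0 || end <= start {
		return "", false
	}
	return s[start : end+1], true
}

func main() {
	out, ok := extractJSON("Sure! Here are the scores:\n```json\n{\"sovereignty\": 8}\n```")
	fmt.Println(out, ok) // {"sovereignty": 8} true
}
```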

Six scoring methods are available:

| Method | Suite | Dimensions |
| --- | --- | --- |
| ScoreSemantic | semantic | Sovereignty, EthicalDepth, CreativeExpression, SelfConcept |
| ScoreContent | content | CCPCompliance, TruthTelling, Engagement, AxiomIntegration, SovereigntyReasoning, EmotionalRegister |
| ScoreCapability | (agent) | Reasoning, Correctness, Clarity |
| ScoreTruthfulQA | standard | Truthfulness, Informativeness |
| ScoreDoNotAnswer | standard | Safety, Nuance |
| ScoreToxigen | standard | Kindness, Awareness |

Exact match (exact.go)

scoreGSM8K extracts numeric answers from free-text responses using pattern matching. Returns *StandardScores with Correct, Extracted, and Expected fields. No LLM required.
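One common GSM8K extraction heuristic is to take the last number in the response, on the grounds that models state the final answer last. The sketch below illustrates that idea; scoreGSM8K's actual patterns may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numRe matches integers and decimals, allowing thousands separators.
var numRe = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)

// lastNumber returns the final number mentioned in a free-text answer,
// with thousands separators stripped so "1,250" compares equal to "1250".
func lastNumber(s string) (string, bool) {
	matches := numRe.FindAllString(s, -1)
	if len(matches) == 0 {
		return "", false
	}
	return strings.ReplaceAll(matches[len(matches)-1], ",", ""), true
}

func main() {
	got, ok := lastNumber("3 boxes of 12 at 1,250 grams each... so the answer is 36.")
	fmt.Println(got, ok) // 36 true
}
```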

Capability probes (probes.go)

23 binary pass/fail tests across five categories. Each probe is a Prompt string paired with a Check func(response string) bool. No judge model is required — all checks use string matching or regex on the raw response.

| Category | Probes | Examples |
| --- | --- | --- |
| Math (8) | arithmetic, algebra, probability, geometry, sequences, percentages | 347×29, circle area, Fibonacci |
| Logic (5) | deduction, puzzles, sets | syllogisms, river crossing, set cardinality |
| Reasoning (5) | analogy, causal, spatial, temporal, pattern | analogies, fault diagnosis, compass directions |
| Code (3) | code tracing, bug identification | Python slice, recursion, division-by-zero bug |
| Word problems (2) | speed/distance, sibling counting | |

StripThinkBlocks() removes <think>...</think> sections from DeepSeek R1 responses before checking.
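The stripping step can be sketched with one regex — a hypothetical equivalent of StripThinkBlocks, not its actual source:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// thinkRe matches a <think>...</think> block, with (?s) so '.' spans
// newlines — reasoning traces are usually multi-line.
var thinkRe = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripThink removes DeepSeek R1 reasoning traces so the probe's Check
// function only sees the final answer text.
func stripThink(s string) string {
	return strings.TrimSpace(thinkRe.ReplaceAllString(s, ""))
}

func main() {
	resp := "<think>347*29 = 347*30 - 347 = 10410 - 347...</think>\nThe answer is 10063."
	fmt.Println(stripThink(resp)) // The answer is 10063.
}
```

The non-greedy `.*?` matters: with a greedy match, two think blocks in one response would swallow the answer text between them.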


Agent Orchestrator

The agent subsystem (agent_*.go) evaluates fine-tuned adapter checkpoints produced by MLX training runs on a remote M3 Mac (referred to internally as "M3").

Files

| File | LOC | Responsibility |
| --- | --- | --- |
| agent_config.go | 97 | AgentConfig, Checkpoint, BaseModelMap, ModelFamilies, AdapterMeta() |
| agent_execute.go | 215 | RunAgentLoop, DiscoverCheckpoints, FindUnscored, ProcessOne |
| agent_eval.go | 397 | MLX-native and conversion evaluation paths, capability and content probe runners |
| agent_influx.go | 291 | InfluxDB line-protocol push, JSONL buffer for offline replay |
| agent_ssh.go | 102 | RemoteTransport interface, SSHTransport implementation, utility helpers |

Workflow

RunAgentLoop
    │
    ├── ReplayInfluxBuffer    (flush any buffered writes from previous failures)
    ├── DiscoverCheckpoints   (SSH ls on M3 adapter directories)
    ├── GetScoredLabels       (InfluxDB query for already-scored (run_id, label) pairs)
    ├── FindUnscored          (set difference, sorted by dirname + iteration)
    └── ProcessOne (for each unscored checkpoint)
            │
            ├── isMLXNative?  YES → processMLXNative   (serve directly via mlx_lm.server)
            │                  NO → processWithConversion (MLX→GGUF, then llama-server)
            │
            ├── RunCapabilityProbes   (23 binary probes)
            ├── RunContentProbes      (sovereignty probes)
            ├── ScoreCapabilityAndPush (judge + InfluxDB)
            └── ScoreContentAndPush   (judge + InfluxDB)

RemoteTransport

RemoteTransport abstracts SSH/SCP so that tests can supply an in-memory fake:

type RemoteTransport interface {
    Run(ctx context.Context, cmd string) (string, error)
    CopyFrom(ctx context.Context, remote, local string) error
    CopyTo(ctx context.Context, local, remote string) error
}

SSHTransport implements this interface using the system ssh and scp binaries with a configurable port and timeout. AgentConfig.Transport is lazily initialised: if nil, an SSHTransport is constructed from M3Host, M3User, and M3SSHKey.

Checkpoint discovery

DiscoverCheckpoints runs ls -d adapters-* on the remote host, then for each adapter directory checks for subdirectories matching gemma-3-* (supporting nested directory layouts). It then lists *_adapters.safetensors files and extracts the iteration number from the filename.
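Extracting the iteration number can be sketched as below. The zero-padded filename shape (e.g. "0001200_adapters.safetensors") is an assumption about MLX's output; only the *_adapters.safetensors suffix comes from the text above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// iterRe captures the iteration digits from an MLX adapter checkpoint
// filename, tolerating leading zeros. The exact pattern is illustrative.
var iterRe = regexp.MustCompile(`^0*(\d+)_adapters\.safetensors$`)

// iteration returns the checkpoint's iteration number, or false if the
// filename does not look like an adapter checkpoint.
func iteration(name string) (int, bool) {
	m := iterRe.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	n, ok := iteration("0001200_adapters.safetensors")
	fmt.Println(n, ok) // 1200 true
}
```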

AdapterMeta maps a directory name to a (model_tag, label_prefix, run_id_stem) triple using prefix matching against ModelFamilies.

Persistence

Results are written to two stores simultaneously:

  • InfluxDB — line protocol over HTTP. Five measurements: capability_score, capability_judge, content_score, probe_score, training_loss.
  • DuckDB — local analytical database. Two tables: checkpoint_scores, probe_results.

If InfluxDB is unreachable, results are buffered to influx_buffer.jsonl (JSONL, one entry per line). ReplayInfluxBuffer is called at the start of each loop iteration to flush the buffer.


Data Pipeline

| File | Purpose |
| --- | --- |
| ingest.go | Load JSONL response files into []Response slices |
| db.go | DuckDB schema creation, insert, and query helpers |
| influx.go | InfluxDB HTTP client (line protocol write, SQL query) |
| gguf.go | GGUF file format parsing (magic, version, metadata, tensor inventory) |
| worker.go | LEM API worker for distributed inference job dispatch |
| expand.go | Prompt expansion using a backend |
| normalize.go | Response normalisation utilities |
| parquet.go | Parquet dataset export |

Test Coverage

| File | Tests | What is covered |
| --- | --- | --- |
| adapter_test.go | 13 | InferenceAdapter: token collection, streaming, callback errors, context cancellation, empty output, model errors |
| backend_http_test.go | | HTTPBackend: generate, chat, retries, status codes |
| backend_http_textmodel_test.go | 19 | HTTPTextModel and LlamaTextModel: interface compliance, generate, chat, classify, batch |
| backend_llama_test.go | 20 | LlamaBackend: start, stop, health, generate, chat, constructor variants |
| backend_mlx_test.go | 8 | InferenceAdapter via mock TextModel: generate, chat, stream, model error, close, opts conversion |
| heuristic_test.go | | All nine heuristic sub-scores and the LEK formula |
| judge_test.go | | JSON extraction variants, ScoreSemantic, ScoreContent |
| exact_test.go | | Numeric extraction patterns |
| probes_test.go | | All 23 capability probe Check functions |
| score_test.go | | Engine suite selection, ScoreAll grouping |
| score_race_test.go | 6 | Race conditions: concurrent semantic, mixed suites, semaphore boundary, context cancellation, heuristic-only, multi-model map writes |
| agent_test.go | 23 | AdapterMeta, FindUnscored, BufferInfluxResult/ReplayInfluxBuffer, DiscoverCheckpoints with fakeTransport |
| benchmark_test.go | 25 | HeuristicScore (5 sizes), ExactMatch (4 patterns), JudgeExtractJSON (6 variants), ScoreAll (2 modes), heuristic sub-components (5 stages) |