docs: graduate TODO/FINDINGS into production documentation

Replace internal task tracking (TODO.md, FINDINGS.md) with structured
documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider 2026-02-20 15:01:55 +00:00
parent c91e30599c
commit 74b3e7a53d
6 changed files with 710 additions and 118 deletions

CLAUDE.md

@@ -4,7 +4,7 @@
 Shared inference interfaces for the Core Go ecosystem. Module: `forge.lthn.ai/core/go-inference`
-This package defines the contract between GPU-specific backends (go-mlx on macOS, go-rocm on Linux) and consumers (go-ml, go-ai, go-i18n). It has **zero dependencies** and compiles on all platforms.
+Zero dependencies. Compiles on all platforms. See `docs/architecture.md` for design rationale.
 ## Commands
@@ -13,64 +13,34 @@ go test ./... # Run all tests
 go vet ./... # Vet
 ```
-## Architecture
+## Stability Rules
-```
-go-inference (this package) ← defines TextModel, Backend, Token, Message
-        ↑                      ↑
-        │                      │
-go-mlx (darwin/arm64)   go-rocm (linux/amd64)
-        │                      │
-        └────── go-ml ─────────┘ (wraps backends into scoring engine)
-                go-ai (MCP hub)
-```
-This package is the shared contract. Changes here affect go-mlx, go-rocm, and go-ml simultaneously.
-### Key Types
-| Type | Purpose |
-|------|---------|
-| `TextModel` | Core interface: Generate, Chat, Err, Close |
-| `Backend` | Named engine that can LoadModel → TextModel |
-| `Token` | Streaming token (ID + Text) |
-| `Message` | Chat message (Role + Content) |
-| `GenerateOption` | Functional option for generation (temp, topK, etc.) |
-| `LoadOption` | Functional option for model loading (backend, GPU layers, etc.) |
-### Backend Registry
-Backends register via `init()` with build tags. Consumers call `LoadModel()` which auto-selects the best available backend:
-```go
-// Auto-detect: Metal on macOS, ROCm on Linux
-m, err := inference.LoadModel("/path/to/model/")
-// Explicit backend
-m, err := inference.LoadModel("/path/", inference.WithBackend("rocm"))
-```
+- Never change existing method signatures on `TextModel` or `Backend`
+- Only add methods when two or more consumers need them
+- Prefer new interfaces that embed `TextModel` over extending `TextModel` itself
+- New fields on `GenerateConfig` or `LoadConfig` are safe (zero-value defaults)
+- All new interface methods require Virgil approval before merging
 ## Coding Standards
 - UK English
-- Zero external dependencies — stdlib only
-- Tests: testify assert/require
-- Conventional commits
+- Zero external dependencies — stdlib only (testify permitted in tests)
+- Conventional commits: `type(scope): description`
 - Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
 - Licence: EUPL-1.2
 ## Consumers
-- **go-mlx**: Implements `Backend` + `TextModel` for Apple Metal (darwin/arm64)
-- **go-rocm**: Implements `Backend` + `TextModel` for AMD ROCm (linux/amd64)
-- **go-ml**: Wraps inference backends into scoring engine, adds llama.cpp HTTP backend
+- **go-mlx**: implements `Backend` + `TextModel` for Apple Metal (darwin/arm64)
+- **go-rocm**: implements `Backend` + `TextModel` for AMD ROCm (linux/amd64)
+- **go-ml**: wraps inference backends into scoring engine, adds llama.cpp HTTP backend
 - **go-ai**: MCP hub, exposes inference via MCP tools
-- **go-i18n**: Uses TextModel for Gemma3-1B domain classification
+- **go-i18n**: uses `TextModel` for Gemma3-1B domain classification
-## Stability
+## Documentation
-This package is the shared contract. Changes here affect all backends and consumers. Keep the interface minimal and stable. Add new methods only when two or more consumers need them.
-## Task Queue
-See `TODO.md` for prioritised work.
-See `FINDINGS.md` for research notes.
+- `docs/architecture.md` — interfaces, registry, options, design decisions
+- `docs/development.md` — prerequisites, build, test patterns, coding standards
+- `docs/history.md` — completed phases, commit log, known limitations

FINDINGS.md (deleted)

@@ -1,33 +0,0 @@
# FINDINGS.md — go-inference Research & Discovery
---
## 2026-02-19: Package Creation (Virgil)
### Motivation
go-mlx (darwin/arm64) and go-rocm (linux/amd64) both need to implement the same TextModel interface, but go-rocm can't import go-mlx (platform-specific CGO dependency). A shared interface package solves this.
### Alternatives Considered
1. **Duplicate interfaces** — Each backend defines its own TextModel. Simple but diverges over time as backends evolve independently. Rejected.
2. **Shared interface package** (chosen) — `core/go-inference` defines the contract. ~100 LOC, zero deps, compiles everywhere.
3. **Define in go-ml** — go-ml already has Backend/StreamingBackend. But go-ml has heavy deps (DuckDB, Parquet) that backends shouldn't import. Rejected.
### Interface Design Decisions
- **`context.Context` on Generate/Chat**: Required for HTTP handler cancellation, timeouts, graceful shutdown. go-ml's current backend_mlx.go already uses ctx.
- **`Err() error` on TextModel**: iter.Seq can't carry errors. Consumers check Err() after the iterator stops. Pattern matches database/sql Row.Err().
- **`Chat()` on TextModel**: Models own their chat templates (Gemma3, Qwen3, Llama3 all have different formats). Keeping templates in consumers means every consumer duplicates model-specific formatting.
- **`Available() bool` on Backend**: Needed for Default() to skip unavailable backends (e.g. ROCm registered but no GPU present).
- **`GPULayers` in LoadConfig**: ROCm/llama.cpp support partial GPU offload. Metal always does full offload. Default -1 = all layers.
- **`RepeatPenalty` in GenerateConfig**: llama.cpp backends use this heavily. Metal backends can ignore it.
### Consumer Mapping
| Consumer | What it imports | How it uses TextModel |
|----------|----------------|----------------------|
| go-ml | go-inference | Wraps TextModel into its own Backend interface, adds scoring |
| go-ai | go-inference (via go-ml) | Exposes via MCP tools |
| go-i18n | go-inference | Direct: LoadModel → Generate(WithMaxTokens(1)) for classification |
| LEM Lab | go-inference (via go-ml) | Chat streaming for web UI |

TODO.md (37 lines, deleted)

@@ -1,37 +0,0 @@
# TODO.md — go-inference Task Queue
Dispatched from core/go orchestration. This package is minimal by design.
---
## Phase 1: Foundation — `d76448d` (Charon)
- [x] **Add tests for option application** — Verify GenerateConfig defaults, all With* options, ApplyGenerateOpts/ApplyLoadOpts behaviour. Comprehensive API tests (1,074 LOC).
- [x] **Add tests for backend registry** — Register, Get, List, Default priority order, LoadModel routing.
- [x] **Add tests for Default() platform preference** — Verify metal > rocm > llama_cpp ordering.
## Phase 2: Integration — COMPLETE
- [x] **go-mlx migration** — `register_metal.go` implements `inference.Backend` via `metalBackend{}` + `metalAdapter{}` wrapping `internal/metal.Model`. Auto-registers via `inference.Register()` in `init()`. Build-tagged `darwin && arm64`. Full TextModel coverage: Generate, Chat, Classify, BatchGenerate, Info, Metrics, Err, Close.
- [x] **go-rocm implementation** — `register_rocm.go` implements `inference.Backend` + `inference.TextModel` via llama-server subprocess. Auto-registers via `inference.Register(&rocmBackend{})`. Phase 4 complete (5,794 LOC by Charon).
- [x] **go-ml migration** — `adapter.go` bridges `inference.TextModel` → `ml.Backend/StreamingBackend` (118 LOC, 13 tests). `backend_mlx.go` collapsed from 253 to 35 LOC using `inference.LoadModel`. `backend_http_textmodel.go` provides reverse wrappers (135 LOC, 19 tests).
## Phase 3: Extended Interfaces (when needed)
- [ ] **BatchModel interface** — When go-i18n needs 5K sentences/sec, add: `type BatchModel interface { TextModel; BatchGenerate(ctx, []string, ...GenerateOption) iter.Seq2[int, Token] }`. Not before it's needed.
- [ ] **Stats interface** — When LEM Lab dashboard needs metrics: `type StatsModel interface { TextModel; Stats() GenerateStats }` with tokens/sec, peak memory, GPU util.
---
## Design Principles
1. **Minimal interface** — Only add methods when 2+ consumers need them
2. **Zero dependencies** — stdlib only, compiles everywhere
3. **Backwards compatible** — New interfaces extend, never modify existing ones
4. **Platform agnostic** — No build tags, no CGO, no OS-specific code
## Workflow
1. Virgil in core/go manages this package directly (too small for a dedicated Claude)
2. Changes here are coordinated with go-mlx and go-rocm Claudes via their TODO.md
3. New interface methods require Virgil approval before adding

docs/architecture.md (new file, 302 lines)

@@ -0,0 +1,302 @@
# Architecture — go-inference
## Purpose
`go-inference` is the shared interface contract for text generation backends in the Core Go ecosystem. It defines the types that GPU-specific backends implement and consumers depend on, without itself importing any backend or consumer code.
Module path: `forge.lthn.ai/core/go-inference`
## Design Philosophy
### Zero Dependencies
The package imports only the Go standard library (`context`, `fmt`, `iter`, `sync`, `time`, `encoding/json`, `os`, `path/filepath`). The sole exception is `testify` in the test tree.
This is a deliberate constraint. The package sits at the base of a dependency graph where:
- `go-mlx` pulls in CGO bindings against Apple's Metal framework
- `go-rocm` spawns a `llama-server` subprocess with AMD ROCm libraries
- `go-ml` links DuckDB and Parquet
None of those concerns belong in the interface layer. A backend can import `go-inference`; `go-inference` cannot import a backend. A consumer can import `go-inference`; `go-inference` cannot import a consumer.
### Minimal Interface Surface
New methods are only added when two or more existing consumers need them. The interfaces are deliberately narrow. Broader capability is achieved through additional interfaces (`BatchModel`, `StatsModel`) that embed `TextModel`, not through extending `TextModel` itself.
### Platform Agnostic
No build tags, no `//go:build` constraints, no `CGO_ENABLED` requirements appear in this package. It compiles cleanly on macOS, Linux, and Windows regardless of GPU availability.
## Ecosystem Position
```
go-inference (this package) ← defines TextModel, Backend, Token, Message
|
|──────── implemented by ──────────────────────────────
| |
go-mlx go-rocm
(darwin/arm64, Metal GPU) (linux/amd64, AMD ROCm)
| |
└───────────────── consumed by ────────────────────────┘
|
go-ml
(scoring engine, llama.cpp HTTP)
|
go-ai
(MCP hub, 30+ tools)
|
go-i18n
(domain classification via Gemma3-1B)
```
`go-ml` also provides a reverse adapter (`backend_http_textmodel.go`) that wraps an HTTP llama.cpp server as a `TextModel`, giving a third backend path without Metal or ROCm.
## Core Types
### Token
```go
type Token struct {
ID int32
Text string
}
```
The atomic unit of streaming output. `ID` is the vocabulary index; `Text` is the decoded string. Backends yield these through `iter.Seq[Token]`.
### Message
```go
type Message struct {
Role string `json:"role"` // "system", "user", "assistant"
Content string `json:"content"`
}
```
A single turn in a multi-turn conversation. JSON tags are present for serialisation through MCP tool payloads and API responses.
### ClassifyResult
```go
type ClassifyResult struct {
Token Token
Logits []float32
}
```
Output from a single prefill-only forward pass. `Logits` is populated only when `WithLogits()` is set; it is empty by default to avoid allocating vocab-sized float arrays for every classification call.
### BatchResult
```go
type BatchResult struct {
Tokens []Token
Err error
}
```
Per-prompt result from `BatchGenerate`. `Err` carries per-prompt failures (context cancellation, OOM) rather than aborting the entire batch.
### GenerateMetrics
```go
type GenerateMetrics struct {
PromptTokens int
GeneratedTokens int
PrefillDuration time.Duration
DecodeDuration time.Duration
TotalDuration time.Duration
PrefillTokensPerSec float64
DecodeTokensPerSec float64
PeakMemoryBytes uint64
ActiveMemoryBytes uint64
}
```
Performance data for the most recent inference operation. Retrieved via `TextModel.Metrics()` after an iterator is exhausted or a batch call returns. `PeakMemoryBytes` and `ActiveMemoryBytes` are GPU-specific; CPU-only backends may leave them at zero.
### ModelInfo
```go
type ModelInfo struct {
Architecture string
VocabSize int
NumLayers int
HiddenSize int
QuantBits int
QuantGroup int
}
```
Static metadata about a loaded model. `QuantBits` is zero for unquantised (FP16/BF16) models.
## TextModel Interface
```go
type TextModel interface {
Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
Classify(ctx context.Context, prompts []string, opts ...GenerateOption) ([]ClassifyResult, error)
BatchGenerate(ctx context.Context, prompts []string, opts ...GenerateOption) ([]BatchResult, error)
ModelType() string
Info() ModelInfo
Metrics() GenerateMetrics
Err() error
Close() error
}
```
Key design decisions:
**`context.Context` on streaming methods** — Required for HTTP handler cancellation, request timeouts, and graceful shutdown. The context is checked by backends at token boundaries.
**`iter.Seq[Token]` return type** — Go 1.23+ range-over-function iterators. The caller ranges over the sequence; the backend controls token production. The iterator pattern avoids channel overhead and lets the backend use direct memory access to GPU buffers.
**`Err() error`** — `iter.Seq` cannot carry errors alongside values. Following the `database/sql` `Row.Err()` pattern, the error from the most recent `Generate` or `Chat` call is stored internally and retrieved with `Err()` after the iterator finishes. End-of-sequence (EOS token) sets no error; context cancellation and OOM both set one.
**`Chat()` on the model** — Chat templates differ across architectures (Gemma3, Qwen3, Llama3 all use distinct formats). Placing template application in the backend means consumers receive already-formatted input regardless of model family. If templates lived in consumers, every consumer would need to duplicate model-specific formatting logic.
**`Classify()` and `BatchGenerate()`** — Two distinct batch operations with different performance characteristics. `Classify` is prefill-only (single forward pass, no autoregressive loop); it is the fast path for domain labelling in `go-i18n`. `BatchGenerate` runs full autoregressive decoding across multiple prompts in parallel.
**`Info()` and `Metrics()`** — Separated from `Generate`/`Chat` because they serve different call sites. `Info()` is called once after load; `Metrics()` is called after each inference operation for performance monitoring.
## Backend Interface
```go
type Backend interface {
Name() string
LoadModel(path string, opts ...LoadOption) (TextModel, error)
Available() bool
}
```
**`Name()`** — Returns the registry key: `"metal"`, `"rocm"`, or `"llama_cpp"`. This is the string passed to `WithBackend()` by consumers.
**`LoadModel()`** — Accepts a filesystem path to a model directory (containing `config.json` and `.safetensors` weight files) and returns a ready-to-use `TextModel`. The model directory format follows the HuggingFace safetensors layout.
**`Available()`** — Reports whether the backend can run on the current hardware. This allows a backend to be registered unconditionally (e.g. in a shared binary) while still reporting false on platforms where its GPU runtime is absent. `Default()` skips unavailable backends.
## Backend Registry
The registry is a package-level `map[string]Backend` protected by a `sync.RWMutex`. It supports concurrent reads and exclusive writes.
```go
var (
backendsMu sync.RWMutex
backends = map[string]Backend{}
)
```
**Registration** — Backends call `inference.Register(b Backend)` from their `init()` function. The `init()` is guarded by a build tag so it only compiles on the target platform:
```go
// In go-mlx: register_metal.go
//go:build darwin && arm64
func init() { inference.Register(metalBackend{}) }
```
```go
// In go-rocm: register_rocm.go
//go:build linux && amd64
func init() { inference.Register(&rocmBackend{}) }
```
Registering a name that already exists silently overwrites the previous entry. This allows test code to replace backends without a separate de-registration step.
**Discovery** — `Get(name)` performs a direct map lookup. `List()` returns all registered names (order undefined). `Default()` walks a priority list:
```go
for _, name := range []string{"metal", "rocm", "llama_cpp"} {
if b, ok := backends[name]; ok && b.Available() {
return b, nil
}
}
// Fall back to any registered available backend.
```
The priority order encodes hardware preference: Metal (Apple Silicon) delivers the highest throughput for on-device inference on macOS; ROCm is preferred over llama.cpp's HTTP server on Linux because it provides direct GPU memory access without HTTP overhead.
**`LoadModel()` routing** — The top-level `LoadModel()` function is the primary consumer entry point:
```go
func LoadModel(path string, opts ...LoadOption) (TextModel, error) {
cfg := ApplyLoadOpts(opts)
if cfg.Backend != "" {
b, ok := Get(cfg.Backend)
// ... validate and use explicit backend
}
b, err := Default()
// ... use auto-selected backend
}
```
Passing `WithBackend("rocm")` bypasses `Default()` entirely. This is the mechanism used in cross-platform binaries or tests that need to pin a specific backend.
## Functional Options
Generation and loading are configured through two independent option types, both following the standard Go functional options pattern.
### GenerateConfig and GenerateOption
```go
type GenerateConfig struct {
MaxTokens int
Temperature float32
TopK int
TopP float32
StopTokens []int32
RepeatPenalty float32
ReturnLogits bool
}
```
Defaults (from `DefaultGenerateConfig()`): `MaxTokens=256`, `Temperature=0.0` (greedy), all others zero/disabled.
`ApplyGenerateOpts(opts []GenerateOption) GenerateConfig` is called by backends at the start of each inference operation. Options are applied in order; the last write wins for scalar fields.
`WithLogits()` is a flag rather than a value option because logit arrays are vocab-sized (256,128 floats for Gemma3) and should only be allocated when explicitly requested.
### LoadConfig and LoadOption
```go
type LoadConfig struct {
Backend string
ContextLen int
GPULayers int
ParallelSlots int
}
```
Default `GPULayers` is `-1`, meaning full GPU offload. `0` forces CPU-only inference. Positive values specify a layer count for partial offload (relevant to ROCm and llama.cpp; Metal always does full offload).
`ParallelSlots` controls the number of concurrent inference slots the backend allocates. Higher values allow parallel `Generate`/`Chat` calls at the cost of increased VRAM usage. `0` defers to the backend's own default.
## Model Discovery
`Discover(baseDir string) ([]DiscoveredModel, error)` scans one level of a directory tree for model directories. A valid model directory must contain both `config.json` and at least one `.safetensors` file.
```go
type DiscoveredModel struct {
Path string
ModelType string
QuantBits int
QuantGroup int
NumFiles int
}
```
`Path` is always an absolute filesystem path. `ModelType` is read from `config.json`'s `model_type` field. Invalid JSON in `config.json` is silently tolerated — the directory is included with an empty `ModelType`.
`Discover` also checks whether `baseDir` itself is a model directory and, if so, prepends it to the result so that direct-path usage (`Discover("/models/gemma3-1b")`) works without nesting.
## Stability Contract
This package is the shared contract. Every method signature change here requires coordinated updates to go-mlx, go-rocm, and go-ml. The following rules govern interface evolution:
1. Existing method signatures are never changed. Rename or remove nothing from `TextModel` or `Backend`.
2. New methods are only added when two or more consumers have a concrete need.
3. New capability is expressed as separate interfaces (`BatchModel`, `StatsModel`) that embed `TextModel`, allowing consumers to opt in with a type assertion.
4. `GenerateConfig` and `LoadConfig` may gain new fields with zero-value defaults; this is backwards compatible.

docs/development.md (new file, 253 lines)

@@ -0,0 +1,253 @@
# Development Guide — go-inference
## Prerequisites
- Go 1.25 or later (the API relies on `iter.Seq` and range-over-function iterators, stable since Go 1.23)
- No CGO, no build tags, no external tools required
- The package compiles on macOS, Linux, and Windows without modification
## Commands
```bash
# Run all tests
go test ./...
# Run a single test by name
go test -run TestDefault_Good_Metal ./...
# Vet for common mistakes
go vet ./...
# View test coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```
There is no Taskfile in this package; it is small enough that direct `go` invocations suffice. The parent workspace (`/Users/snider/Code/host-uk/core`) uses Task for cross-repo operations.
## Go Workspace
This package is part of the `host-uk/core` Go workspace. After adding or changing module dependencies:
```bash
go work sync
```
The workspace root is `/Users/snider/Code/host-uk/core`. The workspace file (`go.work`) includes this module alongside `cmd/core-gui`, `cmd/bugseti`, and others.
## Module Path
```
forge.lthn.ai/core/go-inference
```
Import it in consumers:
```go
import "forge.lthn.ai/core/go-inference"
```
Remote: `ssh://git@forge.lthn.ai:2223/core/go-inference.git`
## Repository Layout
```
go-inference/
├── inference.go # TextModel, Backend, Token, Message, registry, LoadModel
├── options.go # GenerateConfig, LoadConfig, all With* options
├── discover.go # Discover() and DiscoveredModel
├── inference_test.go # Tests for registry, LoadModel, all types
├── options_test.go # Tests for GenerateConfig, LoadConfig, all options
├── discover_test.go # Tests for Discover()
├── go.mod
├── go.sum
├── CLAUDE.md # Agent instructions
├── README.md
└── docs/
├── architecture.md
├── development.md
└── history.md
```
## Test Patterns
Tests follow the `_Good`, `_Bad`, `_Ugly` suffix convention used across the Core Go ecosystem:
- `_Good` — happy path; confirms the documented behaviour works correctly
- `_Bad` — expected error conditions; confirms errors are returned with useful messages
- `_Ugly` — edge cases, panics, surprising-but-valid behaviour (e.g. last-option-wins, registry overwrites)
```go
func TestDefault_Good_Metal(t *testing.T) { ... }
func TestDefault_Bad_NoBackends(t *testing.T) { ... }
func TestDefault_Ugly_SkipsUnavailablePreferred(t *testing.T) { ... }
```
### Backend Registry Isolation
Tests that touch the global backend registry call `resetBackends(t)` first. This helper clears the map and is defined in `inference_test.go`:
```go
func resetBackends(t *testing.T) {
t.Helper()
backendsMu.Lock()
defer backendsMu.Unlock()
backends = map[string]Backend{}
}
```
Because `resetBackends` is in the `inference` package (not `inference_test`), it has direct access to the unexported `backends` map. Tests must not rely on registration order across test functions; each test that uses the registry must call `resetBackends` at the top.
### Stub Implementations
`inference_test.go` provides `stubBackend` and `stubTextModel` — minimal implementations of `Backend` and `TextModel` for use in registry and routing tests. These are in the `inference` package itself (not a separate `_test` package) to allow access to unexported fields.
When writing new tests, use the existing stubs rather than creating new ones unless you need behaviour the stubs do not support.
### Table-Driven Tests
Prefer table-driven tests for options and configuration variants. The existing `TestApplyGenerateOpts_Good`, `TestWithTemperature_Good`, and `TestDefault_Good_PriorityOrder` tests demonstrate the pattern:
```go
tests := []struct {
name string
val float32
want float32
}{
{"greedy", 0.0, 0.0},
{"low", 0.3, 0.3},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
cfg := ApplyGenerateOpts([]GenerateOption{WithTemperature(tt.val)})
assert.InDelta(t, tt.want, cfg.Temperature, 0.0001)
})
}
```
### Assertions
Use `testify/assert` and `testify/require`:
- `require` for preconditions where failure makes subsequent assertions meaningless (e.g. `require.NoError(t, err)` before using the returned value)
- `assert` for all other checks
- `assert.InDelta` for float32/float64 comparisons (never `==`)
## Coding Standards
### Language
UK English throughout: colour, organisation, centre, licence (noun), serialise, recognise. American spellings are not accepted in comments, documentation, or error messages.
### Formatting
Standard `gofmt` formatting. No custom style rules. Run `gofmt -w .` or `go fmt ./...` before committing.
### Error Messages
Error strings start with the package name and a colon, lowercase, no trailing period:
```go
fmt.Errorf("inference: no backends registered (import a backend package)")
fmt.Errorf("inference: backend %q not registered", cfg.Backend)
fmt.Errorf("inference: backend %q not available on this hardware", cfg.Backend)
```
This convention matches the Go standard library and makes `errors.Is`/`errors.As` wrapping straightforward.
### Strict Types
All parameters and return types are explicitly typed. No `interface{}` or `any` outside of test helpers where unavoidable.
### Dependencies
No new external dependencies may be added to the production code. The `go.mod` `require` block must remain stdlib-only for non-test code. `testify` is the only permitted test dependency.
If you find yourself wanting an external library, reconsider the approach. This package is intentionally minimal.
### Licence Header
Every new `.go` file must carry the EUPL-1.2 licence header:
```go
// Copyright (c) Lethean Technologies Ltd. All rights reserved.
// SPDX-License-Identifier: EUPL-1.2
```
Existing files without this header will be updated in a future housekeeping pass.
## Commit Guidelines
Use conventional commits:
```
type(scope): short imperative description
Longer explanation if needed. UK English. Wrap at 72 characters.
```
Types: `feat`, `fix`, `test`, `docs`, `refactor`, `chore`
Scope: `inference`, `options`, `discover`, or omit for cross-cutting changes.
Examples:
```
feat(inference): add WithParallelSlots load option
fix(discover): handle config.json with invalid JSON gracefully
test(options): add table-driven tests for WithTopP
docs: expand architecture section on registry priority
```
Always include the co-author trailer:
```
Co-Authored-By: Virgil <virgil@lethean.io>
```
## Implementing a Backend
To implement a new backend (e.g. `go-vulkan` for cross-platform GPU inference):
1. Import `forge.lthn.ai/core/go-inference` in the new module.
2. Implement `inference.Backend`:
```go
type vulkanBackend struct{}
func (b *vulkanBackend) Name() string { return "vulkan" }
func (b *vulkanBackend) Available() bool {
// Check whether Vulkan runtime is present on this host.
return vulkan.IsAvailable()
}
func (b *vulkanBackend) LoadModel(path string, opts ...inference.LoadOption) (inference.TextModel, error) {
cfg := inference.ApplyLoadOpts(opts)
// Load model using cfg.ContextLen, cfg.GPULayers, etc.
return &vulkanModel{...}, nil
}
```
3. Implement `inference.TextModel` (all nine methods).
4. Register in `init()`, guarded by the appropriate build tag:
```go
//go:build linux && (amd64 || arm64)
func init() { inference.Register(&vulkanBackend{}) }
```
5. Write stub-based tests to confirm the backend registers and `LoadModel` routes correctly without requiring real GPU hardware in CI.
## Extending the Interface
Before adding a method to `TextModel` or `Backend`, consider:
- Do two or more existing consumers require this capability right now?
- Can the capability be expressed as a separate interface that embeds `TextModel`?
- Will adding this method break existing backend implementations that do not yet provide it?
If the answer to the first question is no, defer the addition. If a separate interface is sufficient, prefer that approach. See `docs/architecture.md` for the stability contract.
When a new method is genuinely necessary, coordinate with the owners of go-mlx, go-rocm, and go-ml before merging, since all three must implement the new method simultaneously or the interface will be broken at build time.

docs/history.md (new file, 137 lines)

@@ -0,0 +1,137 @@
# Project History — go-inference
## Origin
`go-inference` was created on 19 February 2026 to solve a dependency inversion problem in the Core Go ecosystem.
`go-mlx` (Apple Metal inference on darwin/arm64) and `go-rocm` (AMD ROCm inference on linux/amd64) both needed to expose the same `TextModel` interface so that `go-ml` and `go-ai` could treat them interchangeably. The two backends cannot import each other — each carries platform-specific CGO or subprocess dependencies that would break cross-platform compilation.
Three options were considered:
1. **Duplicate interfaces** — Each backend defines its own `TextModel`. Simple to start, but the interfaces diverge over time as backends evolve without a shared contract. Rejected.
2. **Shared interface package** (chosen) — A new package with zero dependencies defines the contract. ~100 LOC at inception, compiles on all platforms. All backends import it; it imports nothing.
3. **Define in go-ml** — `go-ml` already had `Backend` and `StreamingBackend` types. Rejected because `go-ml` carries heavy dependencies (DuckDB, Parquet) that backends should not import.
## Commit History
### `fca0ed8` — Initial commit
Repository scaffolding. `go.mod`, empty `README.md`.
### `07cd917` — feat: define shared TextModel, Backend, Token, Message interfaces
First substantive commit. Defined `TextModel`, `Backend`, `Token`, `Message`, the `Register`/`Get`/`List`/`Default`/`LoadModel` registry functions, `GenerateConfig`, `LoadConfig`, and all `With*` options. Established the zero-dependency constraint and the `Default()` priority order (metal > rocm > llama_cpp).
### `3719734` — feat: add ParallelSlots to LoadConfig for concurrent inference
Added `WithParallelSlots` to `LoadConfig`. Required for llama.cpp backends that allocate inference slots at load time. Metal backends ignore the field.
### `2517b77` — feat: add batch inference API (Classify, BatchGenerate)
Added `Classify` and `BatchGenerate` to `TextModel`, along with `ClassifyResult` and `BatchResult`. `Classify` is a prefill-only fast path (single forward pass, no autoregressive decoding) for domain classification tasks in `go-i18n`. `BatchGenerate` runs full autoregressive decoding across multiple prompts in parallel.
### `df17676` — feat: add GenerateMetrics type and Metrics() to TextModel
Added `GenerateMetrics` and `TextModel.Metrics()`. Provides per-operation performance data: token counts, prefill and decode durations, throughput, and GPU memory usage. Required by the LEM Lab dashboard and future monitoring integrations.
### `28f444c` — feat: add ModelInfo type and Info() to TextModel
Added `ModelInfo` and `TextModel.Info()`. Provides static metadata about a loaded model: architecture, vocabulary size, layer count, hidden dimension, and quantisation details. Required by `go-ai` MCP tools that surface model information to agents.
### `884225d` — feat: add Discover() for scanning model directories
Added `Discover(baseDir string) ([]DiscoveredModel, error)` and `DiscoveredModel`. Scans a directory tree (one level deep) for model directories identified by the presence of `config.json` and `.safetensors` weight files. Used by LEM Lab's model picker UI and `go-ai`'s model listing MCP tool.
### `c61ec9f` — docs: expand package doc with workflow examples
Expanded the package-level godoc comment in `inference.go` with complete examples: streaming generation, chat, classification, batch generation, functional options, and model discovery.
### `15ee86e` — fix: add json struct tags to Message for API serialization
Added `json:"role"` and `json:"content"` tags to `Message`. Required for correct serialisation through `go-ai`'s MCP tool payloads and the agentic portal's REST API.
### `d76448d` — test(inference): add comprehensive tests for all exported API
1,074 lines of tests in a Pest-style describe/it structure, built on Go's `testing` package and `testify`. Comprehensive coverage of:
- `Register`, `Get`, `List`, `Default`, `LoadModel` — all happy paths, error paths, and edge cases
- `Default()` priority order (metal > rocm > llama_cpp > any available)
- All `GenerateOption` and `LoadOption` functions
- `ApplyGenerateOpts` and `ApplyLoadOpts` — nil options, empty options, last-option-wins
- `Discover` — single models, multiple models, quantised models, base-dir-as-model, missing files, invalid JSON
- All struct types: `Token`, `Message`, `ClassifyResult`, `BatchResult`, `ModelInfo`, `GenerateMetrics`
- Compile-time interface compliance assertions
Dispatched to Charon (Linux build agent). Commit hash recorded in TODO.md as Phase 1 foundation marker.
### `85f587a` — docs: mark Phase 1 foundation tests complete (Charon d76448d)
Updated TODO.md to record Phase 1 completion and Charon's commit hash.
### `c91e305` — docs: mark Phase 2 integration complete — all 3 backends migrated
Updated TODO.md to record Phase 2 integration completion across go-mlx, go-rocm, and go-ml.
## Phase Summary
### Phase 1 — Foundation (complete)
Established the interface contract, registry, functional options, model discovery, and comprehensive tests. All exported API covered. No backend implementations in this package.
### Phase 2 — Integration (complete)
All three backends migrated to implement `inference.TextModel` and register via `inference.Register()`:
- **go-mlx** (`register_metal.go`, darwin/arm64): `metalBackend{}` + `metalAdapter{}` wrap the internal Metal model. Full `TextModel` coverage including `Classify`, `BatchGenerate`, `Info`, `Metrics`. Build-tagged `darwin && arm64`.
- **go-rocm** (`register_rocm.go`, linux/amd64): `rocmBackend{}` spawns and manages a `llama-server` subprocess. 5,794 LOC. Build-tagged `linux && amd64`.
- **go-ml** (`adapter.go`, `backend_http_textmodel.go`): Two-way bridge. `adapter.go` (118 LOC) wraps `inference.TextModel` into `go-ml`'s internal `Backend`/`StreamingBackend` interfaces. `backend_http_textmodel.go` (135 LOC) provides the reverse: wraps an HTTP llama.cpp server as `inference.TextModel`. `backend_mlx.go` collapsed from 253 to 35 LOC after migration.
### Phase 3 — Extended Interfaces (deferred)
Two interfaces are specified but not yet implemented, pending concrete consumer demand:
**BatchModel** — For throughput-sensitive batch classification (e.g. `go-i18n` processing 5,000 sentences per second):
```go
type BatchModel interface {
TextModel
BatchGenerate(ctx context.Context, prompts []string, opts ...GenerateOption) iter.Seq2[int, Token]
}
```
Note: the current `BatchGenerate` on `TextModel` collects all tokens before returning. A streaming `BatchModel` with `iter.Seq2` would reduce peak memory for large batches.
**StatsModel** — For dashboard and monitoring integrations:
```go
type StatsModel interface {
TextModel
Stats() GenerateStats
}
```
Where `GenerateStats` aggregates `GenerateMetrics` across multiple calls (rolling averages, peak values, histograms).
Neither interface will be added until at least two consumers have a concrete need. The pattern for adding them is: define the interface in this package, update go-mlx and go-rocm to implement it, update go-ml's adapter, then update consumers.
## Known Limitations
**Metrics on CPU backends** — `GenerateMetrics.PeakMemoryBytes` and `ActiveMemoryBytes` are zero for CPU-only backends. There is no protocol for backends to report CPU RAM usage; this was considered unnecessary at the time of design.
**`Discover` scan depth** — `Discover` scans only one level deep. Deeply nested model hierarchies (e.g. `models/org/repo/revision/`) are not found. The consumer is expected to call `Discover` on the correct parent directory.
**`Discover` and invalid JSON** — A `config.json` containing invalid JSON is silently tolerated: the directory is included with an empty `ModelType`. This prevents a single malformed file from hiding all other models in a directory, but it means the returned `DiscoveredModel` may be incomplete.
**No de-registration** — `Register` overwrites silently; there is no `Unregister`. This is intentional for simplicity. Backends registered in `init()` live for the lifetime of the process.
**`Default()` error message** — When all registered backends are unavailable, the error reads "no backends registered" rather than "no backends available". The message is reused from the empty-registry case; this is slightly misleading, but it lets consumers that treat both failures identically match a single error.
**`ParallelSlots` ignored by Metal** — Apple Metal manages concurrency internally. `WithParallelSlots` is accepted by `go-mlx` but has no effect. This is documented in `options.go` but not enforced.
## Future Considerations
- A `StatsModel` interface, when two consumers require aggregated metrics.
- A streaming `BatchModel` with `iter.Seq2[int, Token]` for high-throughput classification.
- Licence headers on all source files (currently absent, tracked informally).
- A formal `CHANGELOG.md` if the package grows beyond its current single-package scope.
- Consideration of `errors.Is`/`errors.As` sentinel errors (e.g. `ErrNoBackend`, `ErrBackendUnavailable`) to allow consumers to handle specific failure modes without string matching.