docs: graduate TODO/FINDINGS into production documentation

Replace internal task tracking (TODO.md, FINDINGS.md) with structured
documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider 2026-02-20 15:01:55 +00:00
parent c91e30599c
commit 74b3e7a53d
6 changed files with 710 additions and 118 deletions

CLAUDE.md

@@ -4,7 +4,7 @@
 Shared inference interfaces for the Core Go ecosystem. Module: `forge.lthn.ai/core/go-inference`
-This package defines the contract between GPU-specific backends (go-mlx on macOS, go-rocm on Linux) and consumers (go-ml, go-ai, go-i18n). It has **zero dependencies** and compiles on all platforms.
+Zero dependencies. Compiles on all platforms. See `docs/architecture.md` for design rationale.
 ## Commands
@@ -13,64 +13,34 @@ go test ./... # Run all tests
 go vet ./... # Vet
 ```
-## Architecture
+## Stability Rules
-```
-go-inference (this package) ← defines TextModel, Backend, Token, Message
-        ↑                      ↑
-        │                      │
-go-mlx (darwin/arm64)   go-rocm (linux/amd64)
-        │                      │
-        └────── go-ml ─────────┘ (wraps backends into scoring engine)
-                go-ai (MCP hub)
-```
-This package is the shared contract. Changes here affect go-mlx, go-rocm, and go-ml simultaneously.
-### Key Types
-| Type | Purpose |
-|------|---------|
-| `TextModel` | Core interface: Generate, Chat, Err, Close |
-| `Backend` | Named engine that can LoadModel → TextModel |
-| `Token` | Streaming token (ID + Text) |
-| `Message` | Chat message (Role + Content) |
-| `GenerateOption` | Functional option for generation (temp, topK, etc.) |
-| `LoadOption` | Functional option for model loading (backend, GPU layers, etc.) |
-### Backend Registry
-Backends register via `init()` with build tags. Consumers call `LoadModel()` which auto-selects the best available backend:
-```go
-// Auto-detect: Metal on macOS, ROCm on Linux
-m, err := inference.LoadModel("/path/to/model/")
-// Explicit backend
-m, err := inference.LoadModel("/path/", inference.WithBackend("rocm"))
-```
+- Never change existing method signatures on `TextModel` or `Backend`
+- Only add methods when two or more consumers need them
+- Prefer new interfaces that embed `TextModel` over extending `TextModel` itself
+- New fields on `GenerateConfig` or `LoadConfig` are safe (zero-value defaults)
+- All new interface methods require Virgil approval before merging
 ## Coding Standards
 - UK English
-- Zero external dependencies — stdlib only
-- Tests: testify assert/require
-- Conventional commits
+- Zero external dependencies — stdlib only (testify permitted in tests)
+- Conventional commits: `type(scope): description`
 - Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
 - Licence: EUPL-1.2
 ## Consumers
-- **go-mlx**: Implements `Backend` + `TextModel` for Apple Metal (darwin/arm64)
-- **go-rocm**: Implements `Backend` + `TextModel` for AMD ROCm (linux/amd64)
-- **go-ml**: Wraps inference backends into scoring engine, adds llama.cpp HTTP backend
+- **go-mlx**: implements `Backend` + `TextModel` for Apple Metal (darwin/arm64)
+- **go-rocm**: implements `Backend` + `TextModel` for AMD ROCm (linux/amd64)
+- **go-ml**: wraps inference backends into scoring engine, adds llama.cpp HTTP backend
 - **go-ai**: MCP hub, exposes inference via MCP tools
-- **go-i18n**: Uses TextModel for Gemma3-1B domain classification
+- **go-i18n**: uses `TextModel` for Gemma3-1B domain classification
-## Stability
+## Documentation
-This package is the shared contract. Changes here affect all backends and consumers. Keep the interface minimal and stable. Add new methods only when two or more consumers need them.
-## Task Queue
-See `TODO.md` for prioritised work.
-See `FINDINGS.md` for research notes.
+- `docs/architecture.md` — interfaces, registry, options, design decisions
+- `docs/development.md` — prerequisites, build, test patterns, coding standards
+- `docs/history.md` — completed phases, commit log, known limitations

FINDINGS.md (deleted)

@@ -1,33 +0,0 @@
# FINDINGS.md — go-inference Research & Discovery
---
## 2026-02-19: Package Creation (Virgil)
### Motivation
go-mlx (darwin/arm64) and go-rocm (linux/amd64) both need to implement the same TextModel interface, but go-rocm can't import go-mlx (platform-specific CGO dependency). A shared interface package solves this.
### Alternatives Considered
1. **Duplicate interfaces** — Each backend defines its own TextModel. Simple but diverges over time as backends evolve independently. Rejected.
2. **Shared interface package** (chosen) — `core/go-inference` defines the contract. ~100 LOC, zero deps, compiles everywhere.
3. **Define in go-ml** — go-ml already has Backend/StreamingBackend. But go-ml has heavy deps (DuckDB, Parquet) that backends shouldn't import. Rejected.
### Interface Design Decisions
- **`context.Context` on Generate/Chat**: Required for HTTP handler cancellation, timeouts, graceful shutdown. go-ml's current backend_mlx.go already uses ctx.
- **`Err() error` on TextModel**: iter.Seq can't carry errors. Consumers check Err() after the iterator stops. Pattern matches database/sql Row.Err().
- **`Chat()` on TextModel**: Models own their chat templates (Gemma3, Qwen3, Llama3 all have different formats). Keeping templates in consumers means every consumer duplicates model-specific formatting.
- **`Available() bool` on Backend**: Needed for Default() to skip unavailable backends (e.g. ROCm registered but no GPU present).
- **`GPULayers` in LoadConfig**: ROCm/llama.cpp support partial GPU offload. Metal always does full offload. Default -1 = all layers.
- **`RepeatPenalty` in GenerateConfig**: llama.cpp backends use this heavily. Metal backends can ignore it.
### Consumer Mapping
| Consumer | What it imports | How it uses TextModel |
|----------|----------------|----------------------|
| go-ml | go-inference | Wraps TextModel into its own Backend interface, adds scoring |
| go-ai | go-inference (via go-ml) | Exposes via MCP tools |
| go-i18n | go-inference | Direct: LoadModel → Generate(WithMaxTokens(1)) for classification |
| LEM Lab | go-inference (via go-ml) | Chat streaming for web UI |

TODO.md (37 lines, deleted)

@@ -1,37 +0,0 @@
# TODO.md — go-inference Task Queue
Dispatched from core/go orchestration. This package is minimal by design.
---
## Phase 1: Foundation — `d76448d` (Charon)
- [x] **Add tests for option application** — Verify GenerateConfig defaults, all With* options, ApplyGenerateOpts/ApplyLoadOpts behaviour. Comprehensive API tests (1,074 LOC).
- [x] **Add tests for backend registry** — Register, Get, List, Default priority order, LoadModel routing.
- [x] **Add tests for Default() platform preference** — Verify metal > rocm > llama_cpp ordering.
## Phase 2: Integration — COMPLETE
- [x] **go-mlx migration** — `register_metal.go` implements `inference.Backend` via `metalBackend{}` + `metalAdapter{}` wrapping `internal/metal.Model`. Auto-registers via `inference.Register()` in `init()`. Build-tagged `darwin && arm64`. Full TextModel coverage: Generate, Chat, Classify, BatchGenerate, Info, Metrics, Err, Close.
- [x] **go-rocm implementation** — `register_rocm.go` implements `inference.Backend` + `inference.TextModel` via llama-server subprocess. Auto-registers via `inference.Register(&rocmBackend{})`. Phase 4 complete (5,794 LOC by Charon).
- [x] **go-ml migration** — `adapter.go` bridges `inference.TextModel` → `ml.Backend/StreamingBackend` (118 LOC, 13 tests). `backend_mlx.go` collapsed from 253 to 35 LOC using `inference.LoadModel`. `backend_http_textmodel.go` provides reverse wrappers (135 LOC, 19 tests).
## Phase 3: Extended Interfaces (when needed)
- [ ] **BatchModel interface** — When go-i18n needs 5K sentences/sec, add: `type BatchModel interface { TextModel; BatchGenerate(ctx, []string, ...GenerateOption) iter.Seq2[int, Token] }`. Not before it's needed.
- [ ] **Stats interface** — When LEM Lab dashboard needs metrics: `type StatsModel interface { TextModel; Stats() GenerateStats }` with tokens/sec, peak memory, GPU util.
---
## Design Principles
1. **Minimal interface** — Only add methods when 2+ consumers need them
2. **Zero dependencies** — stdlib only, compiles everywhere
3. **Backwards compatible** — New interfaces extend, never modify existing ones
4. **Platform agnostic** — No build tags, no CGO, no OS-specific code
## Workflow
1. Virgil in core/go manages this package directly (too small for a dedicated Claude)
2. Changes here are coordinated with go-mlx and go-rocm Claudes via their TODO.md
3. New interface methods require Virgil approval before adding

docs/architecture.md (new file, 302 lines)

@@ -0,0 +1,302 @@
# Architecture — go-inference
## Purpose
`go-inference` is the shared interface contract for text generation backends in the Core Go ecosystem. It defines the types that GPU-specific backends implement and consumers depend on, without itself importing any backend or consumer code.
Module path: `forge.lthn.ai/core/go-inference`
## Design Philosophy
### Zero Dependencies
The package imports only the Go standard library (`context`, `fmt`, `iter`, `sync`, `time`, `encoding/json`, `os`, `path/filepath`). The sole exception is `testify` in the test tree.
This is a deliberate constraint. The package sits at the base of a dependency graph where:
- `go-mlx` pulls in CGO bindings against Apple's Metal framework
- `go-rocm` spawns a `llama-server` subprocess with AMD ROCm libraries
- `go-ml` links DuckDB and Parquet
None of those concerns belong in the interface layer. A backend can import `go-inference`; `go-inference` cannot import a backend. A consumer can import `go-inference`; `go-inference` cannot import a consumer.
### Minimal Interface Surface
New methods are only added when two or more existing consumers need them. The interfaces are deliberately narrow. Broader capability is achieved through additional interfaces (`BatchModel`, `StatsModel`) that embed `TextModel`, not through extending `TextModel` itself.
### Platform Agnostic
No build tags, no `//go:build` constraints, no `CGO_ENABLED` requirements appear in this package. It compiles cleanly on macOS, Linux, and Windows regardless of GPU availability.
## Ecosystem Position
```
go-inference (this package) ← defines TextModel, Backend, Token, Message
|
|──────── implemented by ──────────────────────────────
| |
go-mlx go-rocm
(darwin/arm64, Metal GPU) (linux/amd64, AMD ROCm)
| |
└───────────────── consumed by ────────────────────────┘
|
go-ml
(scoring engine, llama.cpp HTTP)
|
go-ai
(MCP hub, 30+ tools)
|
go-i18n
(domain classification via Gemma3-1B)
```
`go-ml` also provides a reverse adapter (`backend_http_textmodel.go`) that wraps an HTTP llama.cpp server as a `TextModel`, giving a third backend path without Metal or ROCm.
## Core Types
### Token
```go
type Token struct {
ID int32
Text string
}
```
The atomic unit of streaming output. `ID` is the vocabulary index; `Text` is the decoded string. Backends yield these through `iter.Seq[Token]`.
### Message
```go
type Message struct {
Role string `json:"role"` // "system", "user", "assistant"
Content string `json:"content"`
}
```
A single turn in a multi-turn conversation. JSON tags are present for serialisation through MCP tool payloads and API responses.
### ClassifyResult
```go
type ClassifyResult struct {
Token Token
Logits []float32
}
```
Output from a single prefill-only forward pass. `Logits` is populated only when `WithLogits()` is set; it is empty by default to avoid allocating vocab-sized float arrays for every classification call.
### BatchResult
```go
type BatchResult struct {
Tokens []Token
Err error
}
```
Per-prompt result from `BatchGenerate`. `Err` carries per-prompt failures (context cancellation, OOM) rather than aborting the entire batch.
### GenerateMetrics
```go
type GenerateMetrics struct {
PromptTokens int
GeneratedTokens int
PrefillDuration time.Duration
DecodeDuration time.Duration
TotalDuration time.Duration
PrefillTokensPerSec float64
DecodeTokensPerSec float64
PeakMemoryBytes uint64
ActiveMemoryBytes uint64
}
```
Performance data for the most recent inference operation. Retrieved via `TextModel.Metrics()` after an iterator is exhausted or a batch call returns. `PeakMemoryBytes` and `ActiveMemoryBytes` are GPU-specific; CPU-only backends may leave them at zero.
### ModelInfo
```go
type ModelInfo struct {
Architecture string
VocabSize int
NumLayers int
HiddenSize int
QuantBits int
QuantGroup int
}
```
Static metadata about a loaded model. `QuantBits` is zero for unquantised (FP16/BF16) models.
## TextModel Interface
```go
type TextModel interface {
Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
Classify(ctx context.Context, prompts []string, opts ...GenerateOption) ([]ClassifyResult, error)
BatchGenerate(ctx context.Context, prompts []string, opts ...GenerateOption) ([]BatchResult, error)
ModelType() string
Info() ModelInfo
Metrics() GenerateMetrics
Err() error
Close() error
}
```
Key design decisions:
**`context.Context` on streaming methods** — Required for HTTP handler cancellation, request timeouts, and graceful shutdown. The context is checked by backends at token boundaries.
**`iter.Seq[Token]` return type** — Go 1.23+ range-over-function iterators. The caller ranges over the sequence; the backend controls token production. The iterator pattern avoids channel overhead and lets the backend use direct memory access to GPU buffers.
**`Err() error`** — `iter.Seq` cannot carry errors alongside values. Following the `database/sql` `Row.Err()` pattern, the error from the most recent `Generate` or `Chat` call is stored internally and retrieved with `Err()` after the iterator finishes. End-of-sequence (EOS token) sets no error; context cancellation and OOM both set one.
**`Chat()` on the model** — Chat templates differ across architectures (Gemma3, Qwen3, Llama3 all use distinct formats). Placing template application in the backend means consumers receive already-formatted input regardless of model family. If templates lived in consumers, every consumer would need to duplicate model-specific formatting logic.
**`Classify()` and `BatchGenerate()`** — Two distinct batch operations with different performance characteristics. `Classify` is prefill-only (single forward pass, no autoregressive loop); it is the fast path for domain labelling in `go-i18n`. `BatchGenerate` runs full autoregressive decoding across multiple prompts in parallel.
**`Info()` and `Metrics()`** — Separated from `Generate`/`Chat` because they serve different call sites. `Info()` is called once after load; `Metrics()` is called after each inference operation for performance monitoring.
## Backend Interface
```go
type Backend interface {
Name() string
LoadModel(path string, opts ...LoadOption) (TextModel, error)
Available() bool
}
```
**`Name()`** — Returns the registry key: `"metal"`, `"rocm"`, or `"llama_cpp"`. This is the string passed to `WithBackend()` by consumers.
**`LoadModel()`** — Accepts a filesystem path to a model directory (containing `config.json` and `.safetensors` weight files) and returns a ready-to-use `TextModel`. The model directory format follows the HuggingFace safetensors layout.
**`Available()`** — Reports whether the backend can run on the current hardware. This allows a backend to be registered unconditionally (e.g. in a shared binary) while still reporting false on platforms where its GPU runtime is absent. `Default()` skips unavailable backends.
## Backend Registry
The registry is a package-level `map[string]Backend` protected by a `sync.RWMutex`. It supports concurrent reads and exclusive writes.
```go
var (
backendsMu sync.RWMutex
backends = map[string]Backend{}
)
```
**Registration** — Backends call `inference.Register(b Backend)` from their `init()` function. The `init()` is guarded by a build tag so it only compiles on the target platform:
```go
// In go-mlx: register_metal.go
//go:build darwin && arm64
func init() { inference.Register(metalBackend{}) }
```
```go
// In go-rocm: register_rocm.go
//go:build linux && amd64
func init() { inference.Register(&rocmBackend{}) }
```
Registering a name that already exists silently overwrites the previous entry. This allows test code to replace backends without a separate de-registration step.
**Discovery** — `Get(name)` performs a direct map lookup. `List()` returns all registered names (order undefined). `Default()` walks a priority list:
```go
for _, name := range []string{"metal", "rocm", "llama_cpp"} {
if b, ok := backends[name]; ok && b.Available() {
return b, nil
}
}
// Fall back to any registered available backend.
```
The priority order encodes hardware preference: Metal (Apple Silicon) delivers the highest throughput for on-device inference on macOS; ROCm is preferred over llama.cpp's HTTP server on Linux because it provides direct GPU memory access without HTTP overhead.
**`LoadModel()` routing** — The top-level `LoadModel()` function is the primary consumer entry point:
```go
func LoadModel(path string, opts ...LoadOption) (TextModel, error) {
cfg := ApplyLoadOpts(opts)
if cfg.Backend != "" {
b, ok := Get(cfg.Backend)
// ... validate and use explicit backend
}
b, err := Default()
// ... use auto-selected backend
}
```
Passing `WithBackend("rocm")` bypasses `Default()` entirely. This is the mechanism used in cross-platform binaries or tests that need to pin a specific backend.
## Functional Options
Generation and loading are configured through two independent option types, both following the standard Go functional options pattern.
### GenerateConfig and GenerateOption
```go
type GenerateConfig struct {
MaxTokens int
Temperature float32
TopK int
TopP float32
StopTokens []int32
RepeatPenalty float32
ReturnLogits bool
}
```
Defaults (from `DefaultGenerateConfig()`): `MaxTokens=256`, `Temperature=0.0` (greedy), all others zero/disabled.
`ApplyGenerateOpts(opts []GenerateOption) GenerateConfig` is called by backends at the start of each inference operation. Options are applied in order; the last write wins for scalar fields.
`WithLogits()` is a flag rather than a value option because logit arrays are vocab-sized (256,128 floats for Gemma3) and should only be allocated when explicitly requested.
### LoadConfig and LoadOption
```go
type LoadConfig struct {
Backend string
ContextLen int
GPULayers int
ParallelSlots int
}
```
Default `GPULayers` is `-1`, meaning full GPU offload. `0` forces CPU-only inference. Positive values specify a layer count for partial offload (relevant to ROCm and llama.cpp; Metal always does full offload).
`ParallelSlots` controls the number of concurrent inference slots the backend allocates. Higher values allow parallel `Generate`/`Chat` calls at the cost of increased VRAM usage. `0` defers to the backend's own default.
## Model Discovery
`Discover(baseDir string) ([]DiscoveredModel, error)` scans one level of a directory tree for model directories. A valid model directory must contain both `config.json` and at least one `.safetensors` file.
```go
type DiscoveredModel struct {
Path string
ModelType string
QuantBits int
QuantGroup int
NumFiles int
}
```
`Path` is always an absolute filesystem path. `ModelType` is read from `config.json`'s `model_type` field. Invalid JSON in `config.json` is silently tolerated — the directory is included with an empty `ModelType`.
`Discover` also checks whether `baseDir` itself is a model directory and, if so, prepends it to the result so that direct-path usage (`Discover("/models/gemma3-1b")`) works without nesting.
## Stability Contract
This package is the shared contract. Every method signature change here requires coordinated updates to go-mlx, go-rocm, and go-ml. The following rules govern interface evolution:
1. Existing method signatures are never changed. Rename or remove nothing from `TextModel` or `Backend`.
2. New methods are only added when two or more consumers have a concrete need.
3. New capability is expressed as separate interfaces (`BatchModel`, `StatsModel`) that embed `TextModel`, allowing consumers to opt in with a type assertion.
4. `GenerateConfig` and `LoadConfig` may gain new fields with zero-value defaults; this is backwards compatible.

docs/development.md (new file, 253 lines)

@@ -0,0 +1,253 @@
# Development Guide — go-inference
## Prerequisites
- Go 1.25 or later (the API relies on `iter.Seq` and range-over-function iterators, stable since Go 1.23)
- No CGO, no build tags, no external tools required
- The package compiles on macOS, Linux, and Windows without modification
## Commands
```bash
# Run all tests
go test ./...
# Run a single test by name
go test -run TestDefault_Good_Metal ./...
# Vet for common mistakes
go vet ./...
# View test coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```
There is no Taskfile in this package; it is small enough that direct `go` invocations suffice. The parent workspace (`/Users/snider/Code/host-uk/core`) uses Task for cross-repo operations.
## Go Workspace
This package is part of the `host-uk/core` Go workspace. After adding or changing module dependencies:
```bash
go work sync
```
The workspace root is `/Users/snider/Code/host-uk/core`. The workspace file (`go.work`) includes this module alongside `cmd/core-gui`, `cmd/bugseti`, and others.
## Module Path
```
forge.lthn.ai/core/go-inference
```
Import it in consumers:
```go
import "forge.lthn.ai/core/go-inference"
```
Remote: `ssh://git@forge.lthn.ai:2223/core/go-inference.git`
## Repository Layout
```
go-inference/
├── inference.go # TextModel, Backend, Token, Message, registry, LoadModel
├── options.go # GenerateConfig, LoadConfig, all With* options
├── discover.go # Discover() and DiscoveredModel
├── inference_test.go # Tests for registry, LoadModel, all types
├── options_test.go # Tests for GenerateConfig, LoadConfig, all options
├── discover_test.go # Tests for Discover()
├── go.mod
├── go.sum
├── CLAUDE.md # Agent instructions
├── README.md
└── docs/
├── architecture.md
├── development.md
└── history.md
```
## Test Patterns
Tests follow the `_Good`, `_Bad`, `_Ugly` suffix convention used across the Core Go ecosystem:
- `_Good` — happy path; confirms the documented behaviour works correctly
- `_Bad` — expected error conditions; confirms errors are returned with useful messages
- `_Ugly` — edge cases, panics, surprising-but-valid behaviour (e.g. last-option-wins, registry overwrites)
```go
func TestDefault_Good_Metal(t *testing.T) { ... }
func TestDefault_Bad_NoBackends(t *testing.T) { ... }
func TestDefault_Ugly_SkipsUnavailablePreferred(t *testing.T) { ... }
```
### Backend Registry Isolation
Tests that touch the global backend registry call `resetBackends(t)` first. This helper clears the map and is defined in `inference_test.go`:
```go
func resetBackends(t *testing.T) {
t.Helper()
backendsMu.Lock()
defer backendsMu.Unlock()
backends = map[string]Backend{}
}
```
Because `resetBackends` is in the `inference` package (not `inference_test`), it has direct access to the unexported `backends` map. Tests must not rely on registration order across test functions; each test that uses the registry must call `resetBackends` at the top.
### Stub Implementations
`inference_test.go` provides `stubBackend` and `stubTextModel` — minimal implementations of `Backend` and `TextModel` for use in registry and routing tests. These are in the `inference` package itself (not a separate `_test` package) to allow access to unexported fields.
When writing new tests, use the existing stubs rather than creating new ones unless you need behaviour the stubs do not support.
### Table-Driven Tests
Prefer table-driven tests for options and configuration variants. The existing `TestApplyGenerateOpts_Good`, `TestWithTemperature_Good`, and `TestDefault_Good_PriorityOrder` tests demonstrate the pattern:
```go
tests := []struct {
name string
val float32
want float32
}{
{"greedy", 0.0, 0.0},
{"low", 0.3, 0.3},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
cfg := ApplyGenerateOpts([]GenerateOption{WithTemperature(tt.val)})
assert.InDelta(t, tt.want, cfg.Temperature, 0.0001)
})
}
```
### Assertions
Use `testify/assert` and `testify/require`:
- `require` for preconditions where failure makes subsequent assertions meaningless (e.g. `require.NoError(t, err)` before using the returned value)
- `assert` for all other checks
- `assert.InDelta` for float32/float64 comparisons (never `==`)
## Coding Standards
### Language
UK English throughout: colour, organisation, centre, licence (noun), serialise, recognise. American spellings are not accepted in comments, documentation, or error messages.
### Formatting
Standard `gofmt` formatting. No custom style rules. Run `gofmt -w .` or `go fmt ./...` before committing.
### Error Messages
Error strings start with the package name and a colon, lowercase, no trailing period:
```go
fmt.Errorf("inference: no backends registered (import a backend package)")
fmt.Errorf("inference: backend %q not registered", cfg.Backend)
fmt.Errorf("inference: backend %q not available on this hardware", cfg.Backend)
```
This convention matches the Go standard library and makes `errors.Is`/`errors.As` wrapping straightforward.
### Strict Types
All parameters and return types are explicitly typed. No `interface{}` or `any` outside of test helpers where unavoidable.
### Dependencies
No new external dependencies may be added to the production code. The `go.mod` `require` block must remain stdlib-only for non-test code. `testify` is the only permitted test dependency.
If you find yourself wanting an external library, reconsider the approach. This package is intentionally minimal.
### Licence Header
Every new `.go` file must carry the EUPL-1.2 licence header:
```go
// Copyright (c) Lethean Technologies Ltd. All rights reserved.
// SPDX-License-Identifier: EUPL-1.2
```
Existing files without this header will be updated in a future housekeeping pass.
## Commit Guidelines
Use conventional commits:
```
type(scope): short imperative description
Longer explanation if needed. UK English. Wrap at 72 characters.
```
Types: `feat`, `fix`, `test`, `docs`, `refactor`, `chore`
Scope: `inference`, `options`, `discover`, or omit for cross-cutting changes.
Examples:
```
feat(inference): add WithParallelSlots load option
fix(discover): handle config.json with invalid JSON gracefully
test(options): add table-driven tests for WithTopP
docs: expand architecture section on registry priority
```
Always include the co-author trailer:
```
Co-Authored-By: Virgil <virgil@lethean.io>
```
## Implementing a Backend
To implement a new backend (e.g. `go-vulkan` for cross-platform GPU inference):
1. Import `forge.lthn.ai/core/go-inference` in the new module.
2. Implement `inference.Backend`:
```go
type vulkanBackend struct{}
func (b *vulkanBackend) Name() string { return "vulkan" }
func (b *vulkanBackend) Available() bool {
// Check whether Vulkan runtime is present on this host.
return vulkan.IsAvailable()
}
func (b *vulkanBackend) LoadModel(path string, opts ...inference.LoadOption) (inference.TextModel, error) {
cfg := inference.ApplyLoadOpts(opts)
// Load model using cfg.ContextLen, cfg.GPULayers, etc.
return &vulkanModel{...}, nil
}
```
3. Implement `inference.TextModel` (all nine methods).
4. Register in `init()`, guarded by the appropriate build tag:
```go
//go:build linux && (amd64 || arm64)
func init() { inference.Register(&vulkanBackend{}) }
```
5. Write stub-based tests to confirm the backend registers and `LoadModel` routes correctly without requiring real GPU hardware in CI.
## Extending the Interface
Before adding a method to `TextModel` or `Backend`, consider:
- Do two or more existing consumers require this capability right now?
- Can the capability be expressed as a separate interface that embeds `TextModel`?
- Will adding this method break existing backend implementations that do not yet provide it?
If the answer to the first question is no, defer the addition. If a separate interface is sufficient, prefer that approach. See `docs/architecture.md` for the stability contract.
When a new method is genuinely necessary, coordinate with the owners of go-mlx, go-rocm, and go-ml before merging, since all three must implement the new method simultaneously or the interface will be broken at build time.

docs/history.md (new file, 137 lines)

@@ -0,0 +1,137 @@
# Project History — go-inference
## Origin
`go-inference` was created on 19 February 2026 to solve a dependency inversion problem in the Core Go ecosystem.
`go-mlx` (Apple Metal inference on darwin/arm64) and `go-rocm` (AMD ROCm inference on linux/amd64) both needed to expose the same `TextModel` interface so that `go-ml` and `go-ai` could treat them interchangeably. The two backends cannot import each other — each carries platform-specific CGO or subprocess dependencies that would break cross-platform compilation.
Three options were considered:
1. **Duplicate interfaces** — Each backend defines its own `TextModel`. Simple to start, but the interfaces diverge over time as backends evolve without a shared contract. Rejected.
2. **Shared interface package** (chosen) — A new package with zero dependencies defines the contract. ~100 LOC at inception, compiles on all platforms. All backends import it; it imports nothing.
3. **Define in go-ml** — `go-ml` already had `Backend` and `StreamingBackend` types. Rejected because `go-ml` carries heavy dependencies (DuckDB, Parquet) that backends should not import.
## Commit History
### `fca0ed8` — Initial commit
Repository scaffolding. `go.mod`, empty `README.md`.
### `07cd917` — feat: define shared TextModel, Backend, Token, Message interfaces
First substantive commit. Defined `TextModel`, `Backend`, `Token`, `Message`, the `Register`/`Get`/`List`/`Default`/`LoadModel` registry functions, `GenerateConfig`, `LoadConfig`, and all `With*` options. Established the zero-dependency constraint and the `Default()` priority order (metal > rocm > llama_cpp).
### `3719734` — feat: add ParallelSlots to LoadConfig for concurrent inference
Added `WithParallelSlots` to `LoadConfig`. Required for llama.cpp backends that allocate inference slots at load time. Metal backends ignore the field.
### `2517b77` — feat: add batch inference API (Classify, BatchGenerate)
Added `Classify` and `BatchGenerate` to `TextModel`, along with `ClassifyResult` and `BatchResult`. `Classify` is a prefill-only fast path (single forward pass, no autoregressive decoding) for domain classification tasks in `go-i18n`. `BatchGenerate` runs full autoregressive decoding across multiple prompts in parallel.
### `df17676` — feat: add GenerateMetrics type and Metrics() to TextModel
Added `GenerateMetrics` and `TextModel.Metrics()`. Provides per-operation performance data: token counts, prefill and decode durations, throughput, and GPU memory usage. Required by the LEM Lab dashboard and future monitoring integrations.
### `28f444c` — feat: add ModelInfo type and Info() to TextModel
Added `ModelInfo` and `TextModel.Info()`. Provides static metadata about a loaded model: architecture, vocabulary size, layer count, hidden dimension, and quantisation details. Required by `go-ai` MCP tools that surface model information to agents.
### `884225d` — feat: add Discover() for scanning model directories
Added `Discover(baseDir string) ([]DiscoveredModel, error)` and `DiscoveredModel`. Scans a directory tree (one level deep) for model directories identified by the presence of `config.json` and `.safetensors` weight files. Used by LEM Lab's model picker UI and `go-ai`'s model listing MCP tool.
### `c61ec9f` — docs: expand package doc with workflow examples
Expanded the package-level godoc comment in `inference.go` with complete examples: streaming generation, chat, classification, batch generation, functional options, and model discovery.
### `15ee86e` — fix: add json struct tags to Message for API serialization
Added `json:"role"` and `json:"content"` tags to `Message`. Required for correct serialisation through `go-ai`'s MCP tool payloads and the agentic portal's REST API.
### `d76448d` — test(inference): add comprehensive tests for all exported API
1,074 lines of tests in a Pest-style describe/it structure, built on Go's `testing` package and `testify`. Comprehensive coverage of:
- `Register`, `Get`, `List`, `Default`, `LoadModel` — all happy paths, error paths, and edge cases
- `Default()` priority order (metal > rocm > llama_cpp > any available)
- All `GenerateOption` and `LoadOption` functions
- `ApplyGenerateOpts` and `ApplyLoadOpts` — nil options, empty options, last-option-wins
- `Discover` — single models, multiple models, quantised models, base-dir-as-model, missing files, invalid JSON
- All struct types: `Token`, `Message`, `ClassifyResult`, `BatchResult`, `ModelInfo`, `GenerateMetrics`
- Compile-time interface compliance assertions
Dispatched to Charon (Linux build agent). Commit hash recorded in TODO.md as Phase 1 foundation marker.
### `85f587a` — docs: mark Phase 1 foundation tests complete (Charon d76448d)
Updated TODO.md to record Phase 1 completion and Charon's commit hash.
### `c91e305` — docs: mark Phase 2 integration complete — all 3 backends migrated
Updated TODO.md to record Phase 2 integration completion across go-mlx, go-rocm, and go-ml.
## Phase Summary
### Phase 1 — Foundation (complete)
Established the interface contract, registry, functional options, model discovery, and comprehensive tests. All exported API covered. No backend implementations in this package.
### Phase 2 — Integration (complete)
All three backends migrated to implement `inference.TextModel` and register via `inference.Register()`:
- **go-mlx** (`register_metal.go`, darwin/arm64): `metalBackend{}` + `metalAdapter{}` wrap the internal Metal model. Full `TextModel` coverage including `Classify`, `BatchGenerate`, `Info`, `Metrics`. Build-tagged `darwin && arm64`.
- **go-rocm** (`register_rocm.go`, linux/amd64): `rocmBackend{}` spawns and manages a `llama-server` subprocess. 5,794 LOC. Build-tagged `linux && amd64`.
- **go-ml** (`adapter.go`, `backend_http_textmodel.go`): Two-way bridge. `adapter.go` (118 LOC) wraps `inference.TextModel` into `go-ml`'s internal `Backend`/`StreamingBackend` interfaces. `backend_http_textmodel.go` (135 LOC) provides the reverse: wraps an HTTP llama.cpp server as `inference.TextModel`. `backend_mlx.go` collapsed from 253 to 35 LOC after migration.
### Phase 3 — Extended Interfaces (deferred)
Two interfaces are specified but not yet implemented, pending concrete consumer demand:
**BatchModel** — For throughput-sensitive batch classification (e.g. `go-i18n` processing 5,000 sentences per second):
```go
type BatchModel interface {
TextModel
BatchGenerate(ctx context.Context, prompts []string, opts ...GenerateOption) iter.Seq2[int, Token]
}
```
Note: the current `BatchGenerate` on `TextModel` collects all tokens before returning. A streaming `BatchModel` with `iter.Seq2` would reduce peak memory for large batches.
**StatsModel** — For dashboard and monitoring integrations:
```go
type StatsModel interface {
TextModel
Stats() GenerateStats
}
```
Where `GenerateStats` aggregates `GenerateMetrics` across multiple calls (rolling averages, peak values, histograms).
Neither interface will be added until at least two consumers have a concrete need. The pattern for adding them is: define the interface in this package, update go-mlx and go-rocm to implement it, update go-ml's adapter, then update consumers.
## Known Limitations
**Metrics on CPU backends** — `GenerateMetrics.PeakMemoryBytes` and `ActiveMemoryBytes` are zero for CPU-only backends. There is no protocol for backends to report CPU RAM usage; this was considered unnecessary at the time of design.
**`Discover` scan depth** — `Discover` scans only one level deep. Deeply nested model hierarchies (e.g. `models/org/repo/revision/`) are not found. The consumer is expected to call `Discover` on the correct parent directory.
**`Discover` and invalid JSON** — A `config.json` containing invalid JSON is silently tolerated: the directory is included with an empty `ModelType`. This prevents a single malformed file from hiding all other models in a directory, but it means the returned `DiscoveredModel` may be incomplete.
**No de-registration** — `Register` overwrites silently; there is no `Unregister`. This is intentional for simplicity. Backends registered in `init()` live for the lifetime of the process.
**`Default()` error message** — When all registered backends are unavailable, the error reads "no backends registered" rather than "no backends available". The message is reused from the empty-registry case; this is slightly misleading, but it lets consumers that treat both failures identically match a single error.
**`ParallelSlots` ignored by Metal** — Apple Metal manages concurrency internally. `WithParallelSlots` is accepted by `go-mlx` but has no effect. This is documented in `options.go` but not enforced.
## Future Considerations
- A `StatsModel` interface, when two consumers require aggregated metrics.
- A streaming `BatchModel` with `iter.Seq2[int, Token]` for high-throughput classification.
- Licence headers on all source files (currently absent, tracked informally).
- A formal `CHANGELOG.md` if the package grows beyond its current single-package scope.
- Consideration of `errors.Is`/`errors.As` sentinel errors (e.g. `ErrNoBackend`, `ErrBackendUnavailable`) to allow consumers to handle specific failure modes without string matching.