docs: graduate TODO/FINDINGS into production documentation
Replace internal task tracking (TODO.md, FINDINGS.md) with structured documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>
This commit is contained in:
parent c91e30599c
commit 74b3e7a53d
6 changed files with 710 additions and 118 deletions
66 CLAUDE.md

@@ -4,7 +4,7 @@
Shared inference interfaces for the Core Go ecosystem. Module: `forge.lthn.ai/core/go-inference`
|
||||
|
||||
This package defines the contract between GPU-specific backends (go-mlx on macOS, go-rocm on Linux) and consumers (go-ml, go-ai, go-i18n). It has **zero dependencies** and compiles on all platforms.
|
||||
Zero dependencies. Compiles on all platforms. See `docs/architecture.md` for design rationale.
|
||||
|
||||
## Commands
|
||||
|
||||
|
|
@@ -13,64 +13,34 @@ go test ./... # Run all tests
|
|||
go vet ./... # Vet
|
||||
```
|
||||
|
||||
## Architecture
|
||||
## Stability Rules
|
||||
|
||||
```
|
||||
go-inference (this package)  ← defines TextModel, Backend, Token, Message
          ↑                            ↑
          │                            │
  go-mlx (darwin/arm64)       go-rocm (linux/amd64)
          │                            │
          └────────── go-ml ───────────┘   (wraps backends into scoring engine)
                        ↑
                 go-ai (MCP hub)
```
|
||||
This package is the shared contract. Changes here affect go-mlx, go-rocm, and go-ml simultaneously.
|
||||
|
||||
### Key Types
|
||||
|
||||
| Type | Purpose |
|
||||
|------|---------|
|
||||
| `TextModel` | Core interface: Generate, Chat, Err, Close |
|
||||
| `Backend` | Named engine that can LoadModel → TextModel |
|
||||
| `Token` | Streaming token (ID + Text) |
|
||||
| `Message` | Chat message (Role + Content) |
|
||||
| `GenerateOption` | Functional option for generation (temp, topK, etc.) |
|
||||
| `LoadOption` | Functional option for model loading (backend, GPU layers, etc.) |
|
||||
|
||||
### Backend Registry
|
||||
|
||||
Backends register via `init()` with build tags. Consumers call `LoadModel()` which auto-selects the best available backend:
|
||||
|
||||
```go
|
||||
// Auto-detect: Metal on macOS, ROCm on Linux
|
||||
m, err := inference.LoadModel("/path/to/model/")
|
||||
|
||||
// Explicit backend
|
||||
m, err := inference.LoadModel("/path/", inference.WithBackend("rocm"))
|
||||
```
|
||||
- Never change existing method signatures on `TextModel` or `Backend`
|
||||
- Only add methods when two or more consumers need them
|
||||
- Prefer new interfaces that embed `TextModel` over extending `TextModel` itself
|
||||
- New fields on `GenerateConfig` or `LoadConfig` are safe (zero-value defaults)
|
||||
- All new interface methods require Virgil approval before merging
|
||||
|
||||
## Coding Standards
|
||||
|
||||
- UK English
|
||||
- Zero external dependencies — stdlib only
|
||||
- Tests: testify assert/require
|
||||
- Conventional commits
|
||||
- Zero external dependencies — stdlib only (testify permitted in tests)
|
||||
- Conventional commits: `type(scope): description`
|
||||
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
|
||||
- Licence: EUPL-1.2
|
||||
|
||||
## Consumers
|
||||
|
||||
- **go-mlx**: Implements `Backend` + `TextModel` for Apple Metal (darwin/arm64)
|
||||
- **go-rocm**: Implements `Backend` + `TextModel` for AMD ROCm (linux/amd64)
|
||||
- **go-ml**: Wraps inference backends into scoring engine, adds llama.cpp HTTP backend
|
||||
- **go-mlx**: implements `Backend` + `TextModel` for Apple Metal (darwin/arm64)
|
||||
- **go-rocm**: implements `Backend` + `TextModel` for AMD ROCm (linux/amd64)
|
||||
- **go-ml**: wraps inference backends into scoring engine, adds llama.cpp HTTP backend
|
||||
- **go-ai**: MCP hub, exposes inference via MCP tools
|
||||
- **go-i18n**: Uses TextModel for Gemma3-1B domain classification
|
||||
- **go-i18n**: uses `TextModel` for Gemma3-1B domain classification
|
||||
|
||||
## Stability
|
||||
## Documentation
|
||||
|
||||
This package is the shared contract. Changes here affect all backends and consumers. Keep the interface minimal and stable. Add new methods only when two or more consumers need them.
|
||||
|
||||
## Task Queue
|
||||
|
||||
See `TODO.md` for prioritised work.
|
||||
See `FINDINGS.md` for research notes.
|
||||
- `docs/architecture.md` — interfaces, registry, options, design decisions
|
||||
- `docs/development.md` — prerequisites, build, test patterns, coding standards
|
||||
- `docs/history.md` — completed phases, commit log, known limitations
|
||||
|
|
|
|||
33 FINDINGS.md

@@ -1,33 +0,0 @@
# FINDINGS.md — go-inference Research & Discovery
|
||||
|
||||
---
|
||||
|
||||
## 2026-02-19: Package Creation (Virgil)
|
||||
|
||||
### Motivation
|
||||
|
||||
go-mlx (darwin/arm64) and go-rocm (linux/amd64) both need to implement the same TextModel interface, but go-rocm can't import go-mlx (platform-specific CGO dependency). A shared interface package solves this.
|
||||
|
||||
### Alternatives Considered
|
||||
|
||||
1. **Duplicate interfaces** — Each backend defines its own TextModel. Simple but diverges over time as backends evolve independently. Rejected.
|
||||
2. **Shared interface package** (chosen) — `core/go-inference` defines the contract. ~100 LOC, zero deps, compiles everywhere.
|
||||
3. **Define in go-ml** — go-ml already has Backend/StreamingBackend. But go-ml has heavy deps (DuckDB, Parquet) that backends shouldn't import. Rejected.
|
||||
|
||||
### Interface Design Decisions
|
||||
|
||||
- **`context.Context` on Generate/Chat**: Required for HTTP handler cancellation, timeouts, graceful shutdown. go-ml's current backend_mlx.go already uses ctx.
|
||||
- **`Err() error` on TextModel**: iter.Seq can't carry errors. Consumers check Err() after the iterator stops. Pattern matches database/sql Row.Err().
|
||||
- **`Chat()` on TextModel**: Models own their chat templates (Gemma3, Qwen3, Llama3 all have different formats). Keeping templates in consumers means every consumer duplicates model-specific formatting.
|
||||
- **`Available() bool` on Backend**: Needed for Default() to skip unavailable backends (e.g. ROCm registered but no GPU present).
|
||||
- **`GPULayers` in LoadConfig**: ROCm/llama.cpp support partial GPU offload. Metal always does full offload. Default -1 = all layers.
|
||||
- **`RepeatPenalty` in GenerateConfig**: llama.cpp backends use this heavily. Metal backends can ignore it.
|
||||
|
||||
### Consumer Mapping
|
||||
|
||||
| Consumer | What it imports | How it uses TextModel |
|
||||
|----------|----------------|----------------------|
|
||||
| go-ml | go-inference | Wraps TextModel into its own Backend interface, adds scoring |
|
||||
| go-ai | go-inference (via go-ml) | Exposes via MCP tools |
|
||||
| go-i18n | go-inference | Direct: LoadModel → Generate(WithMaxTokens(1)) for classification |
|
||||
| LEM Lab | go-inference (via go-ml) | Chat streaming for web UI |
|
||||
37 TODO.md

@@ -1,37 +0,0 @@
# TODO.md — go-inference Task Queue
|
||||
|
||||
Dispatched from core/go orchestration. This package is minimal by design.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Foundation — `d76448d` (Charon)
|
||||
|
||||
- [x] **Add tests for option application** — Verify GenerateConfig defaults, all With* options, ApplyGenerateOpts/ApplyLoadOpts behaviour. Comprehensive API tests (1,074 LOC).
|
||||
- [x] **Add tests for backend registry** — Register, Get, List, Default priority order, LoadModel routing.
|
||||
- [x] **Add tests for Default() platform preference** — Verify metal > rocm > llama_cpp ordering.
|
||||
|
||||
## Phase 2: Integration — COMPLETE
|
||||
|
||||
- [x] **go-mlx migration** — `register_metal.go` implements `inference.Backend` via `metalBackend{}` + `metalAdapter{}` wrapping `internal/metal.Model`. Auto-registers via `inference.Register()` in `init()`. Build-tagged `darwin && arm64`. Full TextModel coverage: Generate, Chat, Classify, BatchGenerate, Info, Metrics, Err, Close.
|
||||
- [x] **go-rocm implementation** — `register_rocm.go` implements `inference.Backend` + `inference.TextModel` via llama-server subprocess. Auto-registers via `inference.Register(&rocmBackend{})`. Phase 4 complete (5,794 LOC by Charon).
|
||||
- [x] **go-ml migration** — `adapter.go` bridges `inference.TextModel` → `ml.Backend/StreamingBackend` (118 LOC, 13 tests). `backend_mlx.go` collapsed from 253 to 35 LOC using `inference.LoadModel`. `backend_http_textmodel.go` provides reverse wrappers (135 LOC, 19 tests).
|
||||
|
||||
## Phase 3: Extended Interfaces (when needed)
|
||||
|
||||
- [ ] **BatchModel interface** — When go-i18n needs 5K sentences/sec, add: `type BatchModel interface { TextModel; BatchGenerate(ctx, []string, ...GenerateOption) iter.Seq2[int, Token] }`. Not before it's needed.
|
||||
- [ ] **Stats interface** — When LEM Lab dashboard needs metrics: `type StatsModel interface { TextModel; Stats() GenerateStats }` with tokens/sec, peak memory, GPU util.
|
||||
|
||||
---
|
||||
|
||||
## Design Principles
|
||||
|
||||
1. **Minimal interface** — Only add methods when 2+ consumers need them
|
||||
2. **Zero dependencies** — stdlib only, compiles everywhere
|
||||
3. **Backwards compatible** — New interfaces extend, never modify existing ones
|
||||
4. **Platform agnostic** — No build tags, no CGO, no OS-specific code
|
||||
|
||||
## Workflow
|
||||
|
||||
1. Virgil in core/go manages this package directly (too small for a dedicated Claude)
|
||||
2. Changes here are coordinated with go-mlx and go-rocm Claudes via their TODO.md
|
||||
3. New interface methods require Virgil approval before adding
|
||||
302 docs/architecture.md Normal file

@@ -0,0 +1,302 @@
# Architecture — go-inference
|
||||
|
||||
## Purpose
|
||||
|
||||
`go-inference` is the shared interface contract for text generation backends in the Core Go ecosystem. It defines the types that GPU-specific backends implement and consumers depend on, without itself importing any backend or consumer code.
|
||||
|
||||
Module path: `forge.lthn.ai/core/go-inference`
|
||||
|
||||
## Design Philosophy
|
||||
|
||||
### Zero Dependencies
|
||||
|
||||
The package imports only the Go standard library (`context`, `fmt`, `iter`, `sync`, `time`, `encoding/json`, `os`, `path/filepath`). The sole exception is `testify` in the test tree.
|
||||
|
||||
This is a deliberate constraint. The package sits at the base of a dependency graph where:
|
||||
|
||||
- `go-mlx` pulls in CGO bindings against Apple's Metal framework
|
||||
- `go-rocm` spawns a `llama-server` subprocess with AMD ROCm libraries
|
||||
- `go-ml` links DuckDB and Parquet
|
||||
|
||||
None of those concerns belong in the interface layer. A backend can import `go-inference`; `go-inference` cannot import a backend. A consumer can import `go-inference`; `go-inference` cannot import a consumer.
|
||||
|
||||
### Minimal Interface Surface
|
||||
|
||||
New methods are only added when two or more existing consumers need them. The interfaces are deliberately narrow. Broader capability is achieved through additional interfaces (`BatchModel`, `StatsModel`) that embed `TextModel`, not through extending `TextModel` itself.
|
||||
|
||||
### Platform Agnostic
|
||||
|
||||
No build tags, no `//go:build` constraints, no `CGO_ENABLED` requirements appear in this package. It compiles cleanly on macOS, Linux, and Windows regardless of GPU availability.
|
||||
|
||||
## Ecosystem Position
|
||||
|
||||
```
|
||||
go-inference (this package)  ← defines TextModel, Backend, Token, Message
        │
        │──────── implemented by ─────────────────────┐
        │                                              │
     go-mlx                                         go-rocm
(darwin/arm64, Metal GPU)                  (linux/amd64, AMD ROCm)
        │                                              │
        └───────────────── consumed by ────────────────┘
                             │
                           go-ml
              (scoring engine, llama.cpp HTTP)
                             │
                           go-ai
                     (MCP hub, 30+ tools)
                             │
                          go-i18n
           (domain classification via Gemma3-1B)
```
|
||||
|
||||
`go-ml` also provides a reverse adapter (`backend_http_textmodel.go`) that wraps an HTTP llama.cpp server as a `TextModel`, giving a third backend path without Metal or ROCm.
|
||||
|
||||
## Core Types
|
||||
|
||||
### Token
|
||||
|
||||
```go
|
||||
type Token struct {
|
||||
ID int32
|
||||
Text string
|
||||
}
|
||||
```
|
||||
|
||||
The atomic unit of streaming output. `ID` is the vocabulary index; `Text` is the decoded string. Backends yield these through `iter.Seq[Token]`.
|
||||
|
||||
### Message
|
||||
|
||||
```go
|
||||
type Message struct {
|
||||
Role string `json:"role"` // "system", "user", "assistant"
|
||||
Content string `json:"content"`
|
||||
}
|
||||
```
|
||||
|
||||
A single turn in a multi-turn conversation. JSON tags are present for serialisation through MCP tool payloads and API responses.
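A minimal sketch of how a consumer would put a conversation on the wire; the framing around the payload is up to the consumer:

```go
msgs := []inference.Message{
	{Role: "system", Content: "You are a terse assistant."},
	{Role: "user", Content: "Summarise the stability contract."},
}
// The struct tags produce lowercase "role"/"content" keys on the wire.
payload, err := json.Marshal(msgs)
if err != nil {
	return err
}
fmt.Println(string(payload))
// → [{"role":"system","content":"You are a terse assistant."},{"role":"user","content":"Summarise the stability contract."}]
```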
|
||||
|
||||
### ClassifyResult
|
||||
|
||||
```go
|
||||
type ClassifyResult struct {
|
||||
Token Token
|
||||
Logits []float32
|
||||
}
|
||||
```
|
||||
|
||||
Output from a single prefill-only forward pass. `Logits` is populated only when `WithLogits()` is set; it is empty by default to avoid allocating vocab-sized float arrays for every classification call.
|
||||
|
||||
### BatchResult
|
||||
|
||||
```go
|
||||
type BatchResult struct {
|
||||
Tokens []Token
|
||||
Err error
|
||||
}
|
||||
```
|
||||
|
||||
Per-prompt result from `BatchGenerate`. `Err` carries per-prompt failures (context cancellation, OOM) rather than aborting the entire batch.
|
||||
|
||||
### GenerateMetrics
|
||||
|
||||
```go
|
||||
type GenerateMetrics struct {
|
||||
PromptTokens int
|
||||
GeneratedTokens int
|
||||
PrefillDuration time.Duration
|
||||
DecodeDuration time.Duration
|
||||
TotalDuration time.Duration
|
||||
PrefillTokensPerSec float64
|
||||
DecodeTokensPerSec float64
|
||||
PeakMemoryBytes uint64
|
||||
ActiveMemoryBytes uint64
|
||||
}
|
||||
```
|
||||
|
||||
Performance data for the most recent inference operation. Retrieved via `TextModel.Metrics()` after an iterator is exhausted or a batch call returns. `PeakMemoryBytes` and `ActiveMemoryBytes` are GPU-specific; CPU-only backends may leave them at zero.
|
||||
|
||||
### ModelInfo
|
||||
|
||||
```go
|
||||
type ModelInfo struct {
|
||||
Architecture string
|
||||
VocabSize int
|
||||
NumLayers int
|
||||
HiddenSize int
|
||||
QuantBits int
|
||||
QuantGroup int
|
||||
}
|
||||
```
|
||||
|
||||
Static metadata about a loaded model. `QuantBits` is zero for unquantised (FP16/BF16) models.
|
||||
|
||||
## TextModel Interface
|
||||
|
||||
```go
|
||||
type TextModel interface {
|
||||
Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
|
||||
Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
|
||||
Classify(ctx context.Context, prompts []string, opts ...GenerateOption) ([]ClassifyResult, error)
|
||||
BatchGenerate(ctx context.Context, prompts []string, opts ...GenerateOption) ([]BatchResult, error)
|
||||
ModelType() string
|
||||
Info() ModelInfo
|
||||
Metrics() GenerateMetrics
|
||||
Err() error
|
||||
Close() error
|
||||
}
|
||||
```
|
||||
|
||||
Key design decisions:
|
||||
|
||||
**`context.Context` on streaming methods** — Required for HTTP handler cancellation, request timeouts, and graceful shutdown. The context is checked by backends at token boundaries.
|
||||
|
||||
**`iter.Seq[Token]` return type** — Go 1.23+ range-over-function iterators. The caller ranges over the sequence; the backend controls token production. The iterator pattern avoids channel overhead and lets the backend use direct memory access to GPU buffers.
|
||||
|
||||
**`Err() error`** — `iter.Seq` cannot carry errors alongside values. Following the `database/sql` `Row.Err()` pattern, the error from the most recent `Generate` or `Chat` call is stored internally and retrieved with `Err()` after the iterator finishes. End-of-sequence (EOS token) sets no error; context cancellation and OOM both set one.
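A consumer-side sketch of the pattern, with a placeholder model path; the `Metrics()` read at the end is covered further below:

```go
func generateOnce(ctx context.Context, modelDir string) error {
	m, err := inference.LoadModel(modelDir)
	if err != nil {
		return err
	}
	defer m.Close()

	for tok := range m.Generate(ctx, "Explain ROCm in one sentence.", inference.WithMaxTokens(64)) {
		fmt.Print(tok.Text)
	}
	// iter.Seq cannot carry an error, so check Err() once the range loop exits.
	if err := m.Err(); err != nil {
		return err
	}
	fmt.Printf("\n%.1f tok/s\n", m.Metrics().DecodeTokensPerSec)
	return nil
}
```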
|
||||
|
||||
**`Chat()` on the model** — Chat templates differ across architectures (Gemma3, Qwen3, Llama3 all use distinct formats). Placing template application in the backend means consumers receive already-formatted input regardless of model family. If templates lived in consumers, every consumer would need to duplicate model-specific formatting logic.
|
||||
|
||||
**`Classify()` and `BatchGenerate()`** — Two distinct batch operations with different performance characteristics. `Classify` is prefill-only (single forward pass, no autoregressive loop); it is the fast path for domain labelling in `go-i18n`. `BatchGenerate` runs full autoregressive decoding across multiple prompts in parallel.
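A sketch of the `go-i18n`-style fast path, given a loaded `m`; how the predicted token is mapped to a domain label is up to the consumer:

```go
results, err := m.Classify(ctx, []string{
	"The patient presented with acute symptoms.",
	"Deploy the container to the staging cluster.",
})
if err != nil {
	return err
}
for i, r := range results {
	// r.Token.Text is the single predicted token from the prefill-only pass;
	// r.Logits stays empty unless WithLogits() was passed.
	fmt.Printf("prompt %d → %q\n", i, r.Token.Text)
}
```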
|
||||
|
||||
**`Info()` and `Metrics()`** — Separated from `Generate`/`Chat` because they serve different call sites. `Info()` is called once after load; `Metrics()` is called after each inference operation for performance monitoring.
|
||||
|
||||
## Backend Interface
|
||||
|
||||
```go
|
||||
type Backend interface {
|
||||
Name() string
|
||||
LoadModel(path string, opts ...LoadOption) (TextModel, error)
|
||||
Available() bool
|
||||
}
|
||||
```
|
||||
|
||||
**`Name()`** — Returns the registry key: `"metal"`, `"rocm"`, or `"llama_cpp"`. This is the string passed to `WithBackend()` by consumers.
|
||||
|
||||
**`LoadModel()`** — Accepts a filesystem path to a model directory (containing `config.json` and `.safetensors` weight files) and returns a ready-to-use `TextModel`. The model directory format follows the HuggingFace safetensors layout.
|
||||
|
||||
**`Available()`** — Reports whether the backend can run on the current hardware. This allows a backend to be registered unconditionally (e.g. in a shared binary) while still reporting false on platforms where its GPU runtime is absent. `Default()` skips unavailable backends.
|
||||
|
||||
## Backend Registry
|
||||
|
||||
The registry is a package-level `map[string]Backend` protected by a `sync.RWMutex`. It supports concurrent reads and exclusive writes.
|
||||
|
||||
```go
|
||||
var (
|
||||
backendsMu sync.RWMutex
|
||||
backends = map[string]Backend{}
|
||||
)
|
||||
```
|
||||
|
||||
**Registration** — Backends call `inference.Register(b Backend)` from their `init()` function. The `init()` is guarded by a build tag so it only compiles on the target platform:
|
||||
|
||||
```go
|
||||
// In go-mlx: register_metal.go
|
||||
//go:build darwin && arm64
|
||||
|
||||
func init() { inference.Register(metalBackend{}) }
|
||||
```
|
||||
|
||||
```go
|
||||
// In go-rocm: register_rocm.go
|
||||
//go:build linux && amd64
|
||||
|
||||
func init() { inference.Register(&rocmBackend{}) }
|
||||
```
|
||||
|
||||
Registering a name that already exists silently overwrites the previous entry. This allows test code to replace backends without a separate de-registration step.
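A sketch of what that looks like in a test, assuming it lives in the `inference` package alongside the `resetBackends` helper described in `docs/development.md`; `fakeBackend` is a throwaway type for illustration:

```go
type fakeBackend struct{ name string }

func (f fakeBackend) Name() string    { return f.name }
func (f fakeBackend) Available() bool { return true }
func (f fakeBackend) LoadModel(path string, opts ...LoadOption) (TextModel, error) {
	return nil, fmt.Errorf("fake backend %q: not implemented", f.name)
}

func TestRegister_Ugly_Overwrite(t *testing.T) {
	resetBackends(t)
	Register(fakeBackend{name: "metal"})
	Register(fakeBackend{name: "metal"}) // second call silently replaces the first

	b, ok := Get("metal")
	require.True(t, ok)
	assert.Equal(t, "metal", b.Name())
	assert.Len(t, List(), 1)
}
```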
|
||||
|
||||
**Discovery** — `Get(name)` performs a direct map lookup. `List()` returns all registered names (order undefined). `Default()` walks a priority list:
|
||||
|
||||
```go
|
||||
for _, name := range []string{"metal", "rocm", "llama_cpp"} {
|
||||
if b, ok := backends[name]; ok && b.Available() {
|
||||
return b, nil
|
||||
}
|
||||
}
|
||||
// Fall back to any registered available backend.
|
||||
```
|
||||
|
||||
The priority order encodes hardware preference: Metal (Apple Silicon) delivers the highest throughput for on-device inference on macOS; ROCm is preferred over llama.cpp's HTTP server on Linux because it provides direct GPU memory access without HTTP overhead.
|
||||
|
||||
**`LoadModel()` routing** — The top-level `LoadModel()` function is the primary consumer entry point:
|
||||
|
||||
```go
|
||||
func LoadModel(path string, opts ...LoadOption) (TextModel, error) {
|
||||
cfg := ApplyLoadOpts(opts)
|
||||
if cfg.Backend != "" {
|
||||
b, ok := Get(cfg.Backend)
|
||||
// ... validate and use explicit backend
|
||||
}
|
||||
b, err := Default()
|
||||
// ... use auto-selected backend
|
||||
}
|
||||
```
|
||||
|
||||
Passing `WithBackend("rocm")` bypasses `Default()` entirely. This is the mechanism used in cross-platform binaries or tests that need to pin a specific backend.
|
||||
|
||||
## Functional Options
|
||||
|
||||
Generation and loading are configured through two independent option types, both following the standard Go functional options pattern.
|
||||
|
||||
### GenerateConfig and GenerateOption
|
||||
|
||||
```go
|
||||
type GenerateConfig struct {
|
||||
MaxTokens int
|
||||
Temperature float32
|
||||
TopK int
|
||||
TopP float32
|
||||
StopTokens []int32
|
||||
RepeatPenalty float32
|
||||
ReturnLogits bool
|
||||
}
|
||||
```
|
||||
|
||||
Defaults (from `DefaultGenerateConfig()`): `MaxTokens=256`, `Temperature=0.0` (greedy), all others zero/disabled.
|
||||
|
||||
`ApplyGenerateOpts(opts []GenerateOption) GenerateConfig` is called by backends at the start of each inference operation. Options are applied in order; the last write wins for scalar fields.
|
||||
|
||||
`WithLogits()` is a flag rather than a value option because logit arrays are vocab-sized (256,128 floats for Gemma3) and should only be allocated when explicitly requested.
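A backend-side sketch of how the options are consumed at the start of an operation; `exampleModel` and its `prefill` method are placeholders for a real backend's internals:

```go
func (m *exampleModel) Classify(ctx context.Context, prompts []string, opts ...inference.GenerateOption) ([]inference.ClassifyResult, error) {
	// Defaults first, then options in order; the last write wins for scalar fields.
	cfg := inference.ApplyGenerateOpts(opts)

	results := make([]inference.ClassifyResult, 0, len(prompts))
	for _, p := range prompts {
		tok, logits := m.prefill(ctx, p, cfg) // placeholder for the backend's real forward pass
		r := inference.ClassifyResult{Token: tok}
		if cfg.ReturnLogits {
			// Vocab-sized logits are only surfaced when WithLogits() was requested.
			r.Logits = logits
		}
		results = append(results, r)
	}
	return results, nil
}
```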
|
||||
|
||||
### LoadConfig and LoadOption
|
||||
|
||||
```go
|
||||
type LoadConfig struct {
|
||||
Backend string
|
||||
ContextLen int
|
||||
GPULayers int
|
||||
ParallelSlots int
|
||||
}
|
||||
```
|
||||
|
||||
Default `GPULayers` is `-1`, meaning full GPU offload. `0` forces CPU-only inference. Positive values specify a layer count for partial offload (relevant to ROCm and llama.cpp; Metal always does full offload).
|
||||
|
||||
`ParallelSlots` controls the number of concurrent inference slots the backend allocates. Higher values allow parallel `Generate`/`Chat` calls at the cost of increased VRAM usage. `0` defers to the backend's own default.
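A consumer-side sketch of loading with explicit options; backends that manage concurrency internally may simply ignore the slot count:

```go
// Pin the ROCm backend and ask for four concurrent inference slots.
m, err := inference.LoadModel("/models/gemma3-1b",
	inference.WithBackend("rocm"),
	inference.WithParallelSlots(4),
)
if err != nil {
	return err
}
defer m.Close()
```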
|
||||
|
||||
## Model Discovery
|
||||
|
||||
`Discover(baseDir string) ([]DiscoveredModel, error)` scans one level of a directory tree for model directories. A valid model directory must contain both `config.json` and at least one `.safetensors` file.
|
||||
|
||||
```go
|
||||
type DiscoveredModel struct {
|
||||
Path string
|
||||
ModelType string
|
||||
QuantBits int
|
||||
QuantGroup int
|
||||
NumFiles int
|
||||
}
|
||||
```
|
||||
|
||||
`Path` is always an absolute filesystem path. `ModelType` is read from `config.json`'s `model_type` field. Invalid JSON in `config.json` is silently tolerated — the directory is included with an empty `ModelType`.
|
||||
|
||||
`Discover` also checks whether `baseDir` itself is a model directory and, if so, prepends it to the result so that direct-path usage (`Discover("/models/gemma3-1b")`) works without nesting.
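A short sketch of scanning a models directory (the base path is a placeholder):

```go
models, err := inference.Discover("/models")
if err != nil {
	return err
}
for _, dm := range models {
	// ModelType may be empty if config.json was unreadable, as described above.
	fmt.Printf("%s  %s  %d-bit  (%d weight files)\n",
		dm.Path, dm.ModelType, dm.QuantBits, dm.NumFiles)
}
```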
|
||||
|
||||
## Stability Contract
|
||||
|
||||
This package is the shared contract. Every method signature change here requires coordinated updates to go-mlx, go-rocm, and go-ml. The following rules govern interface evolution:
|
||||
|
||||
1. Existing method signatures are never changed; nothing on `TextModel` or `Backend` is renamed or removed.
|
||||
2. New methods are only added when two or more consumers have a concrete need.
|
||||
3. New capability is expressed as separate interfaces (`BatchModel`, `StatsModel`) that embed `TextModel`, allowing consumers to opt in with a type assertion.
|
||||
4. `GenerateConfig` and `LoadConfig` may gain new fields with zero-value defaults; this is backwards compatible.
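As an illustration of rule 3, consumers opt in to wider capability with a type assertion; `StatsModel` is the not-yet-implemented interface sketched in `docs/history.md`, and `recordDashboardStats` is a hypothetical helper:

```go
// The base contract is always available.
m, err := inference.LoadModel(dir)
if err != nil {
	return err
}
defer m.Close()

// Wider capability is opt-in: consumers type-assert, backends that lack it are unaffected.
if sm, ok := m.(inference.StatsModel); ok {
	recordDashboardStats(sm.Stats()) // hypothetical consumer-side helper
}
```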
|
||||
253 docs/development.md Normal file

@@ -0,0 +1,253 @@
# Development Guide — go-inference
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Go 1.25 or later (uses `iter.Seq` and range-over-function iterators, both stable since Go 1.23)
|
||||
- No CGO, no build tags, no external tools required
|
||||
- The package compiles on macOS, Linux, and Windows without modification
|
||||
|
||||
## Commands
|
||||
|
||||
```bash
|
||||
# Run all tests
|
||||
go test ./...
|
||||
|
||||
# Run a single test by name
|
||||
go test -run TestDefault_Good_Metal ./...
|
||||
|
||||
# Vet for common mistakes
|
||||
go vet ./...
|
||||
|
||||
# View test coverage
|
||||
go test -coverprofile=coverage.out ./...
|
||||
go tool cover -html=coverage.out
|
||||
```
|
||||
|
||||
There is no Taskfile in this package; it is small enough that direct `go` invocations suffice. The parent workspace (`/Users/snider/Code/host-uk/core`) uses Task for cross-repo operations.
|
||||
|
||||
## Go Workspace
|
||||
|
||||
This package is part of the `host-uk/core` Go workspace. After adding or changing module dependencies:
|
||||
|
||||
```bash
|
||||
go work sync
|
||||
```
|
||||
|
||||
The workspace root is `/Users/snider/Code/host-uk/core`. The workspace file (`go.work`) includes this module alongside `cmd/core-gui`, `cmd/bugseti`, and others.
|
||||
|
||||
## Module Path
|
||||
|
||||
```
|
||||
forge.lthn.ai/core/go-inference
|
||||
```
|
||||
|
||||
Import it in consumers:
|
||||
|
||||
```go
|
||||
import "forge.lthn.ai/core/go-inference"
|
||||
```
|
||||
|
||||
Remote: `ssh://git@forge.lthn.ai:2223/core/go-inference.git`
|
||||
|
||||
## Repository Layout
|
||||
|
||||
```
|
||||
go-inference/
|
||||
├── inference.go # TextModel, Backend, Token, Message, registry, LoadModel
|
||||
├── options.go # GenerateConfig, LoadConfig, all With* options
|
||||
├── discover.go # Discover() and DiscoveredModel
|
||||
├── inference_test.go # Tests for registry, LoadModel, all types
|
||||
├── options_test.go # Tests for GenerateConfig, LoadConfig, all options
|
||||
├── discover_test.go # Tests for Discover()
|
||||
├── go.mod
|
||||
├── go.sum
|
||||
├── CLAUDE.md # Agent instructions
|
||||
├── README.md
|
||||
└── docs/
|
||||
├── architecture.md
|
||||
├── development.md
|
||||
└── history.md
|
||||
```
|
||||
|
||||
## Test Patterns
|
||||
|
||||
Tests follow the `_Good`, `_Bad`, `_Ugly` suffix convention used across the Core Go ecosystem:
|
||||
|
||||
- `_Good` — happy path; confirms the documented behaviour works correctly
|
||||
- `_Bad` — expected error conditions; confirms errors are returned with useful messages
|
||||
- `_Ugly` — edge cases, panics, surprising-but-valid behaviour (e.g. last-option-wins, registry overwrites)
|
||||
|
||||
```go
|
||||
func TestDefault_Good_Metal(t *testing.T) { ... }
|
||||
func TestDefault_Bad_NoBackends(t *testing.T) { ... }
|
||||
func TestDefault_Ugly_SkipsUnavailablePreferred(t *testing.T) { ... }
|
||||
```
|
||||
|
||||
### Backend Registry Isolation
|
||||
|
||||
Tests that touch the global backend registry call `resetBackends(t)` first. This helper clears the map and is defined in `inference_test.go`:
|
||||
|
||||
```go
|
||||
func resetBackends(t *testing.T) {
|
||||
t.Helper()
|
||||
backendsMu.Lock()
|
||||
defer backendsMu.Unlock()
|
||||
backends = map[string]Backend{}
|
||||
}
|
||||
```
|
||||
|
||||
Because `resetBackends` is in the `inference` package (not `inference_test`), it has direct access to the unexported `backends` map. Tests must not rely on registration order across test functions; each test that uses the registry must call `resetBackends` at the top.
|
||||
|
||||
### Stub Implementations
|
||||
|
||||
`inference_test.go` provides `stubBackend` and `stubTextModel` — minimal implementations of `Backend` and `TextModel` for use in registry and routing tests. These are in the `inference` package itself (not a separate `_test` package) to allow access to unexported fields.
|
||||
|
||||
When writing new tests, use the existing stubs rather than creating new ones unless you need behaviour the stubs do not support.
|
||||
|
||||
### Table-Driven Tests
|
||||
|
||||
Prefer table-driven tests for options and configuration variants. The existing `TestApplyGenerateOpts_Good`, `TestWithTemperature_Good`, and `TestDefault_Good_PriorityOrder` tests demonstrate the pattern:
|
||||
|
||||
```go
|
||||
tests := []struct {
|
||||
name string
|
||||
val float32
|
||||
want float32
|
||||
}{
|
||||
{"greedy", 0.0, 0.0},
|
||||
{"low", 0.3, 0.3},
|
||||
}
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
cfg := ApplyGenerateOpts([]GenerateOption{WithTemperature(tt.val)})
|
||||
assert.InDelta(t, tt.want, cfg.Temperature, 0.0001)
|
||||
})
|
||||
}
|
||||
```
|
||||
|
||||
### Assertions
|
||||
|
||||
Use `testify/assert` and `testify/require`:
|
||||
|
||||
- `require` for preconditions where failure makes subsequent assertions meaningless (e.g. `require.NoError(t, err)` before using the returned value)
|
||||
- `assert` for all other checks
|
||||
- `assert.InDelta` for float32/float64 comparisons (never `==`)
|
||||
|
||||
## Coding Standards
|
||||
|
||||
### Language
|
||||
|
||||
UK English throughout: colour, organisation, centre, licence (noun), serialise, recognise. American spellings are not accepted in comments, documentation, or error messages.
|
||||
|
||||
### Formatting
|
||||
|
||||
Standard `gofmt` formatting. No custom style rules. Run `gofmt -w .` or `go fmt ./...` before committing.
|
||||
|
||||
### Error Messages
|
||||
|
||||
Error strings start with the package name and a colon, lowercase, no trailing period:
|
||||
|
||||
```go
|
||||
fmt.Errorf("inference: no backends registered (import a backend package)")
|
||||
fmt.Errorf("inference: backend %q not registered", cfg.Backend)
|
||||
fmt.Errorf("inference: backend %q not available on this hardware", cfg.Backend)
|
||||
```
|
||||
|
||||
This convention matches the Go standard library and makes `errors.Is`/`errors.As` wrapping straightforward.
|
||||
|
||||
### Strict Types
|
||||
|
||||
All parameters and return types are explicitly typed. No `interface{}` or `any` outside of test helpers where unavoidable.
|
||||
|
||||
### Dependencies
|
||||
|
||||
No new external dependencies may be added to the production code. The `go.mod` `require` block must remain stdlib-only for non-test code. `testify` is the only permitted test dependency.
|
||||
|
||||
If you find yourself wanting an external library, reconsider the approach. This package is intentionally minimal.
|
||||
|
||||
### Licence Header
|
||||
|
||||
Every new `.go` file must carry the EUPL-1.2 licence header:
|
||||
|
||||
```go
|
||||
// Copyright (c) Lethean Technologies Ltd. All rights reserved.
|
||||
// SPDX-License-Identifier: EUPL-1.2
|
||||
```
|
||||
|
||||
Existing files without this header will be updated in a future housekeeping pass.
|
||||
|
||||
## Commit Guidelines
|
||||
|
||||
Use conventional commits:
|
||||
|
||||
```
|
||||
type(scope): short imperative description
|
||||
|
||||
Longer explanation if needed. UK English. Wrap at 72 characters.
|
||||
```
|
||||
|
||||
Types: `feat`, `fix`, `test`, `docs`, `refactor`, `chore`
|
||||
|
||||
Scope: `inference`, `options`, `discover`, or omit for cross-cutting changes.
|
||||
|
||||
Examples:
|
||||
|
||||
```
|
||||
feat(inference): add WithParallelSlots load option
|
||||
fix(discover): handle config.json with invalid JSON gracefully
|
||||
test(options): add table-driven tests for WithTopP
|
||||
docs: expand architecture section on registry priority
|
||||
```
|
||||
|
||||
Always include the co-author trailer:
|
||||
|
||||
```
|
||||
Co-Authored-By: Virgil <virgil@lethean.io>
|
||||
```
|
||||
|
||||
## Implementing a Backend
|
||||
|
||||
To implement a new backend (e.g. `go-vulkan` for cross-platform GPU inference):
|
||||
|
||||
1. Import `forge.lthn.ai/core/go-inference` in the new module.
|
||||
2. Implement `inference.Backend`:
|
||||
|
||||
```go
|
||||
type vulkanBackend struct{}
|
||||
|
||||
func (b *vulkanBackend) Name() string { return "vulkan" }
|
||||
|
||||
func (b *vulkanBackend) Available() bool {
|
||||
// Check whether Vulkan runtime is present on this host.
|
||||
return vulkan.IsAvailable()
|
||||
}
|
||||
|
||||
func (b *vulkanBackend) LoadModel(path string, opts ...inference.LoadOption) (inference.TextModel, error) {
|
||||
cfg := inference.ApplyLoadOpts(opts)
|
||||
// Load model using cfg.ContextLen, cfg.GPULayers, etc.
|
||||
return &vulkanModel{...}, nil
|
||||
}
|
||||
```
|
||||
|
||||
3. Implement `inference.TextModel` (all nine methods).
|
||||
4. Register in `init()`, guarded by the appropriate build tag:
|
||||
|
||||
```go
|
||||
//go:build linux && (amd64 || arm64)
|
||||
|
||||
func init() { inference.Register(&vulkanBackend{}) }
|
||||
```
|
||||
|
||||
5. Write stub-based tests to confirm the backend registers and `LoadModel` routes correctly without requiring real GPU hardware in CI.
|
||||
|
||||
## Extending the Interface
|
||||
|
||||
Before adding a method to `TextModel` or `Backend`, consider:
|
||||
|
||||
- Do two or more existing consumers require this capability right now?
|
||||
- Can the capability be expressed as a separate interface that embeds `TextModel`?
|
||||
- Will adding this method break existing backend implementations that do not yet provide it?
|
||||
|
||||
If the answer to the first question is no, defer the addition. If a separate interface is sufficient, prefer that approach. See `docs/architecture.md` for the stability contract.
|
||||
|
||||
When a new method is genuinely necessary, coordinate with the owners of go-mlx, go-rocm, and go-ml before merging, since all three must implement the new method simultaneously or the interface will be broken at build time.
|
||||
137 docs/history.md Normal file

@@ -0,0 +1,137 @@
# Project History — go-inference
|
||||
|
||||
## Origin
|
||||
|
||||
`go-inference` was created on 19 February 2026 to solve a dependency inversion problem in the Core Go ecosystem.
|
||||
|
||||
`go-mlx` (Apple Metal inference on darwin/arm64) and `go-rocm` (AMD ROCm inference on linux/amd64) both needed to expose the same `TextModel` interface so that `go-ml` and `go-ai` could treat them interchangeably. The two backends cannot import each other — each carries platform-specific CGO or subprocess dependencies that would break cross-platform compilation.
|
||||
|
||||
Three options were considered:
|
||||
|
||||
1. **Duplicate interfaces** — Each backend defines its own `TextModel`. Simple to start, but the interfaces diverge over time as backends evolve without a shared contract. Rejected.
|
||||
2. **Shared interface package** (chosen) — A new package with zero dependencies defines the contract. ~100 LOC at inception, compiles on all platforms. All backends import it; it imports nothing.
|
||||
3. **Define in go-ml** — `go-ml` already had `Backend` and `StreamingBackend` types. Rejected because `go-ml` carries heavy dependencies (DuckDB, Parquet) that backends should not import.
|
||||
|
||||
## Commit History
|
||||
|
||||
### `fca0ed8` — Initial commit
|
||||
|
||||
Repository scaffolding. `go.mod`, empty `README.md`.
|
||||
|
||||
### `07cd917` — feat: define shared TextModel, Backend, Token, Message interfaces
|
||||
|
||||
First substantive commit. Defined `TextModel`, `Backend`, `Token`, `Message`, the `Register`/`Get`/`List`/`Default`/`LoadModel` registry functions, `GenerateConfig`, `LoadConfig`, and all `With*` options. Established the zero-dependency constraint and the `Default()` priority order (metal > rocm > llama_cpp).
|
||||
|
||||
### `3719734` — feat: add ParallelSlots to LoadConfig for concurrent inference
|
||||
|
||||
Added `WithParallelSlots` to `LoadConfig`. Required for llama.cpp backends that allocate inference slots at load time. Metal backends ignore the field.
|
||||
|
||||
### `2517b77` — feat: add batch inference API (Classify, BatchGenerate)
|
||||
|
||||
Added `Classify` and `BatchGenerate` to `TextModel`, along with `ClassifyResult` and `BatchResult`. `Classify` is a prefill-only fast path (single forward pass, no autoregressive decoding) for domain classification tasks in `go-i18n`. `BatchGenerate` runs full autoregressive decoding across multiple prompts in parallel.
|
||||
|
||||
### `df17676` — feat: add GenerateMetrics type and Metrics() to TextModel
|
||||
|
||||
Added `GenerateMetrics` and `TextModel.Metrics()`. Provides per-operation performance data: token counts, prefill and decode durations, throughput, and GPU memory usage. Required by the LEM Lab dashboard and future monitoring integrations.
|
||||
|
||||
### `28f444c` — feat: add ModelInfo type and Info() to TextModel
|
||||
|
||||
Added `ModelInfo` and `TextModel.Info()`. Provides static metadata about a loaded model: architecture, vocabulary size, layer count, hidden dimension, and quantisation details. Required by `go-ai` MCP tools that surface model information to agents.
|
||||
|
||||
### `884225d` — feat: add Discover() for scanning model directories
|
||||
|
||||
Added `Discover(baseDir string) ([]DiscoveredModel, error)` and `DiscoveredModel`. Scans a directory tree (one level deep) for model directories identified by the presence of `config.json` and `.safetensors` weight files. Used by LEM Lab's model picker UI and `go-ai`'s model listing MCP tool.
|
||||
|
||||
### `c61ec9f` — docs: expand package doc with workflow examples
|
||||
|
||||
Expanded the package-level godoc comment in `inference.go` with complete examples: streaming generation, chat, classification, batch generation, functional options, and model discovery.
|
||||
|
||||
### `15ee86e` — fix: add json struct tags to Message for API serialization
|
||||
|
||||
Added `json:"role"` and `json:"content"` tags to `Message`. Required for correct serialisation through `go-ai`'s MCP tool payloads and the agentic portal's REST API.
|
||||
|
||||
### `d76448d` — test(inference): add comprehensive tests for all exported API
|
||||
|
||||
1,074 lines of Pest-style tests (using Go's `testing` package and `testify`). Comprehensive coverage of:
|
||||
|
||||
- `Register`, `Get`, `List`, `Default`, `LoadModel` — all happy paths, error paths, and edge cases
|
||||
- `Default()` priority order (metal > rocm > llama_cpp > any available)
|
||||
- All `GenerateOption` and `LoadOption` functions
|
||||
- `ApplyGenerateOpts` and `ApplyLoadOpts` — nil options, empty options, last-option-wins
|
||||
- `Discover` — single models, multiple models, quantised models, base-dir-as-model, missing files, invalid JSON
|
||||
- All struct types: `Token`, `Message`, `ClassifyResult`, `BatchResult`, `ModelInfo`, `GenerateMetrics`
|
||||
- Compile-time interface compliance assertions
|
||||
|
||||
Dispatched to Charon (Linux build agent). Commit hash recorded in TODO.md as Phase 1 foundation marker.
|
||||
|
||||
### `85f587a` — docs: mark Phase 1 foundation tests complete (Charon d76448d)
|
||||
|
||||
Updated TODO.md to record Phase 1 completion and Charon's commit hash.
|
||||
|
||||
### `c91e305` — docs: mark Phase 2 integration complete — all 3 backends migrated
|
||||
|
||||
Updated TODO.md to record Phase 2 integration completion across go-mlx, go-rocm, and go-ml.
|
||||
|
||||
## Phase Summary
|
||||
|
||||
### Phase 1 — Foundation (complete)
|
||||
|
||||
Established the interface contract, registry, functional options, model discovery, and comprehensive tests. All exported API covered. No backend implementations in this package.
|
||||
|
||||
### Phase 2 — Integration (complete)
|
||||
|
||||
All three backends migrated to implement `inference.TextModel` and register via `inference.Register()`:
|
||||
|
||||
- **go-mlx** (`register_metal.go`, darwin/arm64): `metalBackend{}` + `metalAdapter{}` wrap the internal Metal model. Full `TextModel` coverage including `Classify`, `BatchGenerate`, `Info`, `Metrics`. Build-tagged `darwin && arm64`.
|
||||
- **go-rocm** (`register_rocm.go`, linux/amd64): `rocmBackend{}` spawns and manages a `llama-server` subprocess. 5,794 LOC. Build-tagged `linux && amd64`.
|
||||
- **go-ml** (`adapter.go`, `backend_http_textmodel.go`): Two-way bridge. `adapter.go` (118 LOC) wraps `inference.TextModel` into `go-ml`'s internal `Backend`/`StreamingBackend` interfaces. `backend_http_textmodel.go` (135 LOC) provides the reverse: wraps an HTTP llama.cpp server as `inference.TextModel`. `backend_mlx.go` collapsed from 253 to 35 LOC after migration.
|
||||
|
||||
### Phase 3 — Extended Interfaces (deferred)
|
||||
|
||||
Two interfaces are specified but not yet implemented, pending concrete consumer demand:
|
||||
|
||||
**BatchModel** — For throughput-sensitive batch classification (e.g. `go-i18n` processing 5,000 sentences per second):
|
||||
|
||||
```go
|
||||
type BatchModel interface {
|
||||
TextModel
|
||||
BatchGenerate(ctx context.Context, prompts []string, opts ...GenerateOption) iter.Seq2[int, Token]
|
||||
}
|
||||
```
|
||||
|
||||
Note: the current `BatchGenerate` on `TextModel` collects all tokens before returning. A streaming `BatchModel` with `iter.Seq2` would reduce peak memory for large batches.
|
||||
|
||||
**StatsModel** — For dashboard and monitoring integrations:
|
||||
|
||||
```go
|
||||
type StatsModel interface {
|
||||
TextModel
|
||||
Stats() GenerateStats
|
||||
}
|
||||
```
|
||||
|
||||
Where `GenerateStats` aggregates `GenerateMetrics` across multiple calls (rolling averages, peak values, histograms).
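`GenerateStats` does not exist yet; purely as a sketch of the aggregation idea, it could look something like this:

```go
// Hypothetical — not part of the current API.
type GenerateStats struct {
	Calls            int
	TotalTokens      int
	MeanDecodeTokSec float64
	PeakDecodeTokSec float64
	PeakMemoryBytes  uint64
}

// Observe folds the metrics of one completed operation into the running aggregate.
func (s *GenerateStats) Observe(m GenerateMetrics) {
	s.Calls++
	s.TotalTokens += m.GeneratedTokens
	s.MeanDecodeTokSec += (m.DecodeTokensPerSec - s.MeanDecodeTokSec) / float64(s.Calls)
	if m.DecodeTokensPerSec > s.PeakDecodeTokSec {
		s.PeakDecodeTokSec = m.DecodeTokensPerSec
	}
	if m.PeakMemoryBytes > s.PeakMemoryBytes {
		s.PeakMemoryBytes = m.PeakMemoryBytes
	}
}
```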
|
||||
|
||||
Neither interface will be added until at least two consumers have a concrete need. The pattern for adding them is: define the interface in this package, update go-mlx and go-rocm to implement it, update go-ml's adapter, then update consumers.
|
||||
|
||||
## Known Limitations
|
||||
|
||||
**Metrics on CPU backends** — `GenerateMetrics.PeakMemoryBytes` and `ActiveMemoryBytes` are zero for CPU-only backends. There is no protocol for backends to report CPU RAM usage; this was considered unnecessary at the time of design.
|
||||
|
||||
**`Discover` scan depth** — `Discover` scans only one level deep. Deeply nested model hierarchies (e.g. `models/org/repo/revision/`) are not found. The consumer is expected to call `Discover` on the correct parent directory.
|
||||
|
||||
**`Discover` and invalid JSON** — A `config.json` containing invalid JSON is silently tolerated: the directory is included with an empty `ModelType`. This prevents a single malformed file from hiding all other models in a directory, but it means the returned `DiscoveredModel` may be incomplete.
|
||||
|
||||
**No de-registration** — `Register` overwrites silently; there is no `Unregister`. This is intentional for simplicity. Backends registered in `init()` live for the lifetime of the process.
|
||||
|
||||
**`Default()` error message** — When all registered backends are unavailable, the error says "no backends registered" rather than "no backends available". This is slightly misleading but matches the no-backends case exactly, which simplifies error handling in consumers that treat both cases identically.
|
||||
|
||||
**`ParallelSlots` ignored by Metal** — Apple Metal manages concurrency internally. `WithParallelSlots` is accepted by `go-mlx` but has no effect. This is documented in `options.go` but not enforced.
|
||||
|
||||
## Future Considerations
|
||||
|
||||
- A `StatsModel` interface, when two consumers require aggregated metrics.
|
||||
- A streaming `BatchModel` with `iter.Seq2[int, Token]` for high-throughput classification.
|
||||
- Licence headers on all source files (currently absent, tracked informally).
|
||||
- A formal `CHANGELOG.md` if the package grows beyond its current single-package scope.
|
||||
- Consideration of `errors.Is`/`errors.As` sentinel errors (e.g. `ErrNoBackend`, `ErrBackendUnavailable`) to allow consumers to handle specific failure modes without string matching.
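As a sketch of that last item (none of these sentinels exist today):

```go
// Hypothetical sentinel errors — not part of the current API.
var (
	ErrNoBackend          = errors.New("inference: no backends registered")
	ErrBackendUnavailable = errors.New("inference: backend not available on this hardware")
)

// Consumers could then branch without string matching:
m, err := inference.LoadModel(dir)
if errors.Is(err, ErrBackendUnavailable) {
	// fall back to a remote llama.cpp endpoint, queue the job, etc.
}
```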
|
||||