diff --git a/CLAUDE.md b/CLAUDE.md index 44c6c3e..650d623 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,7 +4,7 @@ Shared inference interfaces for the Core Go ecosystem. Module: `forge.lthn.ai/core/go-inference` -This package defines the contract between GPU-specific backends (go-mlx on macOS, go-rocm on Linux) and consumers (go-ml, go-ai, go-i18n). It has **zero dependencies** and compiles on all platforms. +Zero dependencies. Compiles on all platforms. See `docs/architecture.md` for design rationale. ## Commands @@ -13,64 +13,34 @@ go test ./... # Run all tests go vet ./... # Vet ``` -## Architecture +## Stability Rules -``` -go-inference (this package) ← defines TextModel, Backend, Token, Message - ↑ ↑ - │ │ -go-mlx (darwin/arm64) go-rocm (linux/amd64) - │ │ - └────── go-ml ───────┘ (wraps backends into scoring engine) - ↑ - go-ai (MCP hub) -``` +This package is the shared contract. Changes here affect go-mlx, go-rocm, and go-ml simultaneously. -### Key Types - -| Type | Purpose | -|------|---------| -| `TextModel` | Core interface: Generate, Chat, Err, Close | -| `Backend` | Named engine that can LoadModel → TextModel | -| `Token` | Streaming token (ID + Text) | -| `Message` | Chat message (Role + Content) | -| `GenerateOption` | Functional option for generation (temp, topK, etc.) | -| `LoadOption` | Functional option for model loading (backend, GPU layers, etc.) | - -### Backend Registry - -Backends register via `init()` with build tags. 
Consumers call `LoadModel()` which auto-selects the best available backend: - -```go -// Auto-detect: Metal on macOS, ROCm on Linux -m, err := inference.LoadModel("/path/to/model/") - -// Explicit backend -m, err := inference.LoadModel("/path/", inference.WithBackend("rocm")) -``` +- Never change existing method signatures on `TextModel` or `Backend` +- Only add methods when two or more consumers need them +- Prefer new interfaces that embed `TextModel` over extending `TextModel` itself +- New fields on `GenerateConfig` or `LoadConfig` are safe (zero-value defaults) +- All new interface methods require Virgil approval before merging ## Coding Standards - UK English -- Zero external dependencies — stdlib only -- Tests: testify assert/require -- Conventional commits +- Zero external dependencies — stdlib only (testify permitted in tests) +- Conventional commits: `type(scope): description` - Co-Author: `Co-Authored-By: Virgil ` - Licence: EUPL-1.2 ## Consumers -- **go-mlx**: Implements `Backend` + `TextModel` for Apple Metal (darwin/arm64) -- **go-rocm**: Implements `Backend` + `TextModel` for AMD ROCm (linux/amd64) -- **go-ml**: Wraps inference backends into scoring engine, adds llama.cpp HTTP backend +- **go-mlx**: implements `Backend` + `TextModel` for Apple Metal (darwin/arm64) +- **go-rocm**: implements `Backend` + `TextModel` for AMD ROCm (linux/amd64) +- **go-ml**: wraps inference backends into scoring engine, adds llama.cpp HTTP backend - **go-ai**: MCP hub, exposes inference via MCP tools -- **go-i18n**: Uses TextModel for Gemma3-1B domain classification +- **go-i18n**: uses `TextModel` for Gemma3-1B domain classification -## Stability +## Documentation -This package is the shared contract. Changes here affect all backends and consumers. Keep the interface minimal and stable. Add new methods only when two or more consumers need them. - -## Task Queue - -See `TODO.md` for prioritised work. -See `FINDINGS.md` for research notes. 
+- `docs/architecture.md` — interfaces, registry, options, design decisions +- `docs/development.md` — prerequisites, build, test patterns, coding standards +- `docs/history.md` — completed phases, commit log, known limitations diff --git a/FINDINGS.md b/FINDINGS.md deleted file mode 100644 index 5a2b8e8..0000000 --- a/FINDINGS.md +++ /dev/null @@ -1,33 +0,0 @@ -# FINDINGS.md — go-inference Research & Discovery - ---- - -## 2026-02-19: Package Creation (Virgil) - -### Motivation - -go-mlx (darwin/arm64) and go-rocm (linux/amd64) both need to implement the same TextModel interface, but go-rocm can't import go-mlx (platform-specific CGO dependency). A shared interface package solves this. - -### Alternatives Considered - -1. **Duplicate interfaces** — Each backend defines its own TextModel. Simple but diverges over time as backends evolve independently. Rejected. -2. **Shared interface package** (chosen) — `core/go-inference` defines the contract. ~100 LOC, zero deps, compiles everywhere. -3. **Define in go-ml** — go-ml already has Backend/StreamingBackend. But go-ml has heavy deps (DuckDB, Parquet) that backends shouldn't import. Rejected. - -### Interface Design Decisions - -- **`context.Context` on Generate/Chat**: Required for HTTP handler cancellation, timeouts, graceful shutdown. go-ml's current backend_mlx.go already uses ctx. -- **`Err() error` on TextModel**: iter.Seq can't carry errors. Consumers check Err() after the iterator stops. Pattern matches database/sql Row.Err(). -- **`Chat()` on TextModel**: Models own their chat templates (Gemma3, Qwen3, Llama3 all have different formats). Keeping templates in consumers means every consumer duplicates model-specific formatting. -- **`Available() bool` on Backend**: Needed for Default() to skip unavailable backends (e.g. ROCm registered but no GPU present). -- **`GPULayers` in LoadConfig**: ROCm/llama.cpp support partial GPU offload. Metal always does full offload. Default -1 = all layers. 
-- **`RepeatPenalty` in GenerateConfig**: llama.cpp backends use this heavily. Metal backends can ignore it. - -### Consumer Mapping - -| Consumer | What it imports | How it uses TextModel | -|----------|----------------|----------------------| -| go-ml | go-inference | Wraps TextModel into its own Backend interface, adds scoring | -| go-ai | go-inference (via go-ml) | Exposes via MCP tools | -| go-i18n | go-inference | Direct: LoadModel → Generate(WithMaxTokens(1)) for classification | -| LEM Lab | go-inference (via go-ml) | Chat streaming for web UI | diff --git a/TODO.md b/TODO.md deleted file mode 100644 index 30d1592..0000000 --- a/TODO.md +++ /dev/null @@ -1,37 +0,0 @@ -# TODO.md — go-inference Task Queue - -Dispatched from core/go orchestration. This package is minimal by design. - ---- - -## Phase 1: Foundation — `d76448d` (Charon) - -- [x] **Add tests for option application** — Verify GenerateConfig defaults, all With* options, ApplyGenerateOpts/ApplyLoadOpts behaviour. Comprehensive API tests (1,074 LOC). -- [x] **Add tests for backend registry** — Register, Get, List, Default priority order, LoadModel routing. -- [x] **Add tests for Default() platform preference** — Verify metal > rocm > llama_cpp ordering. - -## Phase 2: Integration — COMPLETE - -- [x] **go-mlx migration** — `register_metal.go` implements `inference.Backend` via `metalBackend{}` + `metalAdapter{}` wrapping `internal/metal.Model`. Auto-registers via `inference.Register()` in `init()`. Build-tagged `darwin && arm64`. Full TextModel coverage: Generate, Chat, Classify, BatchGenerate, Info, Metrics, Err, Close. -- [x] **go-rocm implementation** — `register_rocm.go` implements `inference.Backend` + `inference.TextModel` via llama-server subprocess. Auto-registers via `inference.Register(&rocmBackend{})`. Phase 4 complete (5,794 LOC by Charon). -- [x] **go-ml migration** — `adapter.go` bridges `inference.TextModel` → `ml.Backend/StreamingBackend` (118 LOC, 13 tests). 
`backend_mlx.go` collapsed from 253 to 35 LOC using `inference.LoadModel`. `backend_http_textmodel.go` provides reverse wrappers (135 LOC, 19 tests). - -## Phase 3: Extended Interfaces (when needed) - -- [ ] **BatchModel interface** — When go-i18n needs 5K sentences/sec, add: `type BatchModel interface { TextModel; BatchGenerate(ctx, []string, ...GenerateOption) iter.Seq2[int, Token] }`. Not before it's needed. -- [ ] **Stats interface** — When LEM Lab dashboard needs metrics: `type StatsModel interface { TextModel; Stats() GenerateStats }` with tokens/sec, peak memory, GPU util. - ---- - -## Design Principles - -1. **Minimal interface** — Only add methods when 2+ consumers need them -2. **Zero dependencies** — stdlib only, compiles everywhere -3. **Backwards compatible** — New interfaces extend, never modify existing ones -4. **Platform agnostic** — No build tags, no CGO, no OS-specific code - -## Workflow - -1. Virgil in core/go manages this package directly (too small for a dedicated Claude) -2. Changes here are coordinated with go-mlx and go-rocm Claudes via their TODO.md -3. New interface methods require Virgil approval before adding diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..2f455c8 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,302 @@ +# Architecture — go-inference + +## Purpose + +`go-inference` is the shared interface contract for text generation backends in the Core Go ecosystem. It defines the types that GPU-specific backends implement and consumers depend on, without itself importing any backend or consumer code. + +Module path: `forge.lthn.ai/core/go-inference` + +## Design Philosophy + +### Zero Dependencies + +The package imports only the Go standard library (`context`, `fmt`, `iter`, `sync`, `time`, `encoding/json`, `os`, `path/filepath`). The sole exception is `testify` in the test tree. + +This is a deliberate constraint. 
The package sits at the base of a dependency graph where: + +- `go-mlx` pulls in CGO bindings against Apple's Metal framework +- `go-rocm` spawns a `llama-server` subprocess with AMD ROCm libraries +- `go-ml` links DuckDB and Parquet + +None of those concerns belong in the interface layer. A backend can import `go-inference`; `go-inference` cannot import a backend. A consumer can import `go-inference`; `go-inference` cannot import a consumer. + +### Minimal Interface Surface + +New methods are only added when two or more existing consumers need them. The interfaces are deliberately narrow. Broader capability is achieved through additional interfaces (`BatchModel`, `StatsModel`) that embed `TextModel`, not through extending `TextModel` itself. + +### Platform Agnostic + +No build tags, no `//go:build` constraints, no `CGO_ENABLED` requirements appear in this package. It compiles cleanly on macOS, Linux, and Windows regardless of GPU availability. + +## Ecosystem Position + +``` +go-inference (this package) ← defines TextModel, Backend, Token, Message + | + |──────── implemented by ────────────────────────────── + | | + go-mlx go-rocm + (darwin/arm64, Metal GPU) (linux/amd64, AMD ROCm) + | | + └───────────────── consumed by ────────────────────────┘ + | + go-ml + (scoring engine, llama.cpp HTTP) + | + go-ai + (MCP hub, 30+ tools) + | + go-i18n + (domain classification via Gemma3-1B) +``` + +`go-ml` also provides a reverse adapter (`backend_http_textmodel.go`) that wraps an HTTP llama.cpp server as a `TextModel`, giving a third backend path without Metal or ROCm. + +## Core Types + +### Token + +```go +type Token struct { + ID int32 + Text string +} +``` + +The atomic unit of streaming output. `ID` is the vocabulary index; `Text` is the decoded string. Backends yield these through `iter.Seq[Token]`. 
+ +### Message + +```go +type Message struct { + Role string `json:"role"` // "system", "user", "assistant" + Content string `json:"content"` +} +``` + +A single turn in a multi-turn conversation. JSON tags are present for serialisation through MCP tool payloads and API responses. + +### ClassifyResult + +```go +type ClassifyResult struct { + Token Token + Logits []float32 +} +``` + +Output from a single prefill-only forward pass. `Logits` is populated only when `WithLogits()` is set; it is empty by default to avoid allocating vocab-sized float arrays for every classification call. + +### BatchResult + +```go +type BatchResult struct { + Tokens []Token + Err error +} +``` + +Per-prompt result from `BatchGenerate`. `Err` carries per-prompt failures (context cancellation, OOM) rather than aborting the entire batch. + +### GenerateMetrics + +```go +type GenerateMetrics struct { + PromptTokens int + GeneratedTokens int + PrefillDuration time.Duration + DecodeDuration time.Duration + TotalDuration time.Duration + PrefillTokensPerSec float64 + DecodeTokensPerSec float64 + PeakMemoryBytes uint64 + ActiveMemoryBytes uint64 +} +``` + +Performance data for the most recent inference operation. Retrieved via `TextModel.Metrics()` after an iterator is exhausted or a batch call returns. `PeakMemoryBytes` and `ActiveMemoryBytes` are GPU-specific; CPU-only backends may leave them at zero. + +### ModelInfo + +```go +type ModelInfo struct { + Architecture string + VocabSize int + NumLayers int + HiddenSize int + QuantBits int + QuantGroup int +} +``` + +Static metadata about a loaded model. `QuantBits` is zero for unquantised (FP16/BF16) models. 
+ +## TextModel Interface + +```go +type TextModel interface { + Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token] + Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token] + Classify(ctx context.Context, prompts []string, opts ...GenerateOption) ([]ClassifyResult, error) + BatchGenerate(ctx context.Context, prompts []string, opts ...GenerateOption) ([]BatchResult, error) + ModelType() string + Info() ModelInfo + Metrics() GenerateMetrics + Err() error + Close() error +} +``` + +Key design decisions: + +**`context.Context` on streaming methods** — Required for HTTP handler cancellation, request timeouts, and graceful shutdown. The context is checked by backends at token boundaries. + +**`iter.Seq[Token]` return type** — Go 1.23+ range-over-function iterators. The caller ranges over the sequence; the backend controls token production. The iterator pattern avoids channel overhead and lets the backend use direct memory access to GPU buffers. + +**`Err() error`** — `iter.Seq` cannot carry errors alongside values. Following the `database/sql` `Row.Err()` pattern, the error from the most recent `Generate` or `Chat` call is stored internally and retrieved with `Err()` after the iterator finishes. End-of-sequence (EOS token) sets no error; context cancellation and OOM both set one. + +**`Chat()` on the model** — Chat templates differ across architectures (Gemma3, Qwen3, Llama3 all use distinct formats). Placing template application in the backend means consumers receive already-formatted input regardless of model family. If templates lived in consumers, every consumer would need to duplicate model-specific formatting logic. + +**`Classify()` and `BatchGenerate()`** — Two distinct batch operations with different performance characteristics. `Classify` is prefill-only (single forward pass, no autoregressive loop); it is the fast path for domain labelling in `go-i18n`. 
`BatchGenerate` runs full autoregressive decoding across multiple prompts in parallel. + +**`Info()` and `Metrics()`** — Separated from `Generate`/`Chat` because they serve different call sites. `Info()` is called once after load; `Metrics()` is called after each inference operation for performance monitoring. + +## Backend Interface + +```go +type Backend interface { + Name() string + LoadModel(path string, opts ...LoadOption) (TextModel, error) + Available() bool +} +``` + +**`Name()`** — Returns the registry key: `"metal"`, `"rocm"`, or `"llama_cpp"`. This is the string passed to `WithBackend()` by consumers. + +**`LoadModel()`** — Accepts a filesystem path to a model directory (containing `config.json` and `.safetensors` weight files) and returns a ready-to-use `TextModel`. The model directory format follows the HuggingFace safetensors layout. + +**`Available()`** — Reports whether the backend can run on the current hardware. This allows a backend to be registered unconditionally (e.g. in a shared binary) while still reporting false on platforms where its GPU runtime is absent. `Default()` skips unavailable backends. + +## Backend Registry + +The registry is a package-level `map[string]Backend` protected by a `sync.RWMutex`. It supports concurrent reads and exclusive writes. + +```go +var ( + backendsMu sync.RWMutex + backends = map[string]Backend{} +) +``` + +**Registration** — Backends call `inference.Register(b Backend)` from their `init()` function. The `init()` is guarded by a build tag so it only compiles on the target platform: + +```go +// In go-mlx: register_metal.go +//go:build darwin && arm64 + +func init() { inference.Register(metalBackend{}) } +``` + +```go +// In go-rocm: register_rocm.go +//go:build linux && amd64 + +func init() { inference.Register(&rocmBackend{}) } +``` + +Registering a name that already exists silently overwrites the previous entry. This allows test code to replace backends without a separate de-registration step. 
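The overwrite-on-register behaviour is worth pinning down, since test code relies on it. A self-contained sketch of the documented semantics; the stub types are illustrative, not the package's real ones:

```go
package main

import (
	"fmt"
	"sync"
)

// Backend is a pared-down stand-in for the interface described above.
type Backend interface {
	Name() string
	Available() bool
}

type stubBackend struct {
	name  string
	avail bool
}

func (s stubBackend) Name() string    { return s.name }
func (s stubBackend) Available() bool { return s.avail }

var (
	backendsMu sync.RWMutex
	backends   = map[string]Backend{}
)

// Register mirrors the documented semantics: re-registering a name
// silently overwrites the previous entry, with no error returned.
func Register(b Backend) {
	backendsMu.Lock()
	defer backendsMu.Unlock()
	backends[b.Name()] = b
}

func Get(name string) (Backend, bool) {
	backendsMu.RLock()
	defer backendsMu.RUnlock()
	b, ok := backends[name]
	return b, ok
}

func main() {
	Register(stubBackend{name: "metal", avail: false})
	Register(stubBackend{name: "metal", avail: true}) // overwrite, no error
	b, _ := Get("metal")
	fmt.Println(b.Available()) // true
}
```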
+ +**Discovery** — `Get(name)` performs a direct map lookup. `List()` returns all registered names (order undefined). `Default()` walks a priority list: + +```go +for _, name := range []string{"metal", "rocm", "llama_cpp"} { + if b, ok := backends[name]; ok && b.Available() { + return b, nil + } +} +// Fall back to any registered available backend. +``` + +The priority order encodes hardware preference: Metal (Apple Silicon) delivers the highest throughput for on-device inference on macOS; ROCm is preferred over llama.cpp's HTTP server on Linux because it provides direct GPU memory access without HTTP overhead. + +**`LoadModel()` routing** — The top-level `LoadModel()` function is the primary consumer entry point: + +```go +func LoadModel(path string, opts ...LoadOption) (TextModel, error) { + cfg := ApplyLoadOpts(opts) + if cfg.Backend != "" { + b, ok := Get(cfg.Backend) + // ... validate and use explicit backend + } + b, err := Default() + // ... use auto-selected backend +} +``` + +Passing `WithBackend("rocm")` bypasses `Default()` entirely. This is the mechanism used in cross-platform binaries or tests that need to pin a specific backend. + +## Functional Options + +Generation and loading are configured through two independent option types, both following the standard Go functional options pattern. + +### GenerateConfig and GenerateOption + +```go +type GenerateConfig struct { + MaxTokens int + Temperature float32 + TopK int + TopP float32 + StopTokens []int32 + RepeatPenalty float32 + ReturnLogits bool +} +``` + +Defaults (from `DefaultGenerateConfig()`): `MaxTokens=256`, `Temperature=0.0` (greedy), all others zero/disabled. + +`ApplyGenerateOpts(opts []GenerateOption) GenerateConfig` is called by backends at the start of each inference operation. Options are applied in order; the last write wins for scalar fields. 
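The last-write-wins rule can be sketched in miniature. The types below are stand-ins mirroring the documented pattern (the real package defines many more options):

```go
package main

import "fmt"

// GenerateConfig and the options below mirror the documented pattern.
type GenerateConfig struct {
	MaxTokens   int
	Temperature float32
}

type GenerateOption func(*GenerateConfig)

func WithMaxTokens(n int) GenerateOption       { return func(c *GenerateConfig) { c.MaxTokens = n } }
func WithTemperature(t float32) GenerateOption { return func(c *GenerateConfig) { c.Temperature = t } }

// ApplyGenerateOpts starts from defaults and applies options in order,
// so the last write to a scalar field wins.
func ApplyGenerateOpts(opts []GenerateOption) GenerateConfig {
	cfg := GenerateConfig{MaxTokens: 256} // defaults: 256 tokens, greedy (0.0)
	for _, o := range opts {
		o(&cfg)
	}
	return cfg
}

func main() {
	cfg := ApplyGenerateOpts([]GenerateOption{
		WithTemperature(0.7),
		WithTemperature(0.2), // last write wins
	})
	fmt.Println(cfg.MaxTokens, cfg.Temperature) // 256 0.2
}
```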
+ +`WithLogits()` is a flag rather than a value option because logit arrays are vocab-sized (256,128 floats for Gemma3) and should only be allocated when explicitly requested. + +### LoadConfig and LoadOption + +```go +type LoadConfig struct { + Backend string + ContextLen int + GPULayers int + ParallelSlots int +} +``` + +Default `GPULayers` is `-1`, meaning full GPU offload. `0` forces CPU-only inference. Positive values specify a layer count for partial offload (relevant to ROCm and llama.cpp; Metal always does full offload). + +`ParallelSlots` controls the number of concurrent inference slots the backend allocates. Higher values allow parallel `Generate`/`Chat` calls at the cost of increased VRAM usage. `0` defers to the backend's own default. + +## Model Discovery + +`Discover(baseDir string) ([]DiscoveredModel, error)` scans one level of a directory tree for model directories. A valid model directory must contain both `config.json` and at least one `.safetensors` file. + +```go +type DiscoveredModel struct { + Path string + ModelType string + QuantBits int + QuantGroup int + NumFiles int +} +``` + +`Path` is always an absolute filesystem path. `ModelType` is read from `config.json`'s `model_type` field. Invalid JSON in `config.json` is silently tolerated — the directory is included with an empty `ModelType`. + +`Discover` also checks whether `baseDir` itself is a model directory and, if so, prepends it to the result so that direct-path usage (`Discover("/models/gemma3-1b")`) works without nesting. + +## Stability Contract + +This package is the shared contract. Every method signature change here requires coordinated updates to go-mlx, go-rocm, and go-ml. The following rules govern interface evolution: + +1. Existing method signatures are never changed. Rename or remove nothing from `TextModel` or `Backend`. +2. New methods are only added when two or more consumers have a concrete need. +3. 
New capability is expressed as separate interfaces (`BatchModel`, `StatsModel`) that embed `TextModel`, allowing consumers to opt in with a type assertion.
+4. `GenerateConfig` and `LoadConfig` may gain new fields with zero-value defaults; this is backwards compatible.
diff --git a/docs/development.md b/docs/development.md
new file mode 100644
index 0000000..9a2781c
--- /dev/null
+++ b/docs/development.md
@@ -0,0 +1,253 @@
+# Development Guide — go-inference
+
+## Prerequisites
+
+- Go 1.25 or later (uses `iter.Seq` and range-over-function iterators, both introduced in Go 1.23)
+- No CGO, no build tags, no external tools required
+- The package compiles on macOS, Linux, and Windows without modification
+
+## Commands
+
+```bash
+# Run all tests
+go test ./...
+
+# Run a single test by name
+go test -run TestDefault_Good_Metal ./...
+
+# Vet for common mistakes
+go vet ./...
+
+# View test coverage
+go test -coverprofile=coverage.out ./...
+go tool cover -html=coverage.out
+```
+
+There is no Taskfile in this package; it is small enough that direct `go` invocations suffice. The parent workspace (`/Users/snider/Code/host-uk/core`) uses Task for cross-repo operations.
+
+## Go Workspace
+
+This package is part of the `host-uk/core` Go workspace. After adding or changing module dependencies:
+
+```bash
+go work sync
+```
+
+The workspace root is `/Users/snider/Code/host-uk/core`. The workspace file (`go.work`) includes this module alongside `cmd/core-gui`, `cmd/bugseti`, and others.
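For orientation, a `go.work` for this layout might look roughly like the following. The directory names are illustrative assumptions, so check the real file at the workspace root:

```
go 1.25

use (
	./go-inference
	./cmd/core-gui
	./cmd/bugseti
)
```

The real file lists every module in the workspace; `go work sync` keeps their `go.mod` files consistent after dependency changes.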
+ +## Module Path + +``` +forge.lthn.ai/core/go-inference +``` + +Import it in consumers: + +```go +import "forge.lthn.ai/core/go-inference" +``` + +Remote: `ssh://git@forge.lthn.ai:2223/core/go-inference.git` + +## Repository Layout + +``` +go-inference/ +├── inference.go # TextModel, Backend, Token, Message, registry, LoadModel +├── options.go # GenerateConfig, LoadConfig, all With* options +├── discover.go # Discover() and DiscoveredModel +├── inference_test.go # Tests for registry, LoadModel, all types +├── options_test.go # Tests for GenerateConfig, LoadConfig, all options +├── discover_test.go # Tests for Discover() +├── go.mod +├── go.sum +├── CLAUDE.md # Agent instructions +├── README.md +└── docs/ + ├── architecture.md + ├── development.md + └── history.md +``` + +## Test Patterns + +Tests follow the `_Good`, `_Bad`, `_Ugly` suffix convention used across the Core Go ecosystem: + +- `_Good` — happy path; confirms the documented behaviour works correctly +- `_Bad` — expected error conditions; confirms errors are returned with useful messages +- `_Ugly` — edge cases, panics, surprising-but-valid behaviour (e.g. last-option-wins, registry overwrites) + +```go +func TestDefault_Good_Metal(t *testing.T) { ... } +func TestDefault_Bad_NoBackends(t *testing.T) { ... } +func TestDefault_Ugly_SkipsUnavailablePreferred(t *testing.T) { ... } +``` + +### Backend Registry Isolation + +Tests that touch the global backend registry call `resetBackends(t)` first. This helper clears the map and is defined in `inference_test.go`: + +```go +func resetBackends(t *testing.T) { + t.Helper() + backendsMu.Lock() + defer backendsMu.Unlock() + backends = map[string]Backend{} +} +``` + +Because `resetBackends` is in the `inference` package (not `inference_test`), it has direct access to the unexported `backends` map. Tests must not rely on registration order across test functions; each test that uses the registry must call `resetBackends` at the top. 
+ +### Stub Implementations + +`inference_test.go` provides `stubBackend` and `stubTextModel` — minimal implementations of `Backend` and `TextModel` for use in registry and routing tests. These are in the `inference` package itself (not a separate `_test` package) to allow access to unexported fields. + +When writing new tests, use the existing stubs rather than creating new ones unless you need behaviour the stubs do not support. + +### Table-Driven Tests + +Prefer table-driven tests for options and configuration variants. The existing `TestApplyGenerateOpts_Good`, `TestWithTemperature_Good`, and `TestDefault_Good_PriorityOrder` tests demonstrate the pattern: + +```go +tests := []struct { + name string + val float32 + want float32 +}{ + {"greedy", 0.0, 0.0}, + {"low", 0.3, 0.3}, +} +for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + cfg := ApplyGenerateOpts([]GenerateOption{WithTemperature(tt.val)}) + assert.InDelta(t, tt.want, cfg.Temperature, 0.0001) + }) +} +``` + +### Assertions + +Use `testify/assert` and `testify/require`: + +- `require` for preconditions where failure makes subsequent assertions meaningless (e.g. `require.NoError(t, err)` before using the returned value) +- `assert` for all other checks +- `assert.InDelta` for float32/float64 comparisons (never `==`) + +## Coding Standards + +### Language + +UK English throughout: colour, organisation, centre, licence (noun), serialise, recognise. American spellings are not accepted in comments, documentation, or error messages. + +### Formatting + +Standard `gofmt` formatting. No custom style rules. Run `gofmt -w .` or `go fmt ./...` before committing. 
+ +### Error Messages + +Error strings start with the package name and a colon, lowercase, no trailing period: + +```go +fmt.Errorf("inference: no backends registered (import a backend package)") +fmt.Errorf("inference: backend %q not registered", cfg.Backend) +fmt.Errorf("inference: backend %q not available on this hardware", cfg.Backend) +``` + +This convention matches the Go standard library and makes `errors.Is`/`errors.As` wrapping straightforward. + +### Strict Types + +All parameters and return types are explicitly typed. No `interface{}` or `any` outside of test helpers where unavoidable. + +### Dependencies + +No new external dependencies may be added to the production code. The `go.mod` `require` block must remain stdlib-only for non-test code. `testify` is the only permitted test dependency. + +If you find yourself wanting an external library, reconsider the approach. This package is intentionally minimal. + +### Licence Header + +Every new `.go` file must carry the EUPL-1.2 licence header: + +```go +// Copyright (c) Lethean Technologies Ltd. All rights reserved. +// SPDX-License-Identifier: EUPL-1.2 +``` + +Existing files without this header will be updated in a future housekeeping pass. + +## Commit Guidelines + +Use conventional commits: + +``` +type(scope): short imperative description + +Longer explanation if needed. UK English. Wrap at 72 characters. +``` + +Types: `feat`, `fix`, `test`, `docs`, `refactor`, `chore` + +Scope: `inference`, `options`, `discover`, or omit for cross-cutting changes. + +Examples: + +``` +feat(inference): add WithParallelSlots load option +fix(discover): handle config.json with invalid JSON gracefully +test(options): add table-driven tests for WithTopP +docs: expand architecture section on registry priority +``` + +Always include the co-author trailer: + +``` +Co-Authored-By: Virgil +``` + +## Implementing a Backend + +To implement a new backend (e.g. `go-vulkan` for cross-platform GPU inference): + +1. 
Import `forge.lthn.ai/core/go-inference` in the new module. +2. Implement `inference.Backend`: + +```go +type vulkanBackend struct{} + +func (b *vulkanBackend) Name() string { return "vulkan" } + +func (b *vulkanBackend) Available() bool { + // Check whether Vulkan runtime is present on this host. + return vulkan.IsAvailable() +} + +func (b *vulkanBackend) LoadModel(path string, opts ...inference.LoadOption) (inference.TextModel, error) { + cfg := inference.ApplyLoadOpts(opts) + // Load model using cfg.ContextLen, cfg.GPULayers, etc. + return &vulkanModel{...}, nil +} +``` + +3. Implement `inference.TextModel` (all nine methods). +4. Register in `init()`, guarded by the appropriate build tag: + +```go +//go:build linux && (amd64 || arm64) + +func init() { inference.Register(&vulkanBackend{}) } +``` + +5. Write stub-based tests to confirm the backend registers and `LoadModel` routes correctly without requiring real GPU hardware in CI. + +## Extending the Interface + +Before adding a method to `TextModel` or `Backend`, consider: + +- Do two or more existing consumers require this capability right now? +- Can the capability be expressed as a separate interface that embeds `TextModel`? +- Will adding this method break existing backend implementations that do not yet provide it? + +If the answer to the first question is no, defer the addition. If a separate interface is sufficient, prefer that approach. See `docs/architecture.md` for the stability contract. + +When a new method is genuinely necessary, coordinate with the owners of go-mlx, go-rocm, and go-ml before merging, since all three must implement the new method simultaneously or the interface will be broken at build time. 
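The "separate interface that embeds `TextModel`" route described above resolves to an ordinary Go type assertion at the call site. A self-contained sketch, where every type and the `StatsSummary` method are hypothetical stand-ins rather than this package's API:

```go
package main

import "fmt"

// TextModel here is a minimal stand-in for the real interface.
type TextModel interface {
	ModelType() string
}

// StatsModel shows the opt-in extension pattern: a separate interface
// embedding TextModel, discovered with a type assertion.
type StatsModel interface {
	TextModel
	StatsSummary() string // hypothetical method, for illustration only
}

type basicModel struct{}

func (basicModel) ModelType() string { return "basic" }

type statsModel struct{ basicModel }

func (statsModel) StatsSummary() string { return "128 tok/s" }

// describe uses the extra capability when present and degrades
// gracefully when the backend has not opted in.
func describe(m TextModel) string {
	if s, ok := m.(StatsModel); ok {
		return m.ModelType() + ": " + s.StatsSummary()
	}
	return m.ModelType() + ": no stats"
}

func main() {
	fmt.Println(describe(basicModel{})) // basic: no stats
	fmt.Println(describe(statsModel{})) // basic: 128 tok/s
}
```

Because the assertion degrades gracefully, older backends keep compiling and working while newer ones advertise the extra interface.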
diff --git a/docs/history.md b/docs/history.md new file mode 100644 index 0000000..bfb9ce9 --- /dev/null +++ b/docs/history.md @@ -0,0 +1,137 @@ +# Project History — go-inference + +## Origin + +`go-inference` was created on 19 February 2026 to solve a dependency inversion problem in the Core Go ecosystem. + +`go-mlx` (Apple Metal inference on darwin/arm64) and `go-rocm` (AMD ROCm inference on linux/amd64) both needed to expose the same `TextModel` interface so that `go-ml` and `go-ai` could treat them interchangeably. The two backends cannot import each other — each carries platform-specific CGO or subprocess dependencies that would break cross-platform compilation. + +Three options were considered: + +1. **Duplicate interfaces** — Each backend defines its own `TextModel`. Simple to start, but the interfaces diverge over time as backends evolve without a shared contract. Rejected. +2. **Shared interface package** (chosen) — A new package with zero dependencies defines the contract. ~100 LOC at inception, compiles on all platforms. All backends import it; it imports nothing. +3. **Define in go-ml** — `go-ml` already had `Backend` and `StreamingBackend` types. Rejected because `go-ml` carries heavy dependencies (DuckDB, Parquet) that backends should not import. + +## Commit History + +### `fca0ed8` — Initial commit + +Repository scaffolding. `go.mod`, empty `README.md`. + +### `07cd917` — feat: define shared TextModel, Backend, Token, Message interfaces + +First substantive commit. Defined `TextModel`, `Backend`, `Token`, `Message`, the `Register`/`Get`/`List`/`Default`/`LoadModel` registry functions, `GenerateConfig`, `LoadConfig`, and all `With*` options. Established the zero-dependency constraint and the `Default()` priority order (metal > rocm > llama_cpp). + +### `3719734` — feat: add ParallelSlots to LoadConfig for concurrent inference + +Added `WithParallelSlots` to `LoadConfig`. Required for llama.cpp backends that allocate inference slots at load time. 
Metal backends ignore the field. + +### `2517b77` — feat: add batch inference API (Classify, BatchGenerate) + +Added `Classify` and `BatchGenerate` to `TextModel`, along with `ClassifyResult` and `BatchResult`. `Classify` is a prefill-only fast path (single forward pass, no autoregressive decoding) for domain classification tasks in `go-i18n`. `BatchGenerate` runs full autoregressive decoding across multiple prompts in parallel. + +### `df17676` — feat: add GenerateMetrics type and Metrics() to TextModel + +Added `GenerateMetrics` and `TextModel.Metrics()`. Provides per-operation performance data: token counts, prefill and decode durations, throughput, and GPU memory usage. Required by the LEM Lab dashboard and future monitoring integrations. + +### `28f444c` — feat: add ModelInfo type and Info() to TextModel + +Added `ModelInfo` and `TextModel.Info()`. Provides static metadata about a loaded model: architecture, vocabulary size, layer count, hidden dimension, and quantisation details. Required by `go-ai` MCP tools that surface model information to agents. + +### `884225d` — feat: add Discover() for scanning model directories + +Added `Discover(baseDir string) ([]DiscoveredModel, error)` and `DiscoveredModel`. Scans a directory tree (one level deep) for model directories identified by the presence of `config.json` and `.safetensors` weight files. Used by LEM Lab's model picker UI and `go-ai`'s model listing MCP tool. + +### `c61ec9f` — docs: expand package doc with workflow examples + +Expanded the package-level godoc comment in `inference.go` with complete examples: streaming generation, chat, classification, batch generation, functional options, and model discovery. + +### `15ee86e` — fix: add json struct tags to Message for API serialization + +Added `json:"role"` and `json:"content"` tags to `Message`. Required for correct serialisation through `go-ai`'s MCP tool payloads and the agentic portal's REST API. 
+ +### `d76448d` — test(inference): add comprehensive tests for all exported API + +1,074 lines of Pest-style tests (using Go's `testing` package and `testify`). Comprehensive coverage of: + +- `Register`, `Get`, `List`, `Default`, `LoadModel` — all happy paths, error paths, and edge cases +- `Default()` priority order (metal > rocm > llama_cpp > any available) +- All `GenerateOption` and `LoadOption` functions +- `ApplyGenerateOpts` and `ApplyLoadOpts` — nil options, empty options, last-option-wins +- `Discover` — single models, multiple models, quantised models, base-dir-as-model, missing files, invalid JSON +- All struct types: `Token`, `Message`, `ClassifyResult`, `BatchResult`, `ModelInfo`, `GenerateMetrics` +- Compile-time interface compliance assertions + +Dispatched to Charon (Linux build agent). Commit hash recorded in TODO.md as Phase 1 foundation marker. + +### `85f587a` — docs: mark Phase 1 foundation tests complete (Charon d76448d) + +Updated TODO.md to record Phase 1 completion and Charon's commit hash. + +### `c91e305` — docs: mark Phase 2 integration complete — all 3 backends migrated + +Updated TODO.md to record Phase 2 integration completion across go-mlx, go-rocm, and go-ml. + +## Phase Summary + +### Phase 1 — Foundation (complete) + +Established the interface contract, registry, functional options, model discovery, and comprehensive tests. All exported API covered. No backend implementations in this package. + +### Phase 2 — Integration (complete) + +All three backends migrated to implement `inference.TextModel` and register via `inference.Register()`: + +- **go-mlx** (`register_metal.go`, darwin/arm64): `metalBackend{}` + `metalAdapter{}` wrap the internal Metal model. Full `TextModel` coverage including `Classify`, `BatchGenerate`, `Info`, `Metrics`. Build-tagged `darwin && arm64`. +- **go-rocm** (`register_rocm.go`, linux/amd64): `rocmBackend{}` spawns and manages a `llama-server` subprocess. 5,794 LOC. Build-tagged `linux && amd64`. 
+- **go-ml** (`adapter.go`, `backend_http_textmodel.go`): Two-way bridge. `adapter.go` (118 LOC) wraps `inference.TextModel` into `go-ml`'s internal `Backend`/`StreamingBackend` interfaces. `backend_http_textmodel.go` (135 LOC) provides the reverse: wraps an HTTP llama.cpp server as `inference.TextModel`. `backend_mlx.go` collapsed from 253 to 35 LOC after migration. + +### Phase 3 — Extended Interfaces (deferred) + +Two interfaces are specified but not yet implemented, pending concrete consumer demand: + +**BatchModel** — For throughput-sensitive batch classification (e.g. `go-i18n` processing 5,000 sentences per second): + +```go +type BatchModel interface { + TextModel + BatchStream(ctx context.Context, prompts []string, opts ...GenerateOption) iter.Seq2[int, Token] +} +``` + +Note: the current `BatchGenerate` on `TextModel` collects all tokens before returning, and Go rejects an interface that embeds `TextModel` while redeclaring `BatchGenerate` with a different signature, so the streaming method carries a distinct, provisional name (`BatchStream`). Streaming via `iter.Seq2` would reduce peak memory for large batches. + +**StatsModel** — For dashboard and monitoring integrations: + +```go +type StatsModel interface { + TextModel + Stats() GenerateStats +} +``` + +Where `GenerateStats` aggregates `GenerateMetrics` across multiple calls (rolling averages, peak values, histograms). + +Neither interface will be added until at least two consumers have a concrete need. The pattern for adding them is: define the interface in this package, update go-mlx and go-rocm to implement it, update go-ml's adapter, then update consumers. + +## Known Limitations + +**Metrics on CPU backends** — `GenerateMetrics.PeakMemoryBytes` and `ActiveMemoryBytes` are zero for CPU-only backends. There is no protocol for backends to report CPU RAM usage; this was considered unnecessary at the time of design. + +**`Discover` scan depth** — `Discover` scans only one level deep. Deeply nested model hierarchies (e.g. `models/org/repo/revision/`) are not found. The consumer is expected to call `Discover` on the correct parent directory.
+ +**`Discover` and invalid JSON** — A `config.json` containing invalid JSON is silently tolerated: the directory is included with an empty `ModelType`. This prevents a single malformed file from hiding all other models in a directory, but it means the returned `DiscoveredModel` may be incomplete. + +**No de-registration** — `Register` overwrites silently; there is no `Unregister`. This is intentional for simplicity. Backends registered in `init()` live for the lifetime of the process. + +**`Default()` error message** — When all registered backends are unavailable, the error says "no backends registered" rather than "no backends available". This is slightly misleading but matches the no-backends case exactly, which simplifies error handling in consumers that treat both cases identically. + +**`ParallelSlots` ignored by Metal** — Apple Metal manages concurrency internally. `WithParallelSlots` is accepted by `go-mlx` but has no effect. This is documented in `options.go` but not enforced. + +## Future Considerations + +- A `StatsModel` interface, when two consumers require aggregated metrics. +- A streaming `BatchModel` with `iter.Seq2[int, Token]` for high-throughput classification. +- Licence headers on all source files (currently absent, tracked informally). +- A formal `CHANGELOG.md` if the package grows beyond its current single-package scope. +- Consideration of `errors.Is`/`errors.As` sentinel errors (e.g. `ErrNoBackend`, `ErrBackendUnavailable`) to allow consumers to handle specific failure modes without string matching.