feat: define shared TextModel, Backend, Token, Message interfaces

Zero-dependency interface package for the Core inference ecosystem.
Backends (go-mlx, go-rocm) implement these interfaces.
Consumers (go-ml, go-ai, go-i18n) import them.

Includes:
- TextModel: Generate, Chat, Err, Close (with context.Context)
- Backend: Named engine registry with platform preference
- Functional options: WithMaxTokens, WithTemperature, WithTopK, etc.
- LoadModel: Auto-selects best available backend

Co-Authored-By: Virgil <virgil@lethean.io>
Snider 2026-02-19 19:37:27 +00:00
parent fca0ed8e16
commit 07cd917259
6 changed files with 392 additions and 0 deletions

CLAUDE.md Normal file
@@ -0,0 +1,76 @@
# CLAUDE.md
## What This Is
Shared inference interfaces for the Core Go ecosystem. Module: `forge.lthn.ai/core/go-inference`
This package defines the contract between GPU-specific backends (go-mlx on macOS, go-rocm on Linux) and consumers (go-ml, go-ai, go-i18n). It has **zero dependencies** and compiles on all platforms.
## Commands
```bash
go test ./... # Run all tests
go vet ./... # Vet
```
## Architecture
```
go-inference (this package) ← defines TextModel, Backend, Token, Message
        ↑                        ↑
        │                        │
go-mlx (darwin/arm64)    go-rocm (linux/amd64)
        │                        │
        └──────── go-ml ─────────┘   (wraps backends into scoring engine)
                    │
                 go-ai (MCP hub)
```
### Key Types
| Type | Purpose |
|------|---------|
| `TextModel` | Core interface: Generate, Chat, ModelType, Err, Close |
| `Backend` | Named engine that can LoadModel → TextModel |
| `Token` | Streaming token (ID + Text) |
| `Message` | Chat message (Role + Content) |
| `GenerateOption` | Functional option for generation (temp, topK, etc.) |
| `LoadOption` | Functional option for model loading (backend, GPU layers, etc.) |
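A minimal consumer sketch tying these types together (model path and messages are illustrative):
```go
m, err := inference.LoadModel("/path/to/model/")
if err != nil {
	log.Fatal(err)
}
defer m.Close()

msgs := []inference.Message{
	{Role: "system", Content: "You are a terse assistant."},
	{Role: "user", Content: "Name the capital of France."},
}
for tok := range m.Chat(context.Background(), msgs, inference.WithMaxTokens(64)) {
	fmt.Print(tok.Text)
}
if err := m.Err(); err != nil { // iterator stopped: EOS or failure?
	log.Fatal(err)
}
```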
### Backend Registry
Backends register via `init()` with build tags. Consumers call `LoadModel()`, which auto-selects the best available backend:
```go
// Auto-detect: Metal on macOS, ROCm on Linux
m, err := inference.LoadModel("/path/to/model/")
// Explicit backend
m, err := inference.LoadModel("/path/", inference.WithBackend("rocm"))
```
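On the implementing side, a backend satisfies `Backend` and registers itself from `init()`. A sketch of the shape (package name and probe are illustrative, not go-rocm's actual code):
```go
//go:build linux && amd64

package rocm

import (
	"errors"

	inference "forge.lthn.ai/core/go-inference"
)

type backend struct{}

func probeGPU() bool { return true } // stub; a real backend probes for a usable GPU

func (b *backend) Name() string    { return "rocm" }
func (b *backend) Available() bool { return probeGPU() }

func (b *backend) LoadModel(path string, opts ...inference.LoadOption) (inference.TextModel, error) {
	cfg := inference.ApplyLoadOpts(opts) // honours ContextLen, GPULayers, ...
	_ = cfg
	// ... allocate GPU resources, load weights, return a TextModel ...
	return nil, errors.New("rocm: sketch only")
}

func init() { inference.Register(&backend{}) }
```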
## Coding Standards
- UK English
- Zero external dependencies — stdlib only
- Tests: testify assert/require
- Conventional commits
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
- Licence: EUPL-1.2
## Consumers
- **go-mlx**: Implements `Backend` + `TextModel` for Apple Metal (darwin/arm64)
- **go-rocm**: Implements `Backend` + `TextModel` for AMD ROCm (linux/amd64)
- **go-ml**: Wraps inference backends into scoring engine, adds llama.cpp HTTP backend
- **go-ai**: MCP hub, exposes inference via MCP tools
- **go-i18n**: Uses TextModel for Gemma3-1B domain classification (see the sketch below)
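The go-i18n pattern reduces to a single-token `Generate` call. A sketch, continuing from a loaded model `m` (prompt wording is illustrative, not go-i18n's actual code):
```go
// Single-token classification: cap generation at one token, read it as the label.
prompt := `Classify the domain of: "invoice total due". Answer with one word: `
var label string
for tok := range m.Generate(ctx, prompt, inference.WithMaxTokens(1)) {
	label += tok.Text
}
if err := m.Err(); err != nil {
	log.Fatal(err)
}
fmt.Println("domain:", label)
```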
## Stability
This package is the shared contract. Changes here affect all backends and consumers. Keep the interface minimal and stable. Add new methods only when two or more consumers need them.
## Task Queue
See `TODO.md` for prioritised work.
See `FINDINGS.md` for research notes.

FINDINGS.md Normal file
@@ -0,0 +1,33 @@
# FINDINGS.md — go-inference Research & Discovery
---
## 2026-02-19: Package Creation (Virgil)
### Motivation
go-mlx (darwin/arm64) and go-rocm (linux/amd64) both need to implement the same TextModel interface, but go-rocm can't import go-mlx (platform-specific CGO dependency). A shared interface package solves this.
### Alternatives Considered
1. **Duplicate interfaces** — Each backend defines its own TextModel. Simple but diverges over time as backends evolve independently. Rejected.
2. **Shared interface package** (chosen) — `core/go-inference` defines the contract. ~100 LOC, zero deps, compiles everywhere.
3. **Define in go-ml** — go-ml already has Backend/StreamingBackend. But go-ml has heavy deps (DuckDB, Parquet) that backends shouldn't import. Rejected.
### Interface Design Decisions
- **`context.Context` on Generate/Chat**: Required for HTTP handler cancellation, timeouts, graceful shutdown. go-ml's current backend_mlx.go already uses ctx.
- **`Err() error` on TextModel**: iter.Seq can't carry errors, so consumers check Err() after the iterator stops (see the sketch after this list). The pattern matches database/sql's Row.Err().
- **`Chat()` on TextModel**: Models own their chat templates (Gemma3, Qwen3, Llama3 all have different formats). Keeping templates in consumers means every consumer duplicates model-specific formatting.
- **`Available() bool` on Backend**: Needed for Default() to skip unavailable backends (e.g. ROCm registered but no GPU present).
- **`GPULayers` in LoadConfig**: ROCm/llama.cpp support partial GPU offload. Metal always does full offload. Default -1 = all layers.
- **`RepeatPenalty` in GenerateConfig**: llama.cpp backends use this heavily. Metal backends can ignore it.
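The consumer-side shape the Err() decision implies, as a minimal sketch (`m`, `ctx` and `prompt` assumed in scope):
```go
// Range until the iterator stops, then ask the model why it stopped.
for tok := range m.Generate(ctx, prompt) {
	fmt.Print(tok.Text)
}
if err := m.Err(); err != nil {
	return err // mid-stream failure, not a normal EOS
}
```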
### Consumer Mapping
| Consumer | What it imports | How it uses TextModel |
|----------|----------------|----------------------|
| go-ml | go-inference | Wraps TextModel into its own Backend interface, adds scoring |
| go-ai | go-inference (via go-ml) | Exposes via MCP tools |
| go-i18n | go-inference | Direct: LoadModel → Generate(WithMaxTokens(1)) for classification |
| LEM Lab | go-inference (via go-ml) | Chat streaming for web UI |

TODO.md Normal file
@@ -0,0 +1,37 @@
# TODO.md — go-inference Task Queue
Dispatched from core/go orchestration. This package is minimal by design.
---
## Phase 1: Foundation
- [ ] **Add tests for option application** — Verify GenerateConfig defaults, all With* options, and ApplyGenerateOpts/ApplyLoadOpts behaviour (a starting sketch follows this list).
- [ ] **Add tests for backend registry** — Register, Get, List, Default priority order, LoadModel routing.
- [ ] **Add tests for Default() platform preference** — Verify metal > rocm > llama_cpp ordering.
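A starting sketch for the option-application test (stdlib `testing` shown; the coding standards call for testify once it is added as a test dependency):
```go
package inference

import "testing"

func TestApplyGenerateOpts(t *testing.T) {
	// Defaults: MaxTokens 256, greedy temperature.
	cfg := ApplyGenerateOpts(nil)
	if cfg.MaxTokens != 256 || cfg.Temperature != 0 {
		t.Fatalf("unexpected defaults: %+v", cfg)
	}

	// Options override defaults.
	cfg = ApplyGenerateOpts([]GenerateOption{
		WithMaxTokens(64),
		WithTemperature(0.7),
		WithTopK(40),
	})
	if cfg.MaxTokens != 64 || cfg.Temperature != 0.7 || cfg.TopK != 40 {
		t.Fatalf("options not applied: %+v", cfg)
	}
}
```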
## Phase 2: Integration
- [ ] **go-mlx migration** — go-mlx Phase 4 backend abstraction should import go-inference instead of defining its own TextModel/Backend. Update go-mlx's design doc and plan to reference this package.
- [ ] **go-rocm implementation** — go-rocm implements inference.Backend + inference.TextModel.
- [ ] **go-ml migration** — go-ml's Backend/StreamingBackend should align with or wrap inference.TextModel. The go-ml Backend adds context.Context + non-streaming helpers on top (see the helper sketch after this list).
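A sketch of the kind of non-streaming helper go-ml might layer on top (the name `GenerateText` and its placement are assumptions, not go-ml's actual API):
```go
import (
	"context"
	"strings"

	inference "forge.lthn.ai/core/go-inference"
)

// GenerateText is a hypothetical non-streaming wrapper: it drains the token
// stream into a string and surfaces Err() as an ordinary return value.
func GenerateText(ctx context.Context, m inference.TextModel, prompt string, opts ...inference.GenerateOption) (string, error) {
	var sb strings.Builder
	for tok := range m.Generate(ctx, prompt, opts...) {
		sb.WriteString(tok.Text)
	}
	if err := m.Err(); err != nil {
		return "", err
	}
	return sb.String(), nil
}
```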
## Phase 3: Extended Interfaces (when needed)
- [ ] **BatchModel interface** — When go-i18n needs 5K sentences/sec, add: `type BatchModel interface { TextModel; BatchGenerate(ctx, []string, ...GenerateOption) iter.Seq2[int, Token] }`. Not before it's needed.
- [ ] **Stats interface** — When LEM Lab dashboard needs metrics: `type StatsModel interface { TextModel; Stats() GenerateStats }` with tokens/sec, peak memory, GPU util.
---
## Design Principles
1. **Minimal interface** — Only add methods when 2+ consumers need them
2. **Zero dependencies** — stdlib only, compiles everywhere
3. **Backwards compatible** — New interfaces extend, never modify existing ones
4. **Platform agnostic** — No build tags, no CGO, no OS-specific code
## Workflow
1. Virgil in core/go manages this package directly (too small for a dedicated Claude)
2. Changes here are coordinated with go-mlx and go-rocm Claudes via their TODO.md
3. New interface methods require Virgil approval before adding

go.mod Normal file
@@ -0,0 +1,3 @@
module forge.lthn.ai/core/go-inference

go 1.25.5

inference.go Normal file
@@ -0,0 +1,145 @@
// Package inference defines shared interfaces for text generation backends.
//
// This package is the contract between GPU-specific backends (go-mlx, go-rocm)
// and consumers (go-ml, go-ai, go-i18n). It has zero dependencies and compiles
// on all platforms.
//
// Backend implementations register via init() with build tags:
//
//	// go-mlx: //go:build darwin && arm64
//	func init() { inference.Register(metal.NewBackend()) }
//
//	// go-rocm: //go:build linux && amd64
//	func init() { inference.Register(rocm.NewBackend()) }
//
// Consumers load models via the registry:
//
//	m, err := inference.LoadModel("/path/to/model/")
//	defer m.Close()
//	for tok := range m.Generate(ctx, "prompt", inference.WithMaxTokens(128)) {
//		fmt.Print(tok.Text)
//	}
package inference

import (
	"context"
	"fmt"
	"iter"
	"sync"
)

// Token represents a single generated token for streaming.
type Token struct {
	ID   int32
	Text string
}

// Message represents a chat message for multi-turn conversation.
type Message struct {
	Role    string // "system", "user", "assistant"
	Content string
}

// TextModel generates text from a loaded model.
type TextModel interface {
	// Generate streams tokens for the given prompt.
	Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]

	// Chat streams tokens from a multi-turn conversation.
	// The model applies its native chat template.
	Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]

	// ModelType returns the architecture identifier (e.g. "gemma3", "qwen3", "llama3").
	ModelType() string

	// Err returns the error from the last Generate/Chat call, if any.
	// Check this after the iterator stops to distinguish EOS from errors.
	Err() error

	// Close releases all resources (GPU memory, caches, subprocess).
	Close() error
}

// Backend is a named inference engine that can load models.
type Backend interface {
	// Name returns the backend identifier (e.g. "metal", "rocm", "llama_cpp").
	Name() string

	// LoadModel loads a model from the given path.
	LoadModel(path string, opts ...LoadOption) (TextModel, error)

	// Available reports whether this backend can run on the current hardware.
	Available() bool
}

var (
	backendsMu sync.RWMutex
	backends   = map[string]Backend{}
)

// Register adds a backend to the registry. Typically called from init().
func Register(b Backend) {
	backendsMu.Lock()
	defer backendsMu.Unlock()
	backends[b.Name()] = b
}

// Get returns a registered backend by name.
func Get(name string) (Backend, bool) {
	backendsMu.RLock()
	defer backendsMu.RUnlock()
	b, ok := backends[name]
	return b, ok
}

// List returns the names of all registered backends.
func List() []string {
	backendsMu.RLock()
	defer backendsMu.RUnlock()
	names := make([]string, 0, len(backends))
	for name := range backends {
		names = append(names, name)
	}
	return names
}

// Default returns the first available backend.
// Prefers "metal" on macOS, "rocm" on Linux, then any registered backend.
func Default() (Backend, error) {
	backendsMu.RLock()
	defer backendsMu.RUnlock()
	// Platform preference order.
	for _, name := range []string{"metal", "rocm", "llama_cpp"} {
		if b, ok := backends[name]; ok && b.Available() {
			return b, nil
		}
	}
	// Fall back to any available backend.
	for _, b := range backends {
		if b.Available() {
			return b, nil
		}
	}
	return nil, fmt.Errorf("inference: no available backend (import a backend package)")
}

// LoadModel loads a model using the specified or default backend.
func LoadModel(path string, opts ...LoadOption) (TextModel, error) {
	cfg := ApplyLoadOpts(opts)
	if cfg.Backend != "" {
		b, ok := Get(cfg.Backend)
		if !ok {
			return nil, fmt.Errorf("inference: backend %q not registered", cfg.Backend)
		}
		if !b.Available() {
			return nil, fmt.Errorf("inference: backend %q not available on this hardware", cfg.Backend)
		}
		return b.LoadModel(path, opts...)
	}
	b, err := Default()
	if err != nil {
		return nil, err
	}
	return b.LoadModel(path, opts...)
}

options.go Normal file
@@ -0,0 +1,98 @@
package inference

// GenerateConfig holds generation parameters.
type GenerateConfig struct {
	MaxTokens     int
	Temperature   float32
	TopK          int
	TopP          float32
	StopTokens    []int32
	RepeatPenalty float32
}

// DefaultGenerateConfig returns sensible defaults.
func DefaultGenerateConfig() GenerateConfig {
	return GenerateConfig{
		MaxTokens:   256,
		Temperature: 0.0, // greedy
	}
}

// GenerateOption configures text generation.
type GenerateOption func(*GenerateConfig)

// WithMaxTokens sets the maximum number of tokens to generate.
func WithMaxTokens(n int) GenerateOption {
	return func(c *GenerateConfig) { c.MaxTokens = n }
}

// WithTemperature sets the sampling temperature. 0 = greedy.
func WithTemperature(t float32) GenerateOption {
	return func(c *GenerateConfig) { c.Temperature = t }
}

// WithTopK sets top-k sampling. 0 = disabled.
func WithTopK(k int) GenerateOption {
	return func(c *GenerateConfig) { c.TopK = k }
}

// WithTopP sets the nucleus sampling threshold. 0 = disabled.
func WithTopP(p float32) GenerateOption {
	return func(c *GenerateConfig) { c.TopP = p }
}

// WithStopTokens sets token IDs that stop generation.
func WithStopTokens(ids ...int32) GenerateOption {
	return func(c *GenerateConfig) { c.StopTokens = ids }
}

// WithRepeatPenalty sets the repetition penalty. 0 = disabled, 1.0 = no penalty.
func WithRepeatPenalty(p float32) GenerateOption {
	return func(c *GenerateConfig) { c.RepeatPenalty = p }
}

// ApplyGenerateOpts builds a GenerateConfig from options.
func ApplyGenerateOpts(opts []GenerateOption) GenerateConfig {
	cfg := DefaultGenerateConfig()
	for _, o := range opts {
		o(&cfg)
	}
	return cfg
}

// LoadConfig holds model loading parameters.
type LoadConfig struct {
	Backend    string // "metal", "rocm", "llama_cpp" (empty = auto-detect)
	ContextLen int    // Context window size (0 = model default)
	GPULayers  int    // Number of layers to offload to GPU (-1 = all, 0 = none)
}

// LoadOption configures model loading.
type LoadOption func(*LoadConfig)

// WithBackend selects a specific inference backend by name.
func WithBackend(name string) LoadOption {
	return func(c *LoadConfig) { c.Backend = name }
}

// WithContextLen sets the context window size.
func WithContextLen(n int) LoadOption {
	return func(c *LoadConfig) { c.ContextLen = n }
}

// WithGPULayers sets how many layers to offload to GPU.
// -1 means all layers (full GPU offload).
func WithGPULayers(n int) LoadOption {
	return func(c *LoadConfig) { c.GPULayers = n }
}

// ApplyLoadOpts builds a LoadConfig from options.
func ApplyLoadOpts(opts []LoadOption) LoadConfig {
	cfg := LoadConfig{
		GPULayers: -1, // default: full GPU offload
	}
	for _, o := range opts {
		o(&cfg)
	}
	return cfg
}