go-inference

Module: forge.lthn.ai/core/go-inference

Shared interfaces for text generation backends. This package defines the contract between GPU-specific backends (go-mlx for Metal, go-rocm for AMD ROCm) and consumers (go-ml, go-ai, go-i18n). It has zero heavy dependencies and compiles on all platforms.

Architecture

Single-package design with four files:

| File | Purpose |
|------|---------|
| `inference.go` | `TextModel`, `Backend` interfaces, backend registry, `LoadModel()` entry point |
| `options.go` | `GenerateConfig`, `LoadConfig`, functional options (`WithMaxTokens`, `WithBackend`, etc.) |
| `training.go` | `TrainableModel`, `LoRAConfig`, `Adapter` interfaces, `LoadTrainable()` |
| `discover.go` | `Discover()` scans directories for model files (`config.json` + `*.safetensors`) |

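Each backend registers itself from an init() function guarded by build tags, so only backends supported on the target platform are compiled in. When no backend is requested explicitly, the platform preference order is metal > rocm > llama_cpp.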
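The registration and preference order described above can be sketched roughly as follows. This is an illustration under assumptions, not the package's registry: `backend`, `register`, `pickDefault`, and `stub` are hypothetical names, and the interface is trimmed to two methods. In a real backend, the `register` call would sit in an `init()` inside a file guarded by a build constraint.

```go
package main

import "fmt"

// backend is a hypothetical, trimmed-down version of the Backend interface.
type backend interface {
	Name() string
	Available() bool
}

// registry maps backend names to implementations.
var registry = map[string]backend{}

// register is what a backend package would call from its init() function.
func register(b backend) { registry[b.Name()] = b }

// preference mirrors the documented order: metal > rocm > llama_cpp.
var preference = []string{"metal", "rocm", "llama_cpp"}

// pickDefault returns the first registered and available backend,
// walking the preference list from most to least preferred.
func pickDefault() (backend, error) {
	for _, name := range preference {
		if b, ok := registry[name]; ok && b.Available() {
			return b, nil
		}
	}
	return nil, fmt.Errorf("no inference backend available")
}

// stub is a fake backend used only for this demonstration.
type stub struct {
	name string
	ok   bool
}

func (s stub) Name() string    { return s.name }
func (s stub) Available() bool { return s.ok }

func main() {
	register(stub{"llama_cpp", true})
	register(stub{"rocm", true})
	register(stub{"metal", false}) // registered but unusable on this machine
	b, err := pickDefault()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(b.Name()) // prints "rocm"
}
```

Checking `Available()` as well as registration lets a backend be compiled in yet skipped at runtime when its hardware is absent.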

Key Types

Interfaces

TextModel — Core inference interface: Generate(), Chat(), Classify(), BatchGenerate(), ModelType(), Info(), Metrics(), Err(), Close()
Backend — Named inference engine: Name(), LoadModel(), Available()
TrainableModel — Extends TextModel with LoRA fine-tuning: ApplyLoRA(), Encode(), Decode(), NumLayers()
Adapter — Trainable LoRA parameters: TotalParams(), Save()
AttentionInspector — Optional interface for Q/K Bone Orientation analysis: InspectAttention()
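Because AttentionInspector is optional, callers would typically probe for it with a type assertion. The sketch below shows that pattern with stand-in types; the `InspectAttention` signature used here is assumed, since the real one is not given in this document, and `textModel`, `plainModel`, and `inspectableModel` are hypothetical.

```go
package main

import "fmt"

// textModel is a hypothetical one-method subset of TextModel.
type textModel interface {
	ModelType() string
}

// attentionInspector stands in for AttentionInspector. The parameter and
// return types are assumptions made for this example.
type attentionInspector interface {
	InspectAttention(prompt string) (heads int, err error)
}

// plainModel implements only the core interface.
type plainModel struct{}

func (plainModel) ModelType() string { return "plain" }

// inspectableModel additionally implements the optional interface.
type inspectableModel struct{}

func (inspectableModel) ModelType() string { return "inspectable" }

func (inspectableModel) InspectAttention(prompt string) (int, error) {
	return 8, nil
}

// describe reports whether m supports attention inspection.
func describe(m textModel) string {
	insp, ok := m.(attentionInspector)
	if !ok {
		return m.ModelType() + ": attention inspection not supported"
	}
	heads, err := insp.InspectAttention("hello")
	if err != nil {
		return m.ModelType() + ": " + err.Error()
	}
	return fmt.Sprintf("%s: %d heads inspected", m.ModelType(), heads)
}

func main() {
	fmt.Println(describe(plainModel{}))
	fmt.Println(describe(inspectableModel{}))
}
```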

Structs

Token — Single generated token: ID int32, Text string
Message — Chat message: Role, Content
ClassifyResult — Batch classification output: Token, Logits
BatchResult — Batch generation output: Tokens, Err
GenerateMetrics — Performance metrics: token counts, timing, throughput, GPU memory
ModelInfo — Model metadata: architecture, vocab size, layers, hidden size, quantisation
AttentionSnapshot — Q/K vectors from KV cache for attention analysis
LoRAConfig — LoRA parameters: Rank, Alpha, TargetKeys, BFloat16
DiscoveredModel — Model directory metadata from Discover()
GenerateConfig — Generation parameters: MaxTokens, Temperature, TopK, TopP, RepeatPenalty, StopTokens
LoadConfig — Model loading parameters: Backend, ContextLen, GPULayers, ParallelSlots, AdapterPath
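Options such as WithMaxTokens follow the functional options pattern: each option is a function that mutates a config struct. The sketch below shows the general shape with lowercase, hypothetical types; the field subset and the default values are assumptions, not the package's actual defaults.

```go
package main

import "fmt"

// generateConfig is a hypothetical subset of GenerateConfig.
type generateConfig struct {
	MaxTokens   int
	Temperature float32
}

// generateOption mutates a generateConfig. The names mirror the options
// listed in this README, but the definitions here are assumptions.
type generateOption func(*generateConfig)

func withMaxTokens(n int) generateOption {
	return func(c *generateConfig) { c.MaxTokens = n }
}

func withTemperature(t float32) generateOption {
	return func(c *generateConfig) { c.Temperature = t }
}

// applyOptions starts from illustrative defaults and applies opts in order,
// so later options override earlier ones.
func applyOptions(opts ...generateOption) generateConfig {
	cfg := generateConfig{MaxTokens: 256, Temperature: 1.0}
	for _, o := range opts {
		o(&cfg)
	}
	return cfg
}

func main() {
	cfg := applyOptions(withMaxTokens(128), withTemperature(0))
	fmt.Printf("%+v\n", cfg) // prints {MaxTokens:128 Temperature:0}
}
```

The pattern lets new parameters be added without changing the signatures of functions that accept options.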

Usage

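import (
    "context"
    "fmt"
    "log"

    "forge.lthn.ai/core/go-inference"
)

ctx := context.Background()

// Load a model (auto-detects backend)
m, err := inference.LoadModel("/path/to/model/")
if err != nil {
    log.Fatal(err) // m is not usable if loading failed
}
defer m.Close()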

// Stream tokens
for tok := range m.Generate(ctx, "prompt", inference.WithMaxTokens(128)) {
    fmt.Print(tok.Text)
}

// Multi-turn chat
for tok := range m.Chat(ctx, []inference.Message{
    {Role: "user", Content: "Hello"},
}, inference.WithMaxTokens(64)) {
    fmt.Print(tok.Text)
}

// Batch classification (single forward pass per prompt)
results, err := m.Classify(ctx, prompts, inference.WithTemperature(0))
if err != nil {
    log.Fatal(err)
}

// Load trainable model for fine-tuning
tm, err := inference.LoadTrainable("/path/to/model/")
if err != nil {
    log.Fatal(err)
}
defer tm.Close()
adapter := tm.ApplyLoRA(inference.DefaultLoRAConfig())

Dependencies

Zero core ecosystem dependencies — only github.com/stretchr/testify for tests
Go 1.26+ (uses iter.Seq, maps, slices)