Home
Virgil edited this page 2026-03-11 12:02:35 +00:00
go-inference
Module: forge.lthn.ai/core/go-inference
Shared interfaces for text generation backends. This package defines the contract between GPU-specific backends (go-mlx for Metal, go-rocm for AMD ROCm) and consumers (go-ml, go-ai, go-i18n). It has zero heavy dependencies and compiles on all platforms.
Architecture
Single-package design with four files:
| File | Purpose |
|---|---|
| `inference.go` | `TextModel`, `Backend` interfaces, backend registry, `LoadModel()` entry point |
| `options.go` | `GenerateConfig`, `LoadConfig`, functional options (`WithMaxTokens`, `WithBackend`, etc.) |
| `training.go` | `TrainableModel`, `LoRAConfig`, `Adapter` interfaces, `LoadTrainable()` |
| `discover.go` | `Discover()` scans directories for model files (`config.json` + `*.safetensors`) |
Backends register via init() with build tags. Platform preference order: metal > rocm > llama_cpp.
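The registration and preference rule above can be sketched as follows. This is a self-contained illustration, not the package's actual registry: `backend`, `register`, `pick`, and `fakeBackend` are hypothetical names, and only the preference order (metal > rocm > llama_cpp) and the `Name()`/`Available()` methods come from this page.

```go
package main

import (
	"errors"
	"fmt"
)

// backend is a pared-down, hypothetical version of the Backend interface.
type backend interface {
	Name() string
	Available() bool
}

// registry maps backend names to implementations (hypothetical layout).
var registry = map[string]backend{}

// preference mirrors the documented order: metal > rocm > llama_cpp.
var preference = []string{"metal", "rocm", "llama_cpp"}

// register adds or replaces a backend under its own name.
func register(b backend) { registry[b.Name()] = b }

// pick returns the first registered and available backend in preference order.
func pick() (backend, error) {
	for _, name := range preference {
		if b, ok := registry[name]; ok && b.Available() {
			return b, nil
		}
	}
	return nil, errors.New("no inference backend available")
}

// fakeBackend stands in for a real GPU backend in this sketch.
type fakeBackend struct {
	name      string
	available bool
}

func (f fakeBackend) Name() string    { return f.name }
func (f fakeBackend) Available() bool { return f.available }

// In a real backend package, an init like this would live in a file guarded
// by a build constraint (for example //go:build darwin && arm64), so it is
// only compiled on platforms that support the backend.
func init() {
	register(fakeBackend{name: "rocm", available: true})
	register(fakeBackend{name: "llama_cpp", available: true})
}

func main() {
	b, err := pick()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("selected backend:", b.Name()) // prints "selected backend: rocm"
}
```

Because registration happens in `init()`, a consumer only needs to import a backend package for it to become selectable; no explicit wiring is required.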
Key Types
Interfaces
- `TextModel`: core inference interface: `Generate()`, `Chat()`, `Classify()`, `BatchGenerate()`, `ModelType()`, `Info()`, `Metrics()`, `Err()`, `Close()`
- `Backend`: named inference engine: `Name()`, `LoadModel()`, `Available()`
- `TrainableModel`: extends `TextModel` with LoRA fine-tuning: `ApplyLoRA()`, `Encode()`, `Decode()`, `NumLayers()`
- `Adapter`: trainable LoRA parameters: `TotalParams()`, `Save()`
- `AttentionInspector`: optional interface for Q/K Bone Orientation analysis: `InspectAttention()`
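Since `AttentionInspector` is optional, a consumer would typically probe for it with a type assertion. The sketch below shows that pattern with trimmed, hypothetical interfaces; the real `InspectAttention()` signature is not documented on this page, so the `string` return value here is an assumption made purely for illustration.

```go
package main

import "fmt"

// textModel is a hypothetical, trimmed version of TextModel.
type textModel interface {
	ModelType() string
}

// attentionInspector is a hypothetical version of the optional interface.
// The real method signature is not shown in the source.
type attentionInspector interface {
	InspectAttention() string
}

// plainModel implements only the core interface.
type plainModel struct{}

func (plainModel) ModelType() string { return "plain" }

// inspectableModel also implements the optional interface.
type inspectableModel struct{}

func (inspectableModel) ModelType() string        { return "inspectable" }
func (inspectableModel) InspectAttention() string { return "snapshot" }

// describe uses a type assertion to detect the optional capability.
func describe(m textModel) string {
	if insp, ok := m.(attentionInspector); ok {
		return m.ModelType() + ": " + insp.InspectAttention()
	}
	return m.ModelType() + ": attention inspection not supported"
}

func main() {
	fmt.Println(describe(plainModel{}))
	fmt.Println(describe(inspectableModel{}))
}
```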
Structs
- `Token`: single generated token (`ID int32`, `Text string`)
- `Message`: chat message (`Role`, `Content`)
- `ClassifyResult`: batch classification output (`Token`, `Logits`)
- `BatchResult`: batch generation output (`Tokens`, `Err`)
- `GenerateMetrics`: performance metrics (token counts, timing, throughput, GPU memory)
- `ModelInfo`: model metadata (architecture, vocab size, layers, hidden size, quantisation)
- `AttentionSnapshot`: Q/K vectors from KV cache for attention analysis
- `LoRAConfig`: LoRA parameters (`Rank`, `Alpha`, `TargetKeys`, `BFloat16`)
- `DiscoveredModel`: model directory metadata from `Discover()`
- `GenerateConfig`: generation parameters (`MaxTokens`, `Temperature`, `TopK`, `TopP`, `RepeatPenalty`, `StopTokens`)
- `LoadConfig`: model loading parameters (`Backend`, `ContextLen`, `GPULayers`, `ParallelSlots`, `AdapterPath`)
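`GenerateConfig` is populated through functional options such as `WithMaxTokens`. A minimal sketch of that pattern follows, assuming a reduced config with two fields. The lowercase names, the field types, and the default values are all assumptions for illustration and may differ from the package's own definitions.

```go
package main

import "fmt"

// generateConfig is a hypothetical subset of GenerateConfig.
// Field types are assumed, not taken from the package.
type generateConfig struct {
	MaxTokens   int
	Temperature float32
}

// generateOption mutates a generateConfig in place.
type generateOption func(*generateConfig)

// withMaxTokens caps the number of generated tokens.
func withMaxTokens(n int) generateOption {
	return func(c *generateConfig) { c.MaxTokens = n }
}

// withTemperature sets the sampling temperature.
func withTemperature(t float32) generateOption {
	return func(c *generateConfig) { c.Temperature = t }
}

// applyOptions starts from assumed defaults and applies each option in order,
// so later options override earlier ones.
func applyOptions(opts ...generateOption) generateConfig {
	cfg := generateConfig{MaxTokens: 256, Temperature: 1.0}
	for _, o := range opts {
		o(&cfg)
	}
	return cfg
}

func main() {
	cfg := applyOptions(withMaxTokens(128), withTemperature(0))
	fmt.Printf("%+v\n", cfg)
}
```

The appeal of this design is that new parameters can be added without changing the signatures of `Generate()` or `Chat()`, and callers only mention the settings they want to override.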
Usage
```go
import (
	"fmt"
	"log"

	"forge.lthn.ai/core/go-inference"
)

// ctx (a context.Context) and prompts are supplied by the caller.

// Load a model (auto-detects backend)
m, err := inference.LoadModel("/path/to/model/")
if err != nil {
	log.Fatal(err) // check before deferring Close on a nil model
}
defer m.Close()

// Stream tokens
for tok := range m.Generate(ctx, "prompt", inference.WithMaxTokens(128)) {
	fmt.Print(tok.Text)
}

// Multi-turn chat
for tok := range m.Chat(ctx, []inference.Message{
	{Role: "user", Content: "Hello"},
}, inference.WithMaxTokens(64)) {
	fmt.Print(tok.Text)
}

// Batch classification (single forward pass per prompt)
results, _ := m.Classify(ctx, prompts, inference.WithTemperature(0))

// Load trainable model for fine-tuning
tm, _ := inference.LoadTrainable("/path/to/model/")
adapter := tm.ApplyLoRA(inference.DefaultLoRAConfig())
```
Dependencies
- Zero core ecosystem dependencies: only `github.com/stretchr/testify` for tests
- Go 1.26+ (uses `iter.Seq`, `maps`, `slices`)