Virgil edited this page 2026-03-11 12:02:35 +00:00

go-inference

Module: forge.lthn.ai/core/go-inference

Shared interfaces for text generation backends. This package defines the contract between GPU-specific backends (go-mlx for Metal, go-rocm for AMD ROCm) and consumers (go-ml, go-ai, go-i18n). It has zero heavy dependencies and compiles on all platforms.

Architecture

Single-package design with four files:

File          Purpose
inference.go  TextModel and Backend interfaces, backend registry, LoadModel() entry point
options.go    GenerateConfig, LoadConfig, functional options (WithMaxTokens, WithBackend, etc.)
training.go   TrainableModel, LoRAConfig, Adapter interfaces, LoadTrainable()
discover.go   Discover() scans directories for model files (config.json + *.safetensors)

Backends register via init() with build tags. Platform preference order: metal > rocm > llama_cpp.
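
A minimal sketch of how such a registry might resolve a default backend, assuming a map keyed by backend name and a fixed preference slice. The names registry, preference, Register, and pick are illustrative, not the package's actual internals, and LoadModel() is left off the interface to keep the example short.

```go
package main

import "fmt"

// Backend is reduced to the two methods this sketch needs.
type Backend interface {
	Name() string
	Available() bool
}

// registry and preference are hypothetical internals.
var registry = map[string]Backend{}

var preference = []string{"metal", "rocm", "llama_cpp"}

// Register adds a backend. A real backend would call this from init()
// in a file guarded by a build tag.
func Register(b Backend) { registry[b.Name()] = b }

// pick returns the first backend in preference order that is both
// registered and reports itself as available.
func pick() (Backend, error) {
	for _, name := range preference {
		if b, ok := registry[name]; ok && b.Available() {
			return b, nil
		}
	}
	return nil, fmt.Errorf("no inference backend available")
}

// stub is a stand-in backend for the demonstration.
type stub struct {
	name string
	ok   bool
}

func (s stub) Name() string    { return s.name }
func (s stub) Available() bool { return s.ok }

func main() {
	Register(stub{name: "llama_cpp", ok: true})
	Register(stub{name: "rocm", ok: true})
	Register(stub{name: "metal", ok: false}) // compiled in, but unusable here

	b, err := pick()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(b.Name()) // prints "rocm"
}
```

Checking Available() at selection time, rather than at registration, lets a backend be compiled in but skipped when the hardware is absent.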

Key Types

Interfaces

  • TextModel — Core inference interface: Generate(), Chat(), Classify(), BatchGenerate(), ModelType(), Info(), Metrics(), Err(), Close()
  • Backend — Named inference engine: Name(), LoadModel(), Available()
  • TrainableModel — Extends TextModel with LoRA fine-tuning: ApplyLoRA(), Encode(), Decode(), NumLayers()
  • Adapter — Trainable LoRA parameters: TotalParams(), Save()
  • AttentionInspector — Optional interface for Q/K Bone Orientation analysis: InspectAttention()
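
An optional interface such as AttentionInspector is typically discovered with a type assertion on the loaded model. The sketch below uses stand-in types; the InspectAttention signature shown is an assumption made for illustration and will differ from the real one.

```go
package main

import "fmt"

// TextModel is cut down to a single method for this sketch.
type TextModel interface {
	ModelType() string
}

// AttentionInspector uses an assumed, simplified signature.
type AttentionInspector interface {
	InspectAttention(prompt string) (string, error)
}

// plainModel implements only TextModel.
type plainModel struct{}

func (plainModel) ModelType() string { return "plain" }

// inspectableModel implements TextModel and the optional interface.
type inspectableModel struct{}

func (inspectableModel) ModelType() string { return "inspectable" }

func (inspectableModel) InspectAttention(prompt string) (string, error) {
	return "snapshot for " + prompt, nil
}

// describe checks for the optional capability before using it.
func describe(m TextModel) string {
	insp, ok := m.(AttentionInspector)
	if !ok {
		return m.ModelType() + ": no attention inspection"
	}
	snap, err := insp.InspectAttention("hi")
	if err != nil {
		return m.ModelType() + ": " + err.Error()
	}
	return m.ModelType() + ": " + snap
}

func main() {
	fmt.Println(describe(plainModel{}))       // plain: no attention inspection
	fmt.Println(describe(inspectableModel{})) // inspectable: snapshot for hi
}
```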

Structs

  • Token — Single generated token: ID int32, Text string
  • Message — Chat message: Role, Content
  • ClassifyResult — Batch classification output: Token, Logits
  • BatchResult — Batch generation output: Tokens, Err
  • GenerateMetrics — Performance metrics: token counts, timing, throughput, GPU memory
  • ModelInfo — Model metadata: architecture, vocab size, layers, hidden size, quantisation
  • AttentionSnapshot — Q/K vectors from KV cache for attention analysis
  • LoRAConfig — LoRA parameters: Rank, Alpha, TargetKeys, BFloat16
  • DiscoveredModel — Model directory metadata from Discover()
  • GenerateConfig — Generation parameters: MaxTokens, Temperature, TopK, TopP, RepeatPenalty, StopTokens
  • LoadConfig — Model loading parameters: Backend, ContextLen, GPULayers, ParallelSlots, AdapterPath
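
Options such as WithMaxTokens and WithTemperature follow the functional-options pattern over GenerateConfig. The following is a sketch of that pattern only: the GenerateOption type name, the applyOptions helper, the field types, and the default values are assumptions, not the package's actual definitions.

```go
package main

import "fmt"

// GenerateConfig carries two of the fields listed above; types are assumed.
type GenerateConfig struct {
	MaxTokens   int
	Temperature float32
}

// GenerateOption mutates a config in place (hypothetical type name).
type GenerateOption func(*GenerateConfig)

// WithMaxTokens caps the number of generated tokens.
func WithMaxTokens(n int) GenerateOption {
	return func(c *GenerateConfig) { c.MaxTokens = n }
}

// WithTemperature sets the sampling temperature.
func WithTemperature(t float32) GenerateOption {
	return func(c *GenerateConfig) { c.Temperature = t }
}

// applyOptions starts from placeholder defaults and applies each option in order.
func applyOptions(opts ...GenerateOption) GenerateConfig {
	cfg := GenerateConfig{MaxTokens: 256, Temperature: 0.7} // illustrative defaults
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := applyOptions(WithMaxTokens(128), WithTemperature(0))
	fmt.Println(cfg.MaxTokens, cfg.Temperature) // 128 0
}
```

Because each option is just a function, new parameters can be added later without changing the signatures of Generate() or Chat().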

Usage

import (
    "context"
    "fmt"

    "forge.lthn.ai/core/go-inference"
)

ctx := context.Background()

// Load a model (auto-detects backend)
m, err := inference.LoadModel("/path/to/model/")
if err != nil {
    return err // check before deferring Close: m is not usable on failure
}
defer m.Close()

// Stream tokens
for tok := range m.Generate(ctx, "prompt", inference.WithMaxTokens(128)) {
    fmt.Print(tok.Text)
}

// Multi-turn chat
for tok := range m.Chat(ctx, []inference.Message{
    {Role: "user", Content: "Hello"},
}, inference.WithMaxTokens(64)) {
    fmt.Print(tok.Text)
}

// Batch classification (single forward pass per prompt)
results, err := m.Classify(ctx, prompts, inference.WithTemperature(0))
if err != nil {
    return err
}

// Load trainable model for fine-tuning
tm, err := inference.LoadTrainable("/path/to/model/")
if err != nil {
    return err
}
adapter := tm.ApplyLoRA(inference.DefaultLoRAConfig())

Dependencies

  • Zero core ecosystem dependencies — only github.com/stretchr/testify for tests
  • Go 1.26+ (uses iter.Seq, maps, slices)