# FINDINGS.md — go-inference Research & Discovery


## 2026-02-19: Package Creation (Virgil)

### Motivation

go-mlx (darwin/arm64) and go-rocm (linux/amd64) both need to implement the same TextModel interface, but go-rocm can't import go-mlx, because go-mlx carries a platform-specific CGO dependency. A shared interface package solves this.

### Alternatives Considered

1. Duplicate interfaces — each backend defines its own TextModel. Simple, but the copies diverge over time as the backends evolve independently. Rejected.
2. Shared interface package (chosen) — core/go-inference defines the contract. ~100 LOC, zero deps, compiles everywhere.
3. Define in go-ml — go-ml already has Backend/StreamingBackend, but it also has heavy deps (DuckDB, Parquet) that backends shouldn't import. Rejected.

### Interface Design Decisions

- context.Context on Generate/Chat: required for HTTP handler cancellation, timeouts, and graceful shutdown. go-ml's current backend_mlx.go already uses ctx.
- Err() error on TextModel: iter.Seq can't carry errors, so consumers check Err() after the iterator stops. The pattern matches database/sql's Row.Err().
- Chat() on TextModel: models own their chat templates (Gemma3, Qwen3, and Llama3 all use different formats). Keeping templates in consumers would mean every consumer duplicates model-specific formatting.
- Available() bool on Backend: needed so Default() can skip unavailable backends (e.g. ROCm registered but no GPU present).
- GPULayers in LoadConfig: ROCm/llama.cpp support partial GPU offload; Metal always does full offload. The default of -1 means all layers.
- RepeatPenalty in GenerateConfig: llama.cpp backends use this heavily; Metal backends can ignore it.
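
To make these decisions concrete, here is a minimal sketch of how the contract could fit together in Go. Only the names called out above (TextModel, Backend, Generate, Chat, Err, Close, Available, Default, LoadModel, WithMaxTokens, GPULayers, RepeatPenalty) come from these notes; the package name, the Token and Message shapes, Backend.Load, and the Register helper are illustrative assumptions, not the published API.

```go
// Package inference: illustrative sketch only; shapes and helper names are assumed.
package inference

import (
	"context"
	"errors"
	"iter"
)

// Token is a single generated token (assumed shape).
type Token struct {
	Text string
}

// Message is one chat turn (assumed shape).
type Message struct {
	Role    string // "system", "user", "assistant"
	Content string
}

// TextModel streams tokens. iter.Seq can't carry errors, so callers
// check Err after the iterator stops (like database/sql's Row.Err).
type TextModel interface {
	Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
	Chat(ctx context.Context, msgs []Message, opts ...GenerateOption) iter.Seq[Token]
	Err() error
	Close() error
}

// Backend is a named engine. Available reports whether it can run on this
// machine, so Default can skip backends that are registered but unusable.
type Backend interface {
	Name() string
	Available() bool
	Load(path string, cfg LoadConfig) (TextModel, error)
}

// LoadConfig: GPULayers matters for ROCm/llama.cpp partial offload;
// Metal always does full offload and can ignore it.
type LoadConfig struct {
	GPULayers int // -1 = offload all layers
}

// GenerateConfig is what the functional options mutate.
type GenerateConfig struct {
	MaxTokens     int
	RepeatPenalty float64 // used by llama.cpp backends, ignored on Metal
}

type GenerateOption func(*GenerateConfig)

func WithMaxTokens(n int) GenerateOption {
	return func(c *GenerateConfig) { c.MaxTokens = n }
}

var backends []Backend

// Register adds a backend; platform-specific packages would call this from init.
func Register(b Backend) { backends = append(backends, b) }

// Default returns the first registered backend that is actually available.
func Default() (Backend, error) {
	for _, b := range backends {
		if b.Available() {
			return b, nil
		}
	}
	return nil, errors.New("inference: no available backend")
}

// LoadModel auto-selects a backend and loads the model at path.
func LoadModel(path string, cfg LoadConfig) (TextModel, error) {
	b, err := Default()
	if err != nil {
		return nil, err
	}
	return b.Load(path, cfg)
}
```

Keeping GenerateConfig a plain struct lets each backend read only the knobs it understands (RepeatPenalty for llama.cpp, nothing extra for Metal) and ignore the rest.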

### Consumer Mapping

| Consumer | What it imports          | How it uses TextModel                                              |
|----------|--------------------------|--------------------------------------------------------------------|
| go-ml    | go-inference             | Wraps TextModel into its own Backend interface, adds scoring       |
| go-ai    | go-inference (via go-ml) | Exposes via MCP tools                                               |
| go-i18n  | go-inference             | Direct: LoadModel → Generate(WithMaxTokens(1)) for classification  |
| LEM Lab  | go-inference (via go-ml) | Chat streaming for web UI                                           |
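
As a usage illustration of the go-i18n row, a single-token classification call might look roughly like this; the import path, model file, prompt, and timeout are assumptions layered on the sketch above, not code from go-i18n.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"
	"time"

	inference "core/go-inference" // assumed import path
)

func main() {
	// LoadModel auto-selects the best available backend (MLX on darwin/arm64,
	// ROCm on linux/amd64) per the design notes above.
	model, err := inference.LoadModel("models/gemma3.gguf", inference.LoadConfig{GPULayers: -1})
	if err != nil {
		log.Fatal(err)
	}
	defer model.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// WithMaxTokens(1): the first (and only) generated token is the label.
	var label strings.Builder
	prompt := "Classify the language of: \"bonjour\"\nAnswer with one word:"
	for tok := range model.Generate(ctx, prompt, inference.WithMaxTokens(1)) {
		label.WriteString(tok.Text)
	}
	// iter.Seq can't carry an error, so check Err after the loop.
	if err := model.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("label:", strings.TrimSpace(label.String()))
}
```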