Inference Architecture
Claude edited this page 2026-02-19 20:05:22 +00:00
How the inference layer fits together across platforms.
Layer Diagram
┌─────────────────────────────────────────────────┐
│ Consumers │
│ go-ai (MCP tools) go-i18n (classification) │
│ LEM Lab (chat UI) BugSETI (code analysis) │
├─────────────────────────────────────────────────┤
│ go-ml (scoring engine, backend adapters) │
│ backend_mlx.go backend_rocm.go backend_http │
├─────────────────────────────────────────────────┤
│ go-inference (shared interfaces) │
│ TextModel Backend Token Message Options │
├──────────────────────┬──────────────────────────┤
│ go-mlx │ go-rocm │
│ darwin/arm64 │ linux/amd64 │
│ Metal CGO (mlx-c) │ llama-server (HIP) │
│ Safetensors │ GGUF │
├──────────────────────┼──────────────────────────┤
│ Apple Metal GPU │ AMD ROCm GPU │
│ M3 Ultra (96GB) │ RX 7800 XT (16GB) │
└──────────────────────┴──────────────────────────┘
Backend Registration
Both GPU backends auto-register via build-tagged init():
// go-mlx/register_metal.go (//go:build darwin && arm64)
func init() { inference.Register(metal.NewBackend()) }
// go-rocm/register_rocm.go (//go:build linux && amd64)
func init() { inference.Register(&rocmBackend{}) }
Consumers import the backend they need (or both — only the platform-appropriate one activates):
import (
	"forge.lthn.ai/core/go-inference"
	_ "forge.lthn.ai/core/go-mlx"  // Metal on macOS
	_ "forge.lthn.ai/core/go-rocm" // ROCm on Linux
)
// Auto-selects the available backend
m, err := inference.LoadModel("/path/to/model/")
TextModel Interface
type TextModel interface {
	Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
	Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
	ModelType() string
	Err() error
	Close() error
}
Key design decisions (Virgil, 19 Feb 2026):
- `context.Context` on all methods — required for HTTP cancellation and graceful shutdown
- `iter.Seq[Token]` for streaming — Go 1.23+ range-over-func, zero allocation
- `Err()` for post-generation error checking — iter.Seq can't carry errors
- `Chat()` with model-owned templates — each architecture knows its prompt format
- `Close()` for deterministic GPU memory release — don't rely on GC
Model Formats
| Format | Used By | Strengths |
|---|---|---|
| Safetensors | go-mlx (Metal) | Native MLX format, direct memory mapping |
| GGUF | go-rocm (llama.cpp) | 50+ architectures, quantisation built-in, single file |
Models on macOS: /Volumes/Data/lem/safetensors/
Models on Linux: /data/lem/gguf/ (SMB mount from M3 at //10.69.69.108/Data)
Available GGUF Models (Charon, 19 Feb 2026)
| Model | Size | Fits 16GB VRAM? |
|---|---|---|
| LEK-Gemma3-1B-layered-v2-Q5_K_M | ~0.9G | Yes |
| LEK-Gemma3-1B-layered-v2-Q8_0 | ~1.4G | Yes |
| LEK-Gemma3-4B-Q4_K_M | 2.7G | Yes (baseline: 109 tok/s) |
| LEK-Gemma3-12B-Q4_K_M | ~7.5G | Yes |
| LEK-Gemma3-27B-Q4_K_M | ~16G | Tight |
| LEK-Llama-3.1-8B-Q4_K_M | ~5G | Yes |
| LEK-Mistral-7B-v0.3-Q4_K_M | ~4G | Yes |
| LEK-Qwen-2.5-7B-Q4_K_M | ~4G | Yes |
Note: Models are on SMB mount. For CI/testing, copy locally or use t.Skip() when mount unavailable.