Inference Architecture

Claude edited this page 2026-02-19 20:05:22 +00:00

How the inference layer fits together across platforms.

Layer Diagram

┌─────────────────────────────────────────────────┐
│  Consumers                                       │
│  go-ai (MCP tools)  go-i18n (classification)    │
│  LEM Lab (chat UI)  BugSETI (code analysis)     │
├─────────────────────────────────────────────────┤
│  go-ml (scoring engine, backend adapters)        │
│  backend_mlx.go  backend_rocm.go  backend_http  │
├─────────────────────────────────────────────────┤
│  go-inference (shared interfaces)                │
│  TextModel  Backend  Token  Message  Options     │
├──────────────────────┬──────────────────────────┤
│  go-mlx              │  go-rocm                  │
│  darwin/arm64        │  linux/amd64              │
│  Metal CGO (mlx-c)   │  llama-server (HIP)       │
│  Safetensors         │  GGUF                     │
├──────────────────────┼──────────────────────────┤
│  Apple Metal GPU     │  AMD ROCm GPU             │
│  M3 Ultra (96GB)     │  RX 7800 XT (16GB)        │
└──────────────────────┴──────────────────────────┘

Backend Registration

Both GPU backends auto-register via build-tagged init():

// go-mlx/register_metal.go (//go:build darwin && arm64)
func init() { inference.Register(metal.NewBackend()) }

// go-rocm/register_rocm.go (//go:build linux && amd64)
func init() { inference.Register(&rocmBackend{}) }

Consumers import the backend they need (or both — only the platform-appropriate one activates):

import (
    "forge.lthn.ai/core/go-inference"
    _ "forge.lthn.ai/core/go-mlx"  // Metal on macOS
    _ "forge.lthn.ai/core/go-rocm" // ROCm on Linux
)

// Auto-selects the available backend
m, err := inference.LoadModel("/path/to/model/")

TextModel Interface

type TextModel interface {
    Generate(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq[Token]
    Chat(ctx context.Context, messages []Message, opts ...GenerateOption) iter.Seq[Token]
    ModelType() string
    Err() error
    Close() error
}

Key design decisions (Virgil, 19 Feb 2026):

  • context.Context on all methods — required for HTTP cancellation and graceful shutdown
  • iter.Seq[Token] for streaming — Go 1.23+ range-over-func, zero allocation
  • Err() for post-generation error checking — iter.Seq can't carry errors
  • Chat() with model-owned templates — each architecture knows its prompt format
  • Close() for deterministic GPU memory release — don't rely on GC

Model Formats

| Format      | Used By             | Strengths                                             |
|-------------|---------------------|-------------------------------------------------------|
| Safetensors | go-mlx (Metal)      | Native MLX format, direct memory mapping              |
| GGUF        | go-rocm (llama.cpp) | 50+ architectures, quantisation built-in, single file |

Models on macOS: /Volumes/Data/lem/safetensors/
Models on Linux: /data/lem/gguf/ (SMB mount from the M3 at //10.69.69.108/Data)

Available GGUF Models (Charon, 19 Feb 2026)

| Model                           | Size  | Fits 16GB VRAM?           |
|---------------------------------|-------|---------------------------|
| LEK-Gemma3-1B-layered-v2-Q5_K_M | ~0.9G | Yes                       |
| LEK-Gemma3-1B-layered-v2-Q8_0   | ~1.4G | Yes                       |
| LEK-Gemma3-4B-Q4_K_M            | 2.7G  | Yes (baseline: 109 tok/s) |
| LEK-Gemma3-12B-Q4_K_M           | ~7.5G | Yes                       |
| LEK-Gemma3-27B-Q4_K_M           | ~16G  | Tight                     |
| LEK-Llama-3.1-8B-Q4_K_M         | ~5G   | Yes                       |
| LEK-Mistral-7B-v0.3-Q4_K_M      | ~4G   | Yes                       |
| LEK-Qwen-2.5-7B-Q4_K_M          | ~4G   | Yes                       |

Note: models live on an SMB mount. For CI/testing, copy them locally or use t.Skip() when the mount is unavailable.