MLX Backend

Native Go+Metal inference — no Python, no external processes.

Architecture

Go binary → CGo → mlx-c v0.4.1 → Metal GPU

Status: Working (2026-02-16). 23/23 probes PASS, stable at ~7.2GB RSS across 18+ sequential requests.

Packages

pkg/mlx/           CGo wrapper for mlx-c (array, ops, fast, stream, slice, io, nn, compile, random)
pkg/mlx/cache/     KV cache (KVCache + RotatingKVCache)
pkg/mlx/model/     Gemma 3 text model architecture
pkg/mlx/tokenizer/ Pure Go sentencepiece/BPE tokenizer
pkg/mlx/sample/    Token sampling (greedy, temp, topK, topP, minP)
pkg/ml/            Backend interface + MLXBackend implementation

Building

# On M3 (macOS arm64):
cd pkg/mlx && mkdir -p build && cd build && cmake .. && cmake --build . -j && cmake --install .
sed -i "" "/core-ide/d" go.work   # drop the core-ide module from the workspace
CGO_ENABLED=1 go build -tags mlx -o core-mlx .

# On Linux: builds without MLX, Available()=false, other backends work
go build -o core .

Running

DYLD_LIBRARY_PATH=pkg/mlx/dist/lib ./core-mlx ml serve \
  --model-path /Volumes/Data/lem/LEM-gemma-3-12b-15k \
  --bind 0.0.0.0:8090

Default port: 8090, OpenAI-compatible API.

Memory Management

Critical lessons learned through debugging:

  1. Do NOT track inter-array references in Go — MLX-C has its own refcounting. Original code had desc.inputs chains preventing GC from collecting intermediates. After 3-4 requests: 170GB+ Metal memory → macOS kills process. Fix: removed Go-side ref tracking entirely.

  2. Force runtime.GC() every 4 tokens — Go GC can't see Metal/C memory pressure.

  3. mlx.ClearCache() releases Metal allocator cache — call after GC.

  4. SetMemoryLimit(24GB) + SetCacheLimit(16GB) — hard Metal ceilings.

  5. checkMemory() after each request — warns if active > 3x model size.

Quantized Inference

  • Mode: "affine" (NOT "default" — that crashes mlx-c)
  • Weight triplets: .weight (uint32 packed), .scales (bf16), .biases (bf16)
  • group_size=64, bits=4 — standard MLX community format
  • head_dim: Infer from q_proj weight shape, NOT from hidden_size/num_heads (Gemma3 uses 256, not 240)
  • SDPA mask modes: "" (no mask), "causal", "array" — NOT "none"
  • Logits: model.Forward returns [B,L,V], must slice to last position [B,V] before sampling

Branch and Workspace

  • Branch: feat/mlx-backend on host-uk/core
  • Binary: ~/Code/host-uk/packages/core/core-mlx (105MB)
  • mlx-c libs: pkg/mlx/dist/lib/ (libmlx.dylib, libmlxc.dylib)
  • Model tested: LEM-gemma-3-12b-15k (Gemma 3 12B, 4-bit quantized, 48 layers)

Key Commits

Commit Description
c8e6691 Support quantized inference (4-bit) for Gemma 3
70c3213 Use affine quantization mode and infer head_dim from weights
b6fbb88 Correct SDPA mask mode and slice logits to last position
478bbdd Add Metal cache management to prevent memory growth
a27a31f Add GC-based memory management (finalizers)
298c8d9 Remove Go-side array ref tracking, rely on MLX-C refcounting
d2bb19c Add Metal memory budget monitoring after each request