MLX Backend

Native Go+Metal inference — no Python, no external processes.

Architecture

Go binary → CGo → mlx-c v0.4.1 → Metal GPU

Status: Working (2026-02-16). 23/23 probes PASS, stable at ~7.2GB RSS across 18+ sequential requests.

Packages

pkg/mlx/           CGo wrapper for mlx-c (array, ops, fast, stream, slice, io, nn, compile, random)
pkg/mlx/cache/     KV cache (KVCache + RotatingKVCache)
pkg/mlx/model/     Gemma 3 text model architecture
pkg/mlx/tokenizer/ Pure Go sentencepiece/BPE tokenizer
pkg/mlx/sample/    Token sampling (greedy, temp, topK, topP, minP)
pkg/ml/            Backend interface + MLXBackend implementation

Building

# On M3 (macOS arm64):
cd pkg/mlx && mkdir -p build && cd build && cmake .. && cmake --build . -j && cmake --install .
sed -i "" "/core-ide/d" go.work   # drop the core-ide module from the workspace
CGO_ENABLED=1 go build -tags mlx -o core-mlx .

# On Linux: builds without MLX, Available()=false, other backends work
go build -o core .

Running

DYLD_LIBRARY_PATH=pkg/mlx/dist/lib ./core-mlx ml serve \
  --model-path /Volumes/Data/lem/LEM-gemma-3-12b-15k \
  --bind 0.0.0.0:8090

Default port: 8090, OpenAI-compatible API.

Memory Management

Critical lessons learned through debugging:

  1. Do NOT track inter-array references in Go — MLX-C has its own refcounting. Original code had desc.inputs chains preventing GC from collecting intermediates. After 3-4 requests: 170GB+ Metal memory → macOS kills process. Fix: removed Go-side ref tracking entirely.

  2. Force runtime.GC() every 4 tokens — Go GC can't see Metal/C memory pressure.

  3. mlx.ClearCache() releases Metal allocator cache — call after GC.

  4. SetMemoryLimit(24GB) + SetCacheLimit(16GB) — hard Metal ceilings.

  5. checkMemory() after each request — warns if active > 3x model size.

Quantized Inference

  • Mode: "affine" (NOT "default" — that crashes mlx-c)
  • Weight triplets: .weight (uint32 packed), .scales (bf16), .biases (bf16)
  • group_size=64, bits=4 — standard MLX community format
  • head_dim: Infer from q_proj weight shape, NOT from hidden_size/num_heads (Gemma3 uses 256, not 240)
  • SDPA mask modes: "" (no mask), "causal", "array" — NOT "none"
  • Logits: model.Forward returns [B,L,V], must slice to last position [B,V] before sampling

Branch and Workspace

  • Branch: feat/mlx-backend on host-uk/core
  • Binary: ~/Code/host-uk/packages/core/core-mlx (105MB)
  • mlx-c libs: pkg/mlx/dist/lib/ (libmlx.dylib, libmlxc.dylib)
  • Model tested: LEM-gemma-3-12b-15k (Gemma 3 12B, 4-bit quantized, 48 layers)

Key Commits

Commit Description
c8e6691 Support quantized inference (4-bit) for Gemma 3
70c3213 Use affine quantization mode and infer head_dim from weights
b6fbb88 Correct SDPA mask mode and slice logits to last position
478bbdd Add Metal cache management to prevent memory growth
a27a31f Add GC-based memory management (finalizers)
298c8d9 Remove Go-side array ref tracking, rely on MLX-C refcounting
d2bb19c Add Metal memory budget monitoring after each request