# MLX Backend

Native Go+Metal inference: no Python, no external processes.
## Architecture

    Go binary → CGo → mlx-c v0.4.1 → Metal GPU
**Status:** Working (2026-02-16). 23/23 probes PASS; stable at ~7.2GB RSS across 18+ sequential requests.
## Packages

- `pkg/mlx/`: CGo wrapper for mlx-c (array, ops, fast, stream, slice, io, nn, compile, random)
- `pkg/mlx/cache/`: KV cache (KVCache + RotatingKVCache)
- `pkg/mlx/model/`: Gemma 3 text model architecture
- `pkg/mlx/tokenizer/`: pure Go SentencePiece/BPE tokenizer
- `pkg/mlx/sample/`: token sampling (greedy, temperature, topK, topP, minP)
- `pkg/ml/`: backend interface + MLXBackend implementation
## Building

    # On M3 (macOS arm64):
    cd pkg/mlx && mkdir -p build && cd build && cmake .. && cmake --build . -j && cmake --install .
    sed -i "" "/core-ide/d" go.work
    CGO_ENABLED=1 go build -tags mlx -o core-mlx .

    # On Linux: builds without MLX, Available()=false, other backends work
    go build -o core .
## Running

    DYLD_LIBRARY_PATH=pkg/mlx/dist/lib ./core-mlx ml serve \
      --model-path /Volumes/Data/lem/LEM-gemma-3-12b-15k \
      --bind 0.0.0.0:8090

Default port: 8090. The server exposes an OpenAI-compatible API.
## Memory Management

Critical lessons learned through debugging:

- Do NOT track inter-array references in Go: MLX-C has its own refcounting. The original code kept `desc.inputs` chains that prevented the GC from collecting intermediates; after 3-4 requests, Metal memory grew past 170GB and macOS killed the process. Fix: remove Go-side ref tracking entirely.
- Force `runtime.GC()` every 4 tokens: the Go GC cannot see Metal/C memory pressure.
- `mlx.ClearCache()` releases the Metal allocator cache; call it after each GC.
- `SetMemoryLimit(24GB)` + `SetCacheLimit(16GB)` set hard Metal ceilings.
- `checkMemory()` after each request warns if active memory exceeds 3x the model size.
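The per-token cadence above can be sketched as a small helper called from the decode loop. This is a pure-Go illustration, not the repo's code: `clearCache` is a caller-supplied hook standing in for `mlx.ClearCache()`.

```go
package main

import (
	"fmt"
	"runtime"
)

// maintainMemory applies the hygiene described above: every
// gcInterval tokens it forces a Go GC (the Go runtime cannot see
// Metal/C memory pressure, so collection must be forced) and then
// releases the Metal allocator cache via the supplied hook.
// It reports whether a cleanup ran on this token.
func maintainMemory(token, gcInterval int, clearCache func()) bool {
	if gcInterval <= 0 || token%gcInterval != 0 {
		return false
	}
	runtime.GC()   // drop Go-side wrappers so MLX-C refcounts fall
	clearCache()   // e.g. mlx.ClearCache(): return cached Metal buffers
	return true
}

func main() {
	cleared := 0
	for token := 1; token <= 12; token++ {
		maintainMemory(token, 4, func() { cleared++ })
	}
	fmt.Println(cleared) // cleanup ran at tokens 4, 8, 12
}
```

Keeping the cache clear behind a function value also makes the cadence testable without a Metal device.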
## Quantized Inference

- Mode: `"affine"` (NOT `"default"`, which crashes mlx-c)
- Weight triplets: `.weight` (uint32 packed), `.scales` (bf16), `.biases` (bf16); group_size=64, bits=4 is the standard MLX community format
- head_dim: infer from the q_proj weight shape, NOT from hidden_size/num_heads (Gemma 3 uses 256, not 240)
- SDPA mask modes: `""` (no mask), `"causal"`, `"array"`; NOT `"none"`
- Logits: model.Forward returns [B, L, V]; slice to the last position [B, V] before sampling
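The last bullet's logits slice can be shown on a flat buffer. A pure-Go sketch (hypothetical helper, operating on row-major [B, L, V] data; the real code slices an MLX array instead):

```go
package main

import "fmt"

// lastPosition reduces row-major [B, L, V] logits to [B, V] by
// copying out the final sequence position of each batch row, which
// is the only position the sampler consumes.
func lastPosition(logits []float32, b, l, v int) []float32 {
	out := make([]float32, b*v)
	for i := 0; i < b; i++ {
		start := i*l*v + (l-1)*v // offset of the last position in batch i
		copy(out[i*v:(i+1)*v], logits[start:start+v])
	}
	return out
}

func main() {
	// B=1, L=2, V=3: position 0 is [1 2 3], position 1 is [4 5 6].
	logits := []float32{1, 2, 3, 4, 5, 6}
	fmt.Println(lastPosition(logits, 1, 2, 3)) // [4 5 6]
}
```

Sampling from the full [B, L, V] tensor without this slice picks tokens from arbitrary positions, which is why the fix in commit b6fbb88 was needed.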
## Branch and Workspace

- Branch: `feat/mlx-backend` on `host-uk/core`
- Binary: `~/Code/host-uk/packages/core/core-mlx` (105MB)
- mlx-c libs: `pkg/mlx/dist/lib/` (libmlx.dylib, libmlxc.dylib)
- Model tested: LEM-gemma-3-12b-15k (Gemma 3 12B, 4-bit quantized, 48 layers)
## Key Commits
| Commit | Description |
|---|---|
| c8e6691 | Support quantized inference (4-bit) for Gemma 3 |
| 70c3213 | Use affine quantization mode and infer head_dim from weights |
| b6fbb88 | Correct SDPA mask mode and slice logits to last position |
| 478bbdd | Add Metal cache management to prevent memory growth |
| a27a31f | Add GC-based memory management (finalizers) |
| 298c8d9 | Remove Go-side array ref tracking, rely on MLX-C refcounting |
| d2bb19c | Add Metal memory budget monitoring after each request |