
go-mlx

Native Apple Metal GPU inference via mlx-c CGO bindings, implementing the inference.Backend and inference.TextModel interfaces from go-inference on Apple Silicon (M1-M4). Supports the Gemma 3, Gemma 4 (dense and MoE), Qwen 2/3, and Llama 3 architectures, loaded from HuggingFace safetensors directories or GGUF checkpoints, with fused Metal kernels for RMSNorm, RoPE, and scaled dot-product attention, plus KV cache management, LoRA fine-tuning with AdamW, and batch inference.

The root package also exposes an RFC-style direct model API (mlx.LoadModel, model.Generate, model.GenerateStream) and a non-LLM frame-compute API (mlx.NewSession, Session.BeginFrame, Session.FinishFrame, PixelBuffer, and the KernelRGB565ToRGBA8, KernelNearestScale, KernelScanlineFilter, KernelCRTFilter, KernelSoftenFilter, and KernelSharpenFilter kernels) for Apple GPU-accelerated image and emulator workloads. A Python subprocess backend (mlxlm) is provided as a CGO-free alternative.

Platform-restricted: darwin/arm64 only; a no-op stub compiles on all other platforms.

Module: dappco.re/go/mlx Licence: EUPL-1.2 Language: Go 1.26

Quick Start

import (
    "context"
    "fmt"

    "dappco.re/go/inference"
    _ "dappco.re/go/mlx"  // registers "metal" backend via init()
)

model, err := inference.LoadModel("/Volumes/Data/lem/safetensors/gemma-3-1b/")
if err != nil {
    panic(err)
}
defer model.Close()

for tok := range model.Generate(context.Background(), "Hello", inference.WithMaxTokens(256)) {
    fmt.Print(tok.Text)
}
if err := model.Err(); err != nil {
    panic(err)
}

Root API

import (
    "fmt"

    mlx "dappco.re/go/mlx"
)

model, err := mlx.LoadModel("/path/to/model",
    mlx.WithContextLength(8192),
    mlx.WithQuantization(4),
    mlx.WithDevice("gpu"),
)
if err != nil {
    panic(err)
}
defer model.Close()

reply, err := model.Generate("Explain Gemma 4 shared KV layers", mlx.WithMaxTokens(128))
if err != nil {
    panic(err)
}
fmt.Println(reply)

Frame Compute

import mlx "dappco.re/go/mlx"

session, err := mlx.NewSession(mlx.WithSessionLabel("frame-pipeline"))
if err != nil {
    panic(err)
}
defer session.Close()

src, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
    Width:  320,
    Height: 224,
    Stride: 640,
    Format: mlx.PixelRGB565,
})
if err != nil {
    panic(err)
}
rgba, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
    Width:  320,
    Height: 224,
    Stride: 1280,
    Format: mlx.PixelRGBA8,
})
if err != nil {
    panic(err)
}
scaled, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
    Width:  960,
    Height: 672,
    Stride: 3840,
    Format: mlx.PixelRGBA8,
})
if err != nil {
    panic(err)
}

frameBytes := make([]byte, src.Descriptor().SizeBytes())
if err := src.Upload(frameBytes); err != nil {
    panic(err)
}
if err := session.BeginFrame(); err != nil {
    panic(err)
}
if err := session.Run(mlx.KernelRGB565ToRGBA8, mlx.KernelArgs{
    Inputs:  map[string]mlx.Buffer{"src": src},
    Outputs: map[string]mlx.Buffer{"dst": rgba},
}); err != nil {
    panic(err)
}
if err := session.Run(mlx.KernelNearestScale, mlx.KernelArgs{
    Inputs:  map[string]mlx.Buffer{"src": rgba},
    Outputs: map[string]mlx.Buffer{"dst": scaled},
}); err != nil {
    panic(err)
}
if err := session.Run(mlx.KernelScanlineFilter, mlx.KernelArgs{
    Inputs:  map[string]mlx.Buffer{"src": scaled},
    Outputs: map[string]mlx.Buffer{"dst": scaled},
    Scalars: map[string]float64{"strength": 0.3},
}); err != nil {
    panic(err)
}
frameMetrics, err := session.FinishFrame()
if err != nil {
    panic(err)
}

finalFrame, err := scaled.Read()
if err != nil {
    panic(err)
}
_ = finalFrame
_ = frameMetrics
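
For reference, the pixel-format expansion that KernelRGB565ToRGBA8 performs on the GPU can be sketched on the CPU as follows. This is a plain-Go illustration of the RGB565 → RGBA8 conversion itself, not the kernel's actual Metal implementation:

```go
package main

import "fmt"

// rgb565ToRGBA8 expands one RGB565 pixel (R:5, G:6, B:5 bits) to four
// RGBA8 bytes. The high bits of each channel are replicated into the low
// bits so that full-scale values map to 0xFF rather than 0xF8/0xFC.
func rgb565ToRGBA8(p uint16) [4]byte {
	r5 := byte(p >> 11 & 0x1F)
	g6 := byte(p >> 5 & 0x3F)
	b5 := byte(p & 0x1F)
	return [4]byte{
		r5<<3 | r5>>2, // 5 -> 8 bits
		g6<<2 | g6>>4, // 6 -> 8 bits
		b5<<3 | b5>>2, // 5 -> 8 bits
		0xFF,          // opaque alpha
	}
}

func main() {
	// 0xF800 = pure red, 0x07E0 = pure green, 0x001F = pure blue.
	for _, p := range []uint16{0xF800, 0x07E0, 0x001F, 0xFFFF} {
		fmt.Println(rgb565ToRGBA8(p))
	}
}
```

The GPU kernel applies the same per-pixel mapping across the whole buffer, honouring the source and destination strides from the PixelBufferDesc.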

Research-Grade Pipeline

go-mlx is positioned as a Go-native, research-grade model runner, not just an inference engine. The root package exposes the full training and operations pipeline so harnesses can stop reaching for Python mlx-lm:

| Feature | Function | What it does |
| --- | --- | --- |
| LoRA fine-tuning | mlx.ApplyLoRA + mlx.NewAdamW | Low-rank adaptation training with AdamW, mixed precision, gradient checkpointing |
| LoRA fusion | mlx.FuseLoRAIntoModelPack(ctx, opts) | Bake a trained LoRA adapter into the base model as a fresh safetensors pack |
| Knowledge distillation | mlx.RunKnowledgeDistillation(ctx, runner, dataset, cfg) | KL or soft-CE loss against a teacher's logits, with checkpoint resumption |
| GRPO | mlx.RunGRPOReasoningTraining(ctx, runner, dataset, cfg) | Group-relative policy optimisation with reward functions and reference KL |
| Eval | mlx.RunModelEval(ctx, model, dataset, cfg) | Dataset-native perplexity plus pluggable quality probes |
| Model merge | mlx.MergeModelPacks(ctx, opts) | Linear / SLERP / TIES / DARE merging of multiple model packs with provenance |
| GGUF quantise | mlx.QuantizeModelPackToGGUF(ctx, opts) | Native Go safetensors → GGUF Q8_0 / Q4_0 / Q4_K_M |
| KV snapshot | snapshot.Save(path) / mlx.LoadKVSnapshot(path) | Portable binary KV cache (Float32 or Q8 symmetric int8) for session restore |
| HF fit | mlx.PlanHFModelFits(ctx, cfg) | HuggingFace Hub metadata search to plan what fits on local hardware |
| Attention probe | inference.AttentionInspector adapter | Extract post-RoPE K vectors per head per layer for analysis |

See docs/ and examples/ for the full surface.
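
As an illustration of the Q8 symmetric int8 scheme the KV-snapshot format mentions, per-block quantisation can be sketched in plain Go. This is a hypothetical CPU reference for the general technique (one scale per block, values mapped into [-127, 127]); the actual on-disk snapshot layout and block size are defined by the package, not by this sketch:

```go
package main

import (
	"fmt"
	"math"
)

// quantizeQ8 performs symmetric int8 quantisation of one block: the scale
// is amax/127, so the largest-magnitude value maps to ±127 and zero stays
// exactly zero (no zero-point, hence "symmetric").
func quantizeQ8(vals []float32) (scale float32, q []int8) {
	var amax float32
	for _, v := range vals {
		if a := float32(math.Abs(float64(v))); a > amax {
			amax = a
		}
	}
	q = make([]int8, len(vals))
	if amax == 0 {
		return 0, q // all-zero block
	}
	scale = amax / 127
	for i, v := range vals {
		q[i] = int8(math.Round(float64(v / scale)))
	}
	return scale, q
}

// dequantizeQ8 reverses the mapping: v ≈ scale * q.
func dequantizeQ8(scale float32, q []int8) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = scale * float32(v)
	}
	return out
}

func main() {
	scale, q := quantizeQ8([]float32{0.5, -1.0, 0.25, 0.0})
	fmt.Println(scale, q, dequantizeQ8(scale, q))
}
```

The scheme trades a small reconstruction error (at most half a quantisation step per value) for a roughly 4x size reduction versus Float32, which is why it suits portable KV cache snapshots.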

Documentation

  • Compute Guide — frame-oriented Metal compute sessions, pixel buffers, kernels, metrics
  • Architecture — CGO binding, model architectures, weight loading, KV cache, attention, batch inference, LoRA training, mlxlm backend
  • Models — model loading, supported architectures, tokenisation, chat templates
  • Training — LoRA fine-tuning, AdamW, gradient computation, checkpoints, fusion
  • Distillation — knowledge distillation (KL, soft cross-entropy)
  • GRPO — group-relative policy optimisation for RL
  • Eval — dataset-native perplexity, quality probes, eval reports
  • Model Operations — merge, GGUF quantise, KV snapshot, HF fit
  • Development Guide — prerequisites (mlx-c CMake build), CGO flags, test patterns, benchmarks
  • Project History — completed phases, commit hashes, known limitations
  • Examples — runnable usage examples organised by type

Build & Test

git submodule update --init --recursive
go generate ./...        # builds mlx-c C library (required first time)
go test ./...
go build ./...

Licence

European Union Public Licence 1.2 — see LICENCE for details.