
go-mlx

Native Apple Metal GPU inference via mlx-c CGO bindings, implementing the inference.Backend and inference.TextModel interfaces from go-inference on Apple Silicon (M1-M4). Supports the Gemma 3, Gemma 4 (dense and MoE), Qwen 2/3, and Llama 3 architectures, loaded from HuggingFace safetensors directories or GGUF checkpoints, with fused Metal kernels for RMSNorm, RoPE, and scaled dot-product attention, plus KV cache management, LoRA fine-tuning with AdamW, and batch inference.

The root package also exposes an RFC-style direct model API (mlx.LoadModel, model.Generate, model.GenerateStream) and a non-LLM frame-compute API (mlx.NewSession, Session.BeginFrame, Session.FinishFrame, PixelBuffer, and the KernelRGB565ToRGBA8, KernelNearestScale, KernelScanlineFilter, KernelCRTFilter, KernelSoftenFilter, and KernelSharpenFilter kernels) for Apple GPU-accelerated image and emulator workloads. A Python subprocess backend (mlxlm) is provided as a CGO-free alternative.

Platform-restricted: darwin/arm64 only; a no-op stub compiles on all other platforms.

Module: dappco.re/go/mlx Licence: EUPL-1.2 Language: Go 1.26

Quick Start

import (
    "context"
    "fmt"

    "dappco.re/go/inference"
    _ "dappco.re/go/mlx"  // registers "metal" backend via init()
)

model, err := inference.LoadModel("/Volumes/Data/lem/safetensors/gemma-3-1b/")
if err != nil {
    panic(err)
}
defer model.Close()

for tok := range model.Generate(context.Background(), "Hello", inference.WithMaxTokens(256)) {
    fmt.Print(tok.Text)
}
if err := model.Err(); err != nil {
    panic(err)
}

Root API

import (
    "fmt"

    mlx "dappco.re/go/mlx"
)

model, err := mlx.LoadModel("/path/to/model",
    mlx.WithContextLength(8192),
    mlx.WithQuantization(4),
    mlx.WithDevice("gpu"),
)
if err != nil {
    panic(err)
}
defer model.Close()

reply, err := model.Generate("Explain Gemma 4 shared KV layers", mlx.WithMaxTokens(128))
if err != nil {
    panic(err)
}
fmt.Println(reply)

Frame Compute

import mlx "dappco.re/go/mlx"

session, err := mlx.NewSession(mlx.WithSessionLabel("frame-pipeline"))
if err != nil {
    panic(err)
}
defer session.Close()

src, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
    Width:  320,
    Height: 224,
    Stride: 640,
    Format: mlx.PixelRGB565,
})
if err != nil {
    panic(err)
}
rgba, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
    Width:  320,
    Height: 224,
    Stride: 1280,
    Format: mlx.PixelRGBA8,
})
if err != nil {
    panic(err)
}
scaled, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
    Width:  960,
    Height: 672,
    Stride: 3840,
    Format: mlx.PixelRGBA8,
})
if err != nil {
    panic(err)
}

frameBytes := make([]byte, src.Descriptor().SizeBytes())
if err := src.Upload(frameBytes); err != nil {
    panic(err)
}
if err := session.BeginFrame(); err != nil {
    panic(err)
}
if err := session.Run(mlx.KernelRGB565ToRGBA8, mlx.KernelArgs{
    Inputs:  map[string]mlx.Buffer{"src": src},
    Outputs: map[string]mlx.Buffer{"dst": rgba},
}); err != nil {
    panic(err)
}
if err := session.Run(mlx.KernelNearestScale, mlx.KernelArgs{
    Inputs:  map[string]mlx.Buffer{"src": rgba},
    Outputs: map[string]mlx.Buffer{"dst": scaled},
}); err != nil {
    panic(err)
}
if err := session.Run(mlx.KernelScanlineFilter, mlx.KernelArgs{
    Inputs:  map[string]mlx.Buffer{"src": scaled},
    Outputs: map[string]mlx.Buffer{"dst": scaled},
    Scalars: map[string]float64{"strength": 0.3},
}); err != nil {
    panic(err)
}
frameMetrics, err := session.FinishFrame()
if err != nil {
    panic(err)
}

finalFrame, err := scaled.Read()
if err != nil {
    panic(err)
}
_ = finalFrame
_ = frameMetrics
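
For reference, the pixel-format expansion that KernelRGB565ToRGBA8 performs on the GPU can be sketched on the CPU as follows. This is a plain-Go illustration of the RGB565 → RGBA8 conversion itself, not the kernel's actual Metal implementation:

```go
package main

import "fmt"

// rgb565ToRGBA8 expands one RGB565 pixel (R:5, G:6, B:5 bits) to four
// RGBA8 bytes. The high bits of each channel are replicated into the low
// bits so that full-scale values map to 0xFF rather than 0xF8/0xFC.
func rgb565ToRGBA8(p uint16) [4]byte {
	r5 := byte(p >> 11 & 0x1F)
	g6 := byte(p >> 5 & 0x3F)
	b5 := byte(p & 0x1F)
	return [4]byte{
		r5<<3 | r5>>2, // 5 -> 8 bits
		g6<<2 | g6>>4, // 6 -> 8 bits
		b5<<3 | b5>>2, // 5 -> 8 bits
		0xFF,          // opaque alpha
	}
}

func main() {
	// 0xF800 = pure red, 0x07E0 = pure green, 0x001F = pure blue.
	for _, p := range []uint16{0xF800, 0x07E0, 0x001F, 0xFFFF} {
		fmt.Println(rgb565ToRGBA8(p))
	}
}
```

The GPU kernel applies the same per-pixel mapping across the whole buffer, honouring the source and destination strides from the PixelBufferDesc.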

Research-Grade Pipeline

go-mlx is positioned as a Go-native, research-grade model runner, not just an inference engine. The root package exposes the full training and operations pipeline so harnesses can stop reaching for Python mlx-lm:

| Feature | Function | What it does |
| --- | --- | --- |
| LoRA fine-tuning | mlx.ApplyLoRA + mlx.NewAdamW | Low-rank adaptation training with AdamW, mixed precision, gradient checkpointing |
| LoRA fusion | mlx.FuseLoRAIntoModelPack(ctx, opts) | Bake a trained LoRA adapter into the base model as a fresh safetensors pack |
| Knowledge distillation | mlx.RunKnowledgeDistillation(ctx, runner, dataset, cfg) | KL or soft-CE loss against a teacher's logits, with checkpoint resumption |
| GRPO | mlx.RunGRPOReasoningTraining(ctx, runner, dataset, cfg) | Group-relative policy optimisation with reward functions and reference KL |
| Eval | mlx.RunModelEval(ctx, model, dataset, cfg) | Dataset-native perplexity plus pluggable quality probes |
| Model merge | mlx.MergeModelPacks(ctx, opts) | Linear / SLERP / TIES / DARE merging of multiple model packs with provenance |
| GGUF quantise | mlx.QuantizeModelPackToGGUF(ctx, opts) | Native Go safetensors → GGUF Q8_0 / Q4_0 / Q4_K_M |
| KV snapshot | snapshot.Save(path) / mlx.LoadKVSnapshot(path) | Portable binary KV cache (Float32 or Q8 symmetric int8) for session restore |
| HF fit | mlx.PlanHFModelFits(ctx, cfg) | HuggingFace Hub metadata search to plan what fits on local hardware |
| Attention probe | inference.AttentionInspector adapter | Extract post-RoPE K vectors per head per layer for analysis |

See docs/ and examples/ for the full surface.
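
As an illustration of the Q8 symmetric int8 scheme the KV-snapshot format mentions, per-block quantisation can be sketched in plain Go. This is a hypothetical CPU reference for the general technique (one scale per block, values mapped into [-127, 127]); the actual on-disk snapshot layout and block size are defined by the package, not by this sketch:

```go
package main

import (
	"fmt"
	"math"
)

// quantizeQ8 performs symmetric int8 quantisation of one block: the scale
// is amax/127, so the largest-magnitude value maps to ±127 and zero stays
// exactly zero (no zero-point, hence "symmetric").
func quantizeQ8(vals []float32) (scale float32, q []int8) {
	var amax float32
	for _, v := range vals {
		if a := float32(math.Abs(float64(v))); a > amax {
			amax = a
		}
	}
	q = make([]int8, len(vals))
	if amax == 0 {
		return 0, q // all-zero block
	}
	scale = amax / 127
	for i, v := range vals {
		q[i] = int8(math.Round(float64(v / scale)))
	}
	return scale, q
}

// dequantizeQ8 reverses the mapping: v ≈ scale * q.
func dequantizeQ8(scale float32, q []int8) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = scale * float32(v)
	}
	return out
}

func main() {
	scale, q := quantizeQ8([]float32{0.5, -1.0, 0.25, 0.0})
	fmt.Println(scale, q, dequantizeQ8(scale, q))
}
```

The scheme trades a small reconstruction error (at most half a quantisation step per value) for a roughly 4x size reduction versus Float32, which is why it suits portable KV cache snapshots.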

Documentation

  • Compute Guide — frame-oriented Metal compute sessions, pixel buffers, kernels, metrics
  • Architecture — CGO binding, model architectures, weight loading, KV cache, attention, batch inference, LoRA training, mlxlm backend
  • Models — model loading, supported architectures, tokenisation, chat templates
  • Training — LoRA fine-tuning, AdamW, gradient computation, checkpoints, fusion
  • Distillation — knowledge distillation (KL, soft cross-entropy)
  • GRPO — group-relative policy optimisation for RL
  • Eval — dataset-native perplexity, quality probes, eval reports
  • Model Operations — merge, GGUF quantise, KV snapshot, HF fit
  • Development Guide — prerequisites (mlx-c CMake build), CGO flags, test patterns, benchmarks
  • Project History — completed phases, commit hashes, known limitations
  • Examples — runnable usage examples organised by type

Build & Test

git submodule update --init --recursive
go generate ./...        # builds mlx-c C library (required first time)
go test ./...
go build ./...

Licence

European Union Public Licence 1.2 — see LICENCE for details.