Architecture

Layer Diagram

Go Application
    |
    v
model/ (Gemma3, Qwen3)  <-- High-level model interface
    |
    +-- tokenizer/       <-- BPE encode/decode
    +-- sample/          <-- Temperature, top-k, top-p
    +-- cache/           <-- KV cache management
    |
    v
mlx (root package)       <-- Core ops + Array type
    |
    +-- ops.go           <-- MatMul, Softmax, Add, etc.
    +-- fast.go          <-- Fused Metal kernels (RMSNorm, RoPE, SDPA)
    +-- nn.go            <-- Linear, Embedding, RMSNorm layers
    +-- grad.go          <-- VJP gradient computation
    +-- lora.go          <-- LoRA adapter
    +-- optim.go         <-- AdamW optimiser
    |
    v
CGO Bridge (mlx.go)      <-- #cgo directives, C function calls
    |
    v
mlx-c v0.4.1            <-- C API for MLX (fetched by CMake)
    |
    v
Apple MLX                <-- Metal GPU compute shaders
    |
    v
Metal / Accelerate       <-- Apple Silicon GPU + CPU frameworks

Array Type

*mlx.Array is the fundamental data type. It wraps a C mlx_array handle and supports:

Creation from Go slices (NewArray)
GPU materialisation (Materialize, MaterializeAsync)
Arithmetic operations (element-wise Add and Multiply, matrix product MatMul)
Shape manipulation (Reshape, Transpose)
Data type casting (AsType)

Arrays are lazily evaluated: operations only build a computation graph, which executes when Materialize() (or MaterializeAsync()) is called.
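
The lazy-evaluation idea can be illustrated with a small pure-Go sketch. This is a toy model only, not the package's implementation: LazyArray, FromSlice, and opsExecuted are hypothetical names invented for the example, and the work runs on the CPU rather than being dispatched to Metal. It shows that building the graph executes nothing, and that results are cached once materialised.

```go
package main

import "fmt"

// opsExecuted counts how many operations have actually run.
var opsExecuted int

// LazyArray is a hypothetical toy type for illustration; it is NOT mlx.Array.
type LazyArray struct {
	compute func() []float32 // deferred work; nil once the value is known
	value   []float32
}

// FromSlice creates a leaf node whose value is already known.
func FromSlice(v []float32) *LazyArray {
	return &LazyArray{value: append([]float32(nil), v...)}
}

// binary records an element-wise operation without running it.
func binary(a, b *LazyArray, f func(x, y float32) float32) *LazyArray {
	return &LazyArray{compute: func() []float32 {
		x, y := a.Materialize(), b.Materialize()
		if len(x) != len(y) {
			panic(fmt.Sprintf("length mismatch: %d vs %d", len(x), len(y)))
		}
		out := make([]float32, len(x))
		for i := range x {
			out[i] = f(x[i], y[i])
		}
		opsExecuted++
		return out
	}}
}

// Add and Multiply only extend the graph.
func Add(a, b *LazyArray) *LazyArray {
	return binary(a, b, func(x, y float32) float32 { return x + y })
}

func Multiply(a, b *LazyArray) *LazyArray {
	return binary(a, b, func(x, y float32) float32 { return x * y })
}

// Materialize evaluates the node and its inputs, then caches the result.
func (a *LazyArray) Materialize() []float32 {
	if a.compute != nil {
		a.value = a.compute()
		a.compute = nil // drop the closure so the work is never repeated
	}
	return a.value
}

func main() {
	a := FromSlice([]float32{1, 2, 3})
	b := FromSlice([]float32{10, 20, 30})
	d := Multiply(Add(a, b), a) // graph only; nothing has executed yet

	fmt.Println("ops executed before Materialize:", opsExecuted) // 0
	fmt.Println("result:", d.Materialize())                      // [11 44 99]
	fmt.Println("ops executed after Materialize:", opsExecuted)  // 2
}
```

In this toy, the practical consequence is the same as the one described above: intermediate results cost nothing until something asks for a concrete value, so the point at which Materialize() is called determines when the work happens.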

Memory Model

Arrays use runtime.SetFinalizer to free the C-side mlx_array handle
There is no explicit Close() method; deallocation relies on the Go GC
Under sustained inference, handles are released only when GC pressure triggers a collection and the finalizers run
Materialize() forces synchronous GPU evaluation