Snider c881813872 docs(design): backend abstraction with internal/metal reorganisation

Approved design for restructuring go-mlx:
- Root package becomes clean interface (TextModel, LoadModel, Token)
- All CGO code moves to internal/metal/
- Deterministic memory management (Close + per-step cleanup)
- Error propagation instead of silent logging
- mlxlm/ backend placeholder for Python subprocess support

Includes API breaking change communication in FINDINGS.md and
memory management research tasks in cpp/TODO.md.

See: docs/plans/2026-02-19-backend-abstraction-design.md

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-19 19:12:04 +00:00

9.8 KiB

Raw Blame History

FINDINGS.md — go-mlx Research & Discovery

Record findings, gaps, and architectural decisions here as work progresses.

2026-02-19: Split from go-ai (Virgil)

Origin

This package was extracted from forge.lthn.ai/core/go-ai/mlx/. The split was motivated by:

Platform isolation — mlx is darwin/arm64 only with CGO + CMake build. Keeping it in go-ai forces the entire AI package to deal with platform-specific build complexity.
Dependency chain — go-i18n Phase 2a needs MLX inference for Gemma3-1B domain classification. A standalone go-mlx module can be imported directly without pulling in all of go-ai (DuckDB, Parquet, gRPC, Ollama, etc.).
Build tag simplicity — Every file is //go:build darwin && arm64. As a standalone module, this is clean. Inside go-ai, it was a special case that required careful handling.

What Was Extracted

Directory	Files	LOC	Purpose
Root (`mlx/`)	16	~2,500	Core MLX bindings, ops, training
`model/`	3	~800	Gemma3, Qwen3 model implementations
`tokenizer/`	1	~324	BPE tokenizer
`sample/`	1	~150	Sampling strategies
`cache/`	1	~201	KV cache for inference
Total	22	~4,354

Import Path Changes

All internal imports rewritten:

forge.lthn.ai/core/go-ai/mlx → forge.lthn.ai/core/go-mlx
forge.lthn.ai/core/go-ai/mlx/cache → forge.lthn.ai/core/go-mlx/cache
forge.lthn.ai/core/go-ai/mlx/tokenizer → forge.lthn.ai/core/go-mlx/tokenizer
forge.lthn.ai/core/go-ai/mlx/model → forge.lthn.ai/core/go-mlx/model
forge.lthn.ai/core/go-ai/mlx/sample → forge.lthn.ai/core/go-mlx/sample

Upstream Consumer

go-ai/ml/backend_mlx.go is the only file outside mlx/ that imports it. After split, go-ai needs either:

A replace directive: replace forge.lthn.ai/core/go-mlx => ../go-mlx
Or a published module version

What Stayed in go-ai

ml/backend_mlx.go (253 LOC) — the Backend adapter that calls go-mlx. This stays in go-ai because it implements the go-ai-specific Backend interface.
test-mlx.go — integration test utility (go-ai root). Needs updating to import from go-mlx.
TEST-RESULTS.md — comprehensive test report (stays as historical record).

2026-02-19: Test Coverage Assessment

Tested (3 test files)

File	Tests	Coverage
`grad_test.go`	VJP/gradient computation	Good — tests forward+backward pass
`lora_test.go`	LoRA adapter	Good — tests apply/merge/save
`optim_test.go`	AdamW optimiser	Good — tests step/state

Not Tested (critical gaps)

File	LOC	Risk	Notes
`ops.go`	353	High	MatMul, Softmax, element-wise ops — core of everything
`array.go`	261	High	Array creation, reshape, data access — foundational
`nn.go`	~150	Medium	Linear, Embedding, RMSNorm layers
`fast.go`	~100	Medium	Fused Metal kernels (RoPE, ScaledDotProduct)
`model/*.go`	~800	High	No tests for Gemma3/Qwen3 forward pass
`tokenizer/`	324	High	No BPE encode/decode tests
`sample/`	~150	Medium	No sampling tests
`cache/`	201	Medium	No KV cache tests
`io.go`	~100	Medium	No safetensors load tests

Error Handling

The error handler in mlx.go stores the last error in a C static variable and logs it via slog.Error. This is not propagated to Go callers. Functions like MatMul, Softmax, etc. return *Array with no error — if the C operation fails, the caller gets a nil/invalid array with no indication why.

Memory Management

Arrays use runtime.SetFinalizer for C-side deallocation. Under sustained inference (1000+ tokens), this relies on GC pressure to trigger finalizers. No explicit Close() or Free() method exists on Array — could leak under high throughput if GC doesn't keep up.

2026-02-19: Dependency Chain

go-i18n (Phase 2a: domain classification)
    └── needs Gemma3-1B inference
        └── go-mlx (this package)
            └── mlx-c v0.4.1 (CMake, fetched from GitHub)
                └── Apple MLX (Metal GPU compute)

go-ai/ml/backend_mlx.go
    └── imports go-mlx
        └── implements go-ai Backend interface

LEM Lab Connection

LEM Lab (the native MLX chat UI at localhost:8090) also uses this code path. Currently working with Qwen3-8B streaming. The model/ directory supports both Gemma3 and Qwen3.

2026-02-19: Hardware Test Results (from go-ai TEST-RESULTS.md)

Tested on Mac Studio M3 Ultra (32-core CPU, 60-core GPU, 96GB unified memory):

All 84 go-ai tests pass (including 3 mlx tests)
MLX grad, lora, optim tests all pass
Go 1.25.7, mlx-c v0.4.1

Model Inventory (safetensors)

Available on /Volumes/Data/lem/safetensors/:

Gemma3-1B, Gemma3-4B, Gemma3-27B
Qwen3-8B (used by LEM Lab)

2026-02-19: Go 1.26 Impact Assessment

Source: https://go.dev/doc/go1.26

High Impact (free performance, no code changes)

CGO call overhead reduced ~30% Every MLX operation (MatMul, Add, Softmax, RoPE, etc.) crosses the CGO boundary. The runtime previously used a dedicated syscall P state for cgo calls; Go 1.26 removes that and checks goroutine status instead. This is a direct, automatic performance win for the entire package.

Green Tea GC now default (10-40% less GC overhead) Critical for go-mlx because Array objects use runtime.SetFinalizer for C-side deallocation via mlx_*_free(). Reduced GC overhead means:

More timely finaliser execution during sustained inference
Less memory pressure from stale Array objects waiting for GC
The FINDINGS.md concern about "GC not keeping up under high throughput" is partially mitigated
Opt-out: GOEXPERIMENT=nogreenteagc (temporary, removed in 1.27)

Medium Impact

Slice stack allocation in more situations The compiler can now allocate slice backing stores on the stack more often. Benefits small temporary slices in Collect(), shape manipulation, and internal ops helpers. Debug: -compile=variablemakehash flag.

testing.B.Loop inlining fix When we add benchmarks (Phase 1), b.Loop() style now properly inlines loop bodies. Important for micro-benchmarks of small ops like Add, Multiply.

Heap base address randomisation (64-bit) Security improvement for CGO programs. Randomises heap base at startup. Disable: GOEXPERIMENT=norandomizedheapbase64.

Clarification on Range-over-func

Virgil's Phase 6 TODO mentions "if 1.26 stabilises range-over-func". Range-over-func has been stable since Go 1.23 and the iter package was added in 1.23. Since go.mod is already at Go 1.25.5, Array.Iter() iter.Seq[float32] can be implemented today without a version bump. Go 1.26 adds no new iterator features beyond what 1.23-1.25 provide.

Recommendation

No Go version bump needed for the performance wins — they're automatic at runtime. The only code-level Go 1.26 feature that matters is testing.ArtifactDir() for benchmark result storage, which is minor. Focus remains on Phase 1 hardening.

2026-02-19: go-ai Split Context

Virgil is splitting go-ai into sub-packages, with go-ai becoming a meta/catch-all for ML features. go-mlx was the first extraction. This means:

More packages will follow the go-mlx pattern (standalone module, own build, own tests)
go-ai will eventually be a thin layer importing sub-packages
The replace directive approach works for development; published modules for releases

2026-02-19: Floats()/DataInt32() Unsafe on Non-Contiguous Arrays

Discovery: Array.Floats() and Array.DataInt32() read Size() elements from the raw C data pointer (mlx_array_data_float32). For non-contiguous arrays (transpose, broadcast, slice views), the physical memory layout doesn't match the logical layout. Reading Size() contiguous elements returns incorrect data or reads past the physical buffer.

Affected operations: Transpose(), BroadcastTo(), SliceAxis(), Slice(), AsStrided() — any operation that creates a view rather than a copy.

Workaround: Reshape(arr, totalSize) forces a contiguous copy before reading flat data. All tests use this pattern for view operations.

Fix needed (Phase 4): Either:

Add a Contiguous() method that wraps mlx_contiguous (if available in mlx-c)
Or have Floats()/DataInt32() automatically force contiguity before reading
Document the behaviour clearly if views are intentionally lazy

This is a data correctness issue — silent wrong results, not a crash.

2026-02-19: Backend Abstraction — API Breaking Change

Design doc: docs/plans/2026-02-19-backend-abstraction-design.md

What's changing

The entire public API is being replaced. All CGO code moves to internal/metal/. The root package becomes a clean interface layer:

m, _ := mlx.LoadModel("/path/to/model/")
defer m.Close()
for tok := range m.Generate("prompt", mlx.WithMaxTokens(128)) {
    fmt.Print(tok.Text)
}

The old API (Array, MatMul, model.LoadModel, model.Model, etc.) will no longer be public.

Impact on go-ai

backend_mlx.go currently imports root-level Array, ops, model types. These move to internal/metal/ and become inaccessible. Migration: replace direct tensor manipulation with mlx.LoadModel() + mlx.TextModel.Generate().

Impact on go-i18n

The API for Gemma3-1B domain classification will be:

m, _ := mlx.LoadModel("/path/to/gemma-3-1b/")
for tok := range m.Generate(sentence, mlx.WithMaxTokens(32)) { ... }

Streaming via iter.Seq[Token]. No tokenisation or sampling to handle.

Memory leak fix included

The refactor includes deterministic memory management — TextModel.Close() for model weights and per-step intermediate cleanup during generation. This addresses the current production blocker.

9.8 KiB Raw Blame History