
# FINDINGS.md — go-mlx Research & Discovery

Record findings, gaps, and architectural decisions here as work progresses.

---

## 2026-02-19: Split from go-ai (Virgil)

### Origin

This package was extracted from `forge.lthn.ai/core/go-ai/mlx/`. The split was motivated by:

1. **Platform isolation** — mlx is darwin/arm64 only with CGO + CMake build. Keeping it in go-ai forces the entire AI package to deal with platform-specific build complexity.
2. **Dependency chain** — go-i18n Phase 2a needs MLX inference for Gemma3-1B domain classification. A standalone go-mlx module can be imported directly without pulling in all of go-ai (DuckDB, Parquet, gRPC, Ollama, etc.).
3. **Build tag simplicity** — Every file is `//go:build darwin && arm64`. As a standalone module, this is clean. Inside go-ai, it was a special case that required careful handling.

### What Was Extracted

| Directory | Files | LOC | Purpose |
|-----------|-------|-----|---------|
| Root (`mlx/`) | 16 | ~2,500 | Core MLX bindings, ops, training |
| `model/` | 3 | ~800 | Gemma3, Qwen3 model implementations |
| `tokenizer/` | 1 | ~324 | BPE tokenizer |
| `sample/` | 1 | ~150 | Sampling strategies |
| `cache/` | 1 | ~201 | KV cache for inference |
| **Total** | **22** | **~4,354** | |

### Import Path Changes

All internal imports rewritten:
- `forge.lthn.ai/core/go-ai/mlx` → `forge.lthn.ai/core/go-mlx`
- `forge.lthn.ai/core/go-ai/mlx/cache` → `forge.lthn.ai/core/go-mlx/cache`
- `forge.lthn.ai/core/go-ai/mlx/tokenizer` → `forge.lthn.ai/core/go-mlx/tokenizer`
- `forge.lthn.ai/core/go-ai/mlx/model` → `forge.lthn.ai/core/go-mlx/model`
- `forge.lthn.ai/core/go-ai/mlx/sample` → `forge.lthn.ai/core/go-mlx/sample`

### Upstream Consumer

`go-ai/ml/backend_mlx.go` is the only file outside `mlx/` that imports it. After the split, go-ai needs either:
- A `replace` directive: `replace forge.lthn.ai/core/go-mlx => ../go-mlx`
- Or a published module version

### What Stayed in go-ai

- `ml/backend_mlx.go` (253 LOC) — the Backend adapter that calls go-mlx. This stays in go-ai because it implements the go-ai-specific `Backend` interface.
- `test-mlx.go` — integration test utility (go-ai root). Needs updating to import from go-mlx.
- `TEST-RESULTS.md` — comprehensive test report (stays as historical record).

---

## 2026-02-19: Test Coverage Assessment

### Tested (3 test files)

| File | Tests | Coverage |
|------|-------|---------|
| `grad_test.go` | VJP/gradient computation | Good — tests forward+backward pass |
| `lora_test.go` | LoRA adapter | Good — tests apply/merge/save |
| `optim_test.go` | AdamW optimiser | Good — tests step/state |

### Not Tested (critical gaps)

| File | LOC | Risk | Notes |
|------|-----|------|-------|
| `ops.go` | 353 | **High** | MatMul, Softmax, element-wise ops — core of everything |
| `array.go` | 261 | **High** | Array creation, reshape, data access — foundational |
| `nn.go` | ~150 | Medium | Linear, Embedding, RMSNorm layers |
| `fast.go` | ~100 | Medium | Fused Metal kernels (RoPE, ScaledDotProduct) |
| `model/*.go` | ~800 | **High** | No tests for Gemma3/Qwen3 forward pass |
| `tokenizer/` | 324 | **High** | No BPE encode/decode tests |
| `sample/` | ~150 | Medium | No sampling tests |
| `cache/` | 201 | Medium | No KV cache tests |
| `io.go` | ~100 | Medium | No safetensors load tests |

### Error Handling

The error handler in `mlx.go` stores the last error in a C static variable and logs it via `slog.Error`. This is **not propagated to Go callers**. Functions like `MatMul`, `Softmax`, etc. return `*Array` with no error — if the C operation fails, the caller gets a nil/invalid array with no indication why.

### Memory Management

Arrays use `runtime.SetFinalizer` for C-side deallocation. Under sustained inference (1000+ tokens), this relies on GC pressure to trigger finalizers. No explicit `Close()` or `Free()` method exists on Array — could leak under high throughput if GC doesn't keep up.
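
One way to close this gap is an explicit, idempotent release method with the finalizer kept as a safety net. The sketch below is pure Go and only simulates the native side: `Array`, `wrap`, `Free`, and the `liveHandles` counter are hypothetical stand-ins, not the package's current API, and a real binding would call the C free function where the counter is decremented.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// liveHandles counts simulated native allocations that are not yet released.
var liveHandles atomic.Int64

// Array is a hypothetical wrapper around a native handle.
type Array struct {
	handle uintptr // placeholder for the C-side pointer
	freed  atomic.Bool
}

// wrap simulates taking ownership of a native handle. The finalizer is
// only a fallback for callers that never call Free.
func wrap(h uintptr) *Array {
	a := &Array{handle: h}
	liveHandles.Add(1)
	runtime.SetFinalizer(a, func(a *Array) { a.Free() })
	return a
}

// Free releases the simulated native resource exactly once.
// Calling it again, or on a nil receiver, is a no-op.
func (a *Array) Free() {
	if a == nil || !a.freed.CompareAndSwap(false, true) {
		return
	}
	runtime.SetFinalizer(a, nil) // no longer needed once freed explicitly
	liveHandles.Add(-1)          // a real binding would free the C object here
}

func main() {
	for i := 0; i < 1000; i++ {
		a := wrap(uintptr(i + 1))
		a.Free() // deterministic release, independent of GC timing
	}
	fmt.Println("live handles:", liveHandles.Load()) // prints "live handles: 0"
}
```

With this shape, a decode loop can release intermediates per step, and the finalizer only matters for arrays that escape without a `Free()` call.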

---

## 2026-02-19: Dependency Chain

```
go-i18n (Phase 2a: domain classification)
    └── needs Gemma3-1B inference
        └── go-mlx (this package)
            └── mlx-c v0.4.1 (CMake, fetched from GitHub)
                └── Apple MLX (Metal GPU compute)

go-ai/ml/backend_mlx.go
    └── imports go-mlx
        └── implements go-ai Backend interface
```

### LEM Lab Connection

LEM Lab (the native MLX chat UI at `localhost:8090`) also uses this code path. Currently working with Qwen3-8B streaming. The model/ directory supports both Gemma3 and Qwen3.

---

## 2026-02-19: Hardware Test Results (from go-ai TEST-RESULTS.md)

Tested on Mac Studio M3 Ultra (32-core CPU, 60-core GPU, 96GB unified memory):
- All 84 go-ai tests pass (including 3 mlx tests)
- MLX grad, lora, optim tests all pass
- Go 1.25.7, mlx-c v0.4.1

### Model Inventory (safetensors)

Available on `/Volumes/Data/lem/safetensors/`:
- Gemma3-1B, Gemma3-4B, Gemma3-27B
- Qwen3-8B (used by LEM Lab)

---

## 2026-02-19: Go 1.26 Impact Assessment

Source: https://go.dev/doc/go1.26

### High Impact (free performance, no code changes)

**CGO call overhead reduced ~30%**
Every MLX operation (MatMul, Add, Softmax, RoPE, etc.) crosses the CGO boundary. The runtime previously used a dedicated syscall P state for cgo calls; Go 1.26 removes that and checks goroutine status instead. This is a direct, automatic performance win for the entire package.

**Green Tea GC now default (10-40% less GC overhead)**
Critical for go-mlx because `Array` objects use `runtime.SetFinalizer` for C-side deallocation via `mlx_*_free()`. Reduced GC overhead means:
- More timely finaliser execution during sustained inference
- Less memory pressure from stale Array objects waiting for GC
- The concern noted under Memory Management above (GC not keeping up under high throughput) is partially mitigated
- Opt-out: `GOEXPERIMENT=nogreenteagc` at build time (temporary; the opt-out is expected to be removed in Go 1.27)

### Medium Impact

**Slice stack allocation in more situations**
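The compiler can now allocate slice backing stores on the stack more often. This benefits small temporary slices in `Collect()`, shape manipulation, and internal ops helpers. To track down a suspected problem, use the `bisect` tool with the `-compile=variablemake` flag; to turn the new stack allocations off entirely, build with `-gcflags=all=-d=variablemakehash=n`.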

**`testing.B.Loop` inlining fix**
When we add benchmarks (Phase 1), `b.Loop()` style now properly inlines loop bodies. Important for micro-benchmarks of small ops like Add, Multiply.
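
A minimal sketch of the `b.Loop()` style, assuming Go 1.24 or later (where `testing.B.Loop` was introduced). `addVec` is a hypothetical pure-Go stand-in for a small op; a real benchmark would live in a `_test.go` file as `BenchmarkAdd` and call the Metal-backed operation instead.

```go
package main

import (
	"fmt"
	"testing"
)

// addVec is a hypothetical stand-in for a small element-wise op.
func addVec(dst, a, b []float32) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

// benchmarkAdd has the same shape as a BenchmarkXxx function.
// Setup before the loop is excluded from timing by b.Loop.
func benchmarkAdd(b *testing.B) {
	x := make([]float32, 1024)
	y := make([]float32, 1024)
	dst := make([]float32, 1024)
	for b.Loop() {
		addVec(dst, x, y)
	}
}

func main() {
	// testing.Benchmark lets the sketch run outside "go test".
	res := testing.Benchmark(benchmarkAdd)
	fmt.Println(res.String())
}
```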

**Heap base address randomisation (64-bit)**
Security improvement for CGO programs. Randomises heap base at startup. Disable: `GOEXPERIMENT=norandomizedheapbase64`.

### Clarification on Range-over-func

Virgil's Phase 6 TODO mentions "if 1.26 stabilises range-over-func". **Range-over-func has been stable since Go 1.23** and the `iter` package was added in 1.23. Since go.mod is already at Go 1.25.5, `Array.Iter() iter.Seq[float32]` can be implemented today without a version bump. Go 1.26 adds no new iterator features beyond what 1.23-1.25 provide.

### Recommendation

No `go.mod` version bump is needed for the performance wins: they apply automatically once the package is built with the Go 1.26 toolchain. The only code-level Go 1.26 feature that matters is the new `ArtifactDir()` method on `testing.T`, `testing.B`, and `testing.F` for benchmark result storage, which is minor. Focus remains on Phase 1 hardening.

---

## 2026-02-19: go-ai Split Context

Virgil is splitting go-ai into sub-packages, with go-ai becoming a meta/catch-all for ML features. go-mlx was the first extraction. This means:
- More packages will follow the go-mlx pattern (standalone module, own build, own tests)
- go-ai will eventually be a thin layer importing sub-packages
- The `replace` directive approach works for development; published modules for releases

---

## 2026-02-19: Floats()/DataInt32() Unsafe on Non-Contiguous Arrays

**Discovery**: `Array.Floats()` and `Array.DataInt32()` read `Size()` elements from the raw C data pointer (`mlx_array_data_float32`). For non-contiguous arrays (transpose, broadcast, slice views), the physical memory layout doesn't match the logical layout. Reading `Size()` contiguous elements returns incorrect data or reads past the physical buffer.

**Affected operations**: `Transpose()`, `BroadcastTo()`, `SliceAxis()`, `Slice()`, `AsStrided()` — any operation that creates a view rather than a copy.

**Workaround**: `Reshape(arr, totalSize)` forces a contiguous copy before reading flat data. All tests use this pattern for view operations.

**Fix needed (Phase 4)**: One of:
1. Add a `Contiguous()` method that wraps `mlx_contiguous` (if available in mlx-c)
2. Have `Floats()`/`DataInt32()` automatically force contiguity before reading
3. Document the behaviour clearly if views are intentionally lazy

This is a data correctness issue — silent wrong results, not a crash.
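
The failure mode can be reproduced without MLX. The sketch below uses a hypothetical 2-D strided `view` type in pure Go: a transpose only swaps shape and strides, so the raw buffer is still in the original order, and only a stride-aware copy yields the logical layout.

```go
package main

import "fmt"

// view is a hypothetical 2-D strided view over a flat buffer.
type view struct {
	data    []float32
	shape   [2]int
	strides [2]int // measured in elements, not bytes
}

// transpose swaps the axes without copying; only metadata changes.
func (v view) transpose() view {
	return view{
		data:    v.data,
		shape:   [2]int{v.shape[1], v.shape[0]},
		strides: [2]int{v.strides[1], v.strides[0]},
	}
}

// contiguous walks the view in logical order and copies the elements
// into a fresh row-major buffer.
func (v view) contiguous() []float32 {
	out := make([]float32, 0, v.shape[0]*v.shape[1])
	for i := 0; i < v.shape[0]; i++ {
		for j := 0; j < v.shape[1]; j++ {
			out = append(out, v.data[i*v.strides[0]+j*v.strides[1]])
		}
	}
	return out
}

func main() {
	// 2x3 row-major matrix: [[1 2 3] [4 5 6]]
	m := view{data: []float32{1, 2, 3, 4, 5, 6}, shape: [2]int{2, 3}, strides: [2]int{3, 1}}
	t := m.transpose()

	fmt.Println(t.data)         // raw buffer, wrong for the view: [1 2 3 4 5 6]
	fmt.Println(t.contiguous()) // logical order of the transpose: [1 4 2 5 3 6]
}
```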

---

## 2026-02-19: Backend Abstraction — COMPLETED

**Design doc:** `docs/plans/2026-02-19-backend-abstraction-design.md`
**Implementation plan:** `docs/plans/2026-02-19-backend-abstraction-plan.md`

### What changed

The entire public API has been replaced. All CGO code is now in `internal/metal/`. The root package is a clean interface layer:

```go
m, err := mlx.LoadModel("/path/to/model/")
if err != nil { log.Fatal(err) } // do not call Close() on a model that failed to load
defer m.Close()
ctx := context.Background()
for tok := range m.Generate(ctx, "prompt", mlx.WithMaxTokens(128)) {
    fmt.Print(tok.Text)
}
if err := m.Err(); err != nil { log.Fatal(err) }
```

The old API (`Array`, `MatMul`, `model.LoadModel`, `model.Model`, sub-packages `model/`, `tokenizer/`, `sample/`, `cache/`) is no longer public. All moved to `internal/metal/`.

### Architecture note: import cycle resolution

`internal/metal/` cannot import the root package (circular dependency). Solution: internal/metal defines its own concrete types (`metal.Token`, `metal.GenerateConfig`, `metal.Model`), and `register_metal.go` in root provides a thin adapter (`metalAdapter`) that converts between root types (`mlx.Token`) and metal types.
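
The adapter idea can be sketched in a single file, with both "packages" collapsed into one for the sake of a runnable example. The field names (`ID`, `Text`) and the callback-style `Generate` signature are assumptions made for illustration; only the direction of the dependency matters here.

```go
package main

import "fmt"

// metalToken mimics the concrete token type owned by the internal package.
type metalToken struct {
	ID   int32
	Text string
}

// metalModel mimics an internal model; it never references root types.
type metalModel struct {
	tokens []metalToken
}

// Generate pushes tokens to yield until the consumer returns false.
func (m *metalModel) Generate(yield func(metalToken) bool) {
	for _, t := range m.tokens {
		if !yield(t) {
			return
		}
	}
}

// Token is the public type that callers of the root package would see.
type Token struct {
	ID   int32
	Text string
}

// metalAdapter lives in the root package and converts at the boundary,
// so the internal package has no reason to import the root.
type metalAdapter struct {
	inner *metalModel
}

// Generate wraps the internal generator and translates each token.
func (a *metalAdapter) Generate(yield func(Token) bool) {
	a.inner.Generate(func(t metalToken) bool {
		return yield(Token{ID: t.ID, Text: t.Text})
	})
}

func main() {
	inner := &metalModel{tokens: []metalToken{{ID: 1, Text: "Hello"}, {ID: 2, Text: ", world"}}}
	a := &metalAdapter{inner: inner}
	a.Generate(func(t Token) bool {
		fmt.Print(t.Text)
		return true
	})
	fmt.Println()
}
```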

### Impact on go-ml

`backend_mlx.go` must migrate from direct tensor manipulation to:
```go
m, _ := mlx.LoadModel(path)
ctx := context.Background()
for tok := range m.Generate(ctx, prompt, mlx.WithMaxTokens(n)) { ... }
if err := m.Err(); err != nil { ... }
```
253 LOC → ~60 LOC. Memory controls: `mlx.SetCacheLimit()`, `mlx.ClearCache()`, etc.

### Impact on go-i18n

```go
m, _ := mlx.LoadModel("/path/to/gemma-3-1b/")
ctx := context.Background()
for tok := range m.Generate(ctx, sentence, mlx.WithMaxTokens(32)) { ... }
```

### Memory management status

`Close()` stub is in place but does not yet explicitly free model weights. Per-step intermediate cleanup (`ClearCache()` per decode step) is implemented in the generate loop. Full deterministic cleanup awaits CLion Claude research on `mlx_array_free` safety (see `cpp/TODO.md`).

### Test results

- 148 existing tests moved to `internal/metal/` — all pass
- 7 new integration tests for public API — all pass
- Total: 155 tests passing

---

## 2026-02-19: Migration to go-inference Shared Interfaces

### What changed

go-mlx no longer defines its own `TextModel`, `Backend`, `Token`, `Message`, `GenerateConfig`, `GenerateOption`, `LoadConfig`, `LoadOption` types. These are now provided by `forge.lthn.ai/core/go-inference`, a zero-dependency shared interface package.

### Files removed

- `textmodel.go` — `Token`, `Message`, `TextModel` now in go-inference
- `options.go` — `GenerateConfig`, `GenerateOption`, `LoadConfig`, `LoadOption` now in go-inference
- `backend.go` — `Backend`, `Register`, `Get`, `Default`, `LoadModel` now in go-inference

### Files updated

- `register_metal.go` — implements `inference.Backend` (added `Available() bool`), adapts `inference.Token`/`inference.Message`
- `mlx_test.go` — all tests use `inference.*` types, added `TestListBackends`, `TestLoadOptions`, `TestLoadOptionsDefaults`
- `mlx.go` — package doc updated to show go-inference import pattern
- `go.mod` — added `forge.lthn.ai/core/go-inference` dependency (replace directive for local dev)
- `internal/metal/generate.go` — `GenerateConfig` gained `RepeatPenalty float32`

### What go-mlx still exports

- `MetalAvailable() bool` — convenience check
- `SetCacheLimit`, `SetMemoryLimit`, `GetActiveMemory`, `GetPeakMemory`, `ClearCache` — Metal-specific memory controls
- Side-effect import (`_ "forge.lthn.ai/core/go-mlx"`) registers the `"metal"` backend into go-inference's registry
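
The side-effect registration follows the same pattern as `database/sql` drivers. The sketch below shows the mechanism in one file; the `Backend` interface is reduced to two methods, and `Register`, `Get`, and `List` are simplified stand-ins whose signatures are assumptions rather than the actual go-inference API.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Backend is a reduced, hypothetical version of the shared interface.
type Backend interface {
	Name() string
	Available() bool
}

var (
	mu       sync.RWMutex
	backends = map[string]Backend{}
)

// Register adds a backend to the registry under its own name.
func Register(b Backend) {
	mu.Lock()
	defer mu.Unlock()
	backends[b.Name()] = b
}

// Get looks up a backend by name.
func Get(name string) (Backend, bool) {
	mu.RLock()
	defer mu.RUnlock()
	b, ok := backends[name]
	return b, ok
}

// List returns the registered backend names in sorted order.
func List() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(backends))
	for n := range backends {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// metalBackend is a placeholder; a real one would probe the hardware.
type metalBackend struct{}

func (metalBackend) Name() string    { return "metal" }
func (metalBackend) Available() bool { return false }

// init is what a blank import triggers: registration with no exported call.
func init() { Register(metalBackend{}) }

func main() {
	fmt.Println(List()) // prints [metal]
}
```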

### Consumer migration

Before:
```go
import "forge.lthn.ai/core/go-mlx"
m, _ := mlx.LoadModel(path)
for tok := range m.Generate(ctx, prompt, mlx.WithMaxTokens(128)) { ... }
```

After:
```go
import (
    "forge.lthn.ai/core/go-inference"
    _ "forge.lthn.ai/core/go-mlx" // register Metal backend
)
m, _ := inference.LoadModel(path)
for tok := range m.Generate(ctx, prompt, inference.WithMaxTokens(128)) { ... }
```

### New go-inference features available

- `inference.List()` — returns all registered backend names
- `inference.Backend.Available()` — hardware availability check
- `inference.WithRepeatPenalty(p)` — repetition penalty option
- `inference.WithContextLen(n)` — context window size
- `inference.WithGPULayers(n)` — GPU layer offload control (-1 = all)
- `inference.LoadConfig.GPULayers` defaults to -1 (full GPU offload)
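
These options follow the functional-options pattern. A minimal sketch of how such options might be applied, assuming a `LoadConfig` with `ContextLen` and `GPULayers` fields (the `ContextLen` field name and the `applyLoadOptions` helper are hypothetical; only the `-1` default for `GPULayers` comes from the notes above):

```go
package main

import "fmt"

// LoadConfig is a reduced, hypothetical version of the load settings.
type LoadConfig struct {
	ContextLen int // 0 means "use the model default" in this sketch
	GPULayers  int // -1 means offload every layer to the GPU
}

// LoadOption mutates a LoadConfig.
type LoadOption func(*LoadConfig)

// WithContextLen sets the context window size.
func WithContextLen(n int) LoadOption {
	return func(c *LoadConfig) { c.ContextLen = n }
}

// WithGPULayers sets how many layers are offloaded (-1 = all).
func WithGPULayers(n int) LoadOption {
	return func(c *LoadConfig) { c.GPULayers = n }
}

// applyLoadOptions starts from the defaults and applies options in order,
// so later options override earlier ones.
func applyLoadOptions(opts ...LoadOption) LoadConfig {
	cfg := LoadConfig{GPULayers: -1}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", applyLoadOptions())
	// prints {ContextLen:0 GPULayers:-1}
	fmt.Printf("%+v\n", applyLoadOptions(WithContextLen(4096), WithGPULayers(8)))
	// prints {ContextLen:4096 GPULayers:8}
}
```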

### Test results

- 148 internal/metal tests — all pass
- 11 root integration tests — all pass
- Total: 159 tests passing

---

## 2026-02-19: CLion Claude Research Applied

### Contiguous Array Fix (data correctness bug)

`Floats()`, `DataInt32()`, and `Ints()` now automatically handle non-contiguous arrays. Previously, reading data from view arrays (Transpose, BroadcastTo, SliceAxis) returned silently wrong results.

**Fix**: Bound `mlx_contiguous` and `_mlx_array_is_row_contiguous` from mlx-c. The `ensureContiguous()` helper checks `IsRowContiguous()` and makes a contiguous copy when needed before accessing the raw data pointer.

The old workaround of `Reshape(arr, totalSize)` to force contiguity is no longer needed.

### TopP (Nucleus) Sampling Implemented

Was a stub that passed logits through unchanged. Now fully implemented:
1. Softmax to get probabilities
2. Argsort descending to get sorted indices
3. CumSum of sorted probabilities
4. Mask tokens where cumulative probability (excluding current) exceeds threshold
5. Scatter mask back to original positions via PutAlongAxis + argsort indices

### MinP Sampling Implemented

Was a stub. Now masks tokens whose probability is below `min_p * max_prob`. Uses MaxAxis to find the peak probability per position.

### New Bindings

| Function | Header | Purpose |
|----------|--------|---------|
| `mlx_contiguous` | ops.h | Force row-major contiguous layout |
| `_mlx_array_is_row_contiguous` | array.h | Check contiguity without copying |
| `mlx_cumsum` | ops.h | Cumulative sum (forward/reverse, inclusive/exclusive) |
| `mlx_sort_axis` | ops.h | Sort along axis |
| `mlx_argsort_axis` | ops.h | Indices that would sort |
| `mlx_greater` | ops.h | Element-wise comparison |
| `mlx_max_axis` | ops.h | Maximum along axis |
| `mlx_get_cache_memory` | memory.h | Current allocator cache size |
| `mlx_reset_peak_memory` | memory.h | Reset peak memory tracking |
| `mlx_set_wired_limit` | memory.h | Wired memory limit control |
| `mlx_metal_device_info` | metal.h | GPU hardware info |

### Test results

- 165 internal/metal tests — all pass
- 11 root integration tests — all pass
- Total: 176 tests passing

---

## 2026-02-19: Benchmark Baseline — M3 Ultra

29 benchmarks in `internal/metal/bench_test.go`. All times in ns/op, measured with `go test -bench=. -benchtime=2s`.

### Matrix Multiply

| Shape | ns/op | Notes |
|-------|------:|-------|
| 128×128 | 194,467 | CGO overhead dominates at small sizes |
| 512×512 | 255,288 | GPU starting to amortise |
| 1024×1024 | 474,900 | Sweet spot for Metal throughput |
| 2048×2048 | 4,173,797 | ~4ms — good for decode step |
| 4096×4096 | 10,715,051 | ~10.7ms — large context attention |
| 1×2048 → 32000 (token proj) | 626,087 | Output projection per token |

### Fused Metal Kernels

| Operation | Shape | ns/op |
|-----------|-------|------:|
| RMSNorm | 1×2048 | 156,696 |
| RMSNorm | 32×2048 | 225,164 |
| LayerNorm | 32×2048 | 184,514 |
| RoPE | 1×1×32×128 (decode) | 176,605 |
| RoPE | 1×32×512×128 (prefill) | 1,443,803 |
| SDPA causal | 1 head, seq=32 | 200,926 |
| SDPA causal | 32 heads, seq=128 | 515,477 |
| SDPA causal | 32 heads, seq=512 | 1,815,073 |

### Softmax & Reductions

| Operation | Shape | ns/op |
|-----------|-------|------:|
| Softmax | 1×1024 | 173,811 |
| Softmax | 32×32000 | 948,660 |
| Softmax | 1×128000 | 270,022 |
| Sum | 1M elements | 175,204 |
| Argmax | 1×32000 | 171,327 |

### Element-wise (1M elements)

| Operation | ns/op |
|-----------|------:|
| Add | 651,687 |
| Mul | 394,941 |
| SiLU | 1,192,843 |

### Layers

| Operation | Shape | ns/op |
|-----------|-------|------:|
| Linear | 1×2048 → 2048 | 181,417 |
| Linear | 32×2048 → 8192 | 471,038 |
| Embedding | 32 tokens, 32K vocab, 2048 dim | 219,154 |

### Sampling (vocab=32000)

| Strategy | ns/op |
|----------|------:|
| Greedy (argmax) | 172,698 |
| TopK=50, temp=1.0 | 542,635 |
| TopP=0.9, temp=1.0 | 713,538 |
| Full (TopP+MinP+TopK) | 731,118 |

### Key Observations

1. **CGO floor ~170μs**: All operations have a ~170μs minimum (greedy sample, RMSNorm single row, Sum 1M). This is the CGO call + Metal command buffer overhead.
2. **MatMul scales well**: 128² → 4096² is only ~55× slower for 1024× more output elements (and 32,768× more multiply-adds, since matmul work grows as n³), showing good GPU utilisation.
3. **SDPA efficient**: 32-head seq=512 attention at 1.8ms is practical for real-time inference.
4. **Sampling overhead**: Full chain (TopP+MinP+TopK) adds ~560μs over greedy — acceptable per token.
5. **Linear layer**: Single-token forward through 2048→2048 at 181μs suggests ~5500 layers/sec ceiling for per-token decode.

---

## 2026-02-19: Error Handling Audit — COMPLETED

### What changed

The old `checkError()` function logged errors via `slog.Error` and swallowed them. The mlx-c error handler (`mlx_go_error_handler`) stored the error string in a C static variable, but no Go code read it back as an error value.

### New error model

1. **`lastError() error`** — reads and clears the C-level error string. Returns `fmt.Errorf("mlx: %s", msg)` or nil. All callers now get real MLX error messages instead of generic "failed" strings.

2. **`Eval(...*Array) error`** — error-returning variant of Materialize. Checks `mlx_eval` return code and surfaces errors through `lastError()`. The key error surface for GPU computation failures (OOM, invalid graph, etc.).

3. **`EvalAsync(...*Array) error`** — same for async evaluation.

4. **`Materialize()` unchanged** — delegates to `Eval()`, logs errors. 237 existing callsites (mostly tests) unchanged.
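
The read-and-clear behaviour is the important part of this model. The sketch below simulates it in pure Go, with a mutex-guarded string standing in for the C static variable. `recordError` and `eval` are hypothetical stand-ins for the native error handler and `mlx_eval`, and the fallback message is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errMu   sync.Mutex
	lastMsg string // stands in for the C-level error string
)

// recordError mimics what the native error handler would do.
func recordError(msg string) {
	errMu.Lock()
	lastMsg = msg
	errMu.Unlock()
}

// lastError reads and clears the stored message in one critical section,
// so a stale error cannot leak into a later, unrelated call.
func lastError() error {
	errMu.Lock()
	defer errMu.Unlock()
	if lastMsg == "" {
		return nil
	}
	err := fmt.Errorf("mlx: %s", lastMsg)
	lastMsg = ""
	return err
}

// eval simulates a native call that returns a non-zero code on failure.
func eval(fail bool) error {
	rc := 0
	if fail {
		recordError("out of memory (simulated)")
		rc = 1
	}
	if rc != 0 {
		if err := lastError(); err != nil {
			return err
		}
		return errors.New("mlx: eval failed") // no message was recorded
	}
	return nil
}

func main() {
	fmt.Println(eval(true))  // prints "mlx: out of memory (simulated)"
	fmt.Println(lastError()) // prints "<nil>" because the error was cleared
	fmt.Println(eval(false)) // prints "<nil>"
}
```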

### Error propagation paths

| Path | Before | After |
|------|--------|-------|
| Generate loop | Silent failure | `Eval()` → `m.lastErr` → `model.Err()` |
| Safetensors load | Silent zero results | `LoadAllSafetensors` returns error; `LoadSafetensors` checks rc |
| Model load (gemma3/qwen3) | Missing weights panic | Check `lastError()` after each safetensors file |
| VJP/JVP/ValueAndGrad | Generic "vjp failed" | Real MLX error message via `lastError()` |
| LoRA save | Generic "save failed" | Real MLX error message via `lastError()` |

### C-level changes

- Removed `fprintf(stderr, ...)` from error handler — errors now surface through Go, not stderr
- Added `get_and_clear_last_error()` — reads and atomically clears the stored error string
- Old `get_last_error()` removed (never cleared, so stale errors persisted across operations)

### Test results

- 180 tests passing (176 existing + 4 new error handling tests)
- New tests: Eval success, Eval nil safety, lastError no-error, LoadAllSafetensors missing file