# FINDINGS.md — go-mlx Research & Discovery

Record findings, gaps, and architectural decisions here as work progresses.

---

## 2026-02-19: Split from go-ai (Virgil)

### Origin

This package was extracted from `forge.lthn.ai/core/go-ai/mlx/`. The split was motivated by:

1. **Platform isolation** — mlx is darwin/arm64 only with CGO + CMake build. Keeping it in go-ai forces the entire AI package to deal with platform-specific build complexity.
2. **Dependency chain** — go-i18n Phase 2a needs MLX inference for Gemma3-1B domain classification. A standalone go-mlx module can be imported directly without pulling in all of go-ai (DuckDB, Parquet, gRPC, Ollama, etc.).
3. **Build tag simplicity** — Every file is `//go:build darwin && arm64`. As a standalone module, this is clean. Inside go-ai, it was a special case that required careful handling.

### What Was Extracted

| Directory | Files | LOC | Purpose |
|-----------|-------|-----|---------|
| Root (`mlx/`) | 16 | ~2,500 | Core MLX bindings, ops, training |
| `model/` | 3 | ~800 | Gemma3, Qwen3 model implementations |
| `tokenizer/` | 1 | ~324 | BPE tokenizer |
| `sample/` | 1 | ~150 | Sampling strategies |
| `cache/` | 1 | ~201 | KV cache for inference |
| **Total** | **22** | **~4,354** | |

### Import Path Changes

All internal imports rewritten:
- `forge.lthn.ai/core/go-ai/mlx` → `forge.lthn.ai/core/go-mlx`
- `forge.lthn.ai/core/go-ai/mlx/cache` → `forge.lthn.ai/core/go-mlx/cache`
- `forge.lthn.ai/core/go-ai/mlx/tokenizer` → `forge.lthn.ai/core/go-mlx/tokenizer`
- `forge.lthn.ai/core/go-ai/mlx/model` → `forge.lthn.ai/core/go-mlx/model`
- `forge.lthn.ai/core/go-ai/mlx/sample` → `forge.lthn.ai/core/go-mlx/sample`

### Upstream Consumer

`go-ai/ml/backend_mlx.go` is the only file outside `mlx/` that imports it. After the split, go-ai needs either:
- A `replace` directive: `replace forge.lthn.ai/core/go-mlx => ../go-mlx`
- Or a published module version

### What Stayed in go-ai

- `ml/backend_mlx.go` (253 LOC) — the Backend adapter that calls go-mlx. This stays in go-ai because it implements the go-ai-specific `Backend` interface.
- `test-mlx.go` — integration test utility (go-ai root). Needs updating to import from go-mlx.
- `TEST-RESULTS.md` — comprehensive test report (stays as historical record).

---

## 2026-02-19: Test Coverage Assessment

### Tested (3 test files)

| File | Tests | Coverage |
|------|-------|---------|
| `grad_test.go` | VJP/gradient computation | Good — tests forward+backward pass |
| `lora_test.go` | LoRA adapter | Good — tests apply/merge/save |
| `optim_test.go` | AdamW optimiser | Good — tests step/state |

### Not Tested (critical gaps)

| File | LOC | Risk | Notes |
|------|-----|------|-------|
| `ops.go` | 353 | **High** | MatMul, Softmax, element-wise ops — core of everything |
| `array.go` | 261 | **High** | Array creation, reshape, data access — foundational |
| `nn.go` | ~150 | Medium | Linear, Embedding, RMSNorm layers |
| `fast.go` | ~100 | Medium | Fused Metal kernels (RoPE, ScaledDotProduct) |
| `model/*.go` | ~800 | **High** | No tests for Gemma3/Qwen3 forward pass |
| `tokenizer/` | 324 | **High** | No BPE encode/decode tests |
| `sample/` | ~150 | Medium | No sampling tests |
| `cache/` | 201 | Medium | No KV cache tests |
| `io.go` | ~100 | Medium | No safetensors load tests |

### Error Handling

The error handler in `mlx.go` stores the last error in a C static variable and logs it via `slog.Error`. This is **not propagated to Go callers**. Functions like `MatMul`, `Softmax`, etc. return `*Array` with no error — if the C operation fails, the caller gets a nil/invalid array with no indication why.
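
One possible shape for propagating these errors is sketched below. It is a pure-Go simulation, not the package's code: `lastErr`, `recordError`, `takeError`, and `matMulShape` are hypothetical names, and the global slot merely stands in for the C static variable. The idea is that each op wrapper drains the pending error after the C call and returns it alongside the result.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lastErr stands in for the C static variable that the error handler
// writes to. It is hypothetical and exists only for this sketch.
var (
	mu      sync.Mutex
	lastErr error
)

// recordError plays the role of the C error callback.
func recordError(msg string) {
	mu.Lock()
	defer mu.Unlock()
	lastErr = errors.New(msg)
}

// takeError returns the pending error, if any, and clears it so that a
// later successful call is not blamed for an earlier failure.
func takeError() error {
	mu.Lock()
	defer mu.Unlock()
	err := lastErr
	lastErr = nil
	return err
}

// matMulShape is a stand-in for an op wrapper. It only validates shapes
// and reports a failure through recordError, as the C side would.
func matMulShape(a, b [2]int) ([2]int, error) {
	if a[1] != b[0] {
		recordError(fmt.Sprintf("matmul: inner dimensions differ (%d vs %d)", a[1], b[0]))
	}
	if err := takeError(); err != nil {
		return [2]int{}, err
	}
	return [2]int{a[0], b[1]}, nil
}

func main() {
	if _, err := matMulShape([2]int{2, 3}, [2]int{4, 5}); err != nil {
		fmt.Println("error:", err)
	}
	shape, err := matMulShape([2]int{2, 3}, [2]int{3, 5})
	fmt.Println(shape, err)
}
```

A single global slot can attribute an error to the wrong caller when ops run concurrently, so a real design would likely need per-call or per-stream error state.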

### Memory Management

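Arrays use `runtime.SetFinalizer` for C-side deallocation. Under sustained inference (1000+ tokens), this relies on GC pressure to trigger finalisers, but the Go GC only tracks the Go heap, so the small `Array` wrappers generate little pressure even when the C-side buffers they own are large. No explicit `Close()` or `Free()` method exists on `Array`, so native memory could leak under high throughput if the GC doesn't keep up.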
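
A common remedy is an explicit, idempotent `Free()` with the finaliser kept only as a safety net. The sketch below simulates that pattern in pure Go under stated assumptions: `handle`, `newHandle`, and `liveHandles` are hypothetical, and the counter stands in for real C allocations.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// liveHandles counts simulated native allocations not yet released.
var liveHandles atomic.Int64

// handle is a hypothetical stand-in for Array: a small Go wrapper that
// owns a native resource.
type handle struct {
	id   int64 // stands in for the C pointer
	once sync.Once
}

func newHandle() *handle {
	h := &handle{id: liveHandles.Add(1)} // stands in for the C-side allocation
	// Safety net: release the resource if the caller forgets to call Free.
	runtime.SetFinalizer(h, (*handle).Free)
	return h
}

// Free releases the native resource exactly once. It is safe to call
// repeatedly, from user code or from the finaliser.
func (h *handle) Free() {
	h.once.Do(func() {
		liveHandles.Add(-1) // stands in for the C-side free call
		runtime.SetFinalizer(h, nil)
	})
}

func main() {
	for i := 0; i < 1000; i++ {
		h := newHandle()
		h.Free() // deterministic release at the end of each step
	}
	fmt.Println("live handles:", liveHandles.Load())
}
```

With explicit release, the live count stays flat regardless of when the GC runs; the finaliser only matters for handles the caller forgot to free.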

---

## 2026-02-19: Dependency Chain

```
go-i18n (Phase 2a: domain classification)
    └── needs Gemma3-1B inference
        └── go-mlx (this package)
            └── mlx-c v0.4.1 (CMake, fetched from GitHub)
                └── Apple MLX (Metal GPU compute)

go-ai/ml/backend_mlx.go
    └── imports go-mlx
        └── implements go-ai Backend interface
```

### LEM Lab Connection

LEM Lab (the native MLX chat UI at `localhost:8090`) also uses this code path. Currently working with Qwen3-8B streaming. The model/ directory supports both Gemma3 and Qwen3.

---

## 2026-02-19: Hardware Test Results (from go-ai TEST-RESULTS.md)

Tested on Mac Studio M3 Ultra (32-core CPU, 60-core GPU, 96GB unified memory):
- All 84 go-ai tests pass (including 3 mlx tests)
- MLX grad, lora, optim tests all pass
- Go 1.25.7, mlx-c v0.4.1

### Model Inventory (safetensors)

Available on `/Volumes/Data/lem/safetensors/`:
- Gemma3-1B, Gemma3-4B, Gemma3-27B
- Qwen3-8B (used by LEM Lab)

---

## 2026-02-19: Go 1.26 Impact Assessment

Source: https://go.dev/doc/go1.26

### High Impact (free performance, no code changes)

**CGO call overhead reduced ~30%**
Every MLX operation (MatMul, Add, Softmax, RoPE, etc.) crosses the CGO boundary. The runtime previously used a dedicated syscall P state for cgo calls; Go 1.26 removes that and checks goroutine status instead. This is a direct, automatic performance win for the entire package.

**Green Tea GC now default (10-40% less GC overhead)**
Critical for go-mlx because `Array` objects use `runtime.SetFinalizer` for C-side deallocation via `mlx_*_free()`. Reduced GC overhead means:
- More timely finaliser execution during sustained inference
- Less memory pressure from stale Array objects waiting for GC
- The Memory Management concern above (the GC not keeping up under high throughput) is partially mitigated
- Opt-out: `GOEXPERIMENT=nogreenteagc` at build time (temporary; expected to be removed in Go 1.27)

### Medium Impact

**Slice stack allocation in more situations**
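The compiler can now allocate slice backing stores on the stack more often. This benefits small temporary slices in `Collect()`, shape manipulation, and internal ops helpers. To track down a suspected regression, use the `bisect` tool with `-compile=variablemake`; to turn the new stack allocations off, build with `-gcflags=all=-d=variablemakehash=n`.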

**`testing.B.Loop` inlining fix**
When we add benchmarks (Phase 1), `b.Loop()` style now properly inlines loop bodies. Important for micro-benchmarks of small ops like Add, Multiply.
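
As a rough illustration of the `b.Loop()` style, the sketch below benchmarks a pure-Go stand-in for a small element-wise op. `addVec`, `benchAdd`, and `runBenchAdd` are hypothetical names; a real benchmark would live in a `_test.go` file and call the package's own ops. It needs Go 1.24 or later, where `B.Loop` was introduced.

```go
package main

import (
	"fmt"
	"testing"
)

// addVec is a stand-in for a small element-wise op.
func addVec(dst, a, b []float32) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

// benchAdd uses b.Loop, which runs setup once and times only the loop body.
func benchAdd(b *testing.B) {
	x := make([]float32, 256)
	y := make([]float32, 256)
	out := make([]float32, 256)
	for b.Loop() {
		addVec(out, x, y)
	}
}

// runBenchAdd runs the benchmark outside "go test" and returns the
// iteration count.
func runBenchAdd() int {
	return testing.Benchmark(benchAdd).N
}

func main() {
	fmt.Println("iterations > 0:", runBenchAdd() > 0)
}
```

In a test file the same body would simply be `func BenchmarkAdd(b *testing.B)`, run with `go test -bench=Add`.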

**Heap base address randomisation (64-bit)**
Security improvement for CGO programs. Randomises heap base at startup. Disable: `GOEXPERIMENT=norandomizedheapbase64`.

### Clarification on Range-over-func

Virgil's Phase 6 TODO mentions "if 1.26 stabilises range-over-func". **Range-over-func has been stable since Go 1.23** and the `iter` package was added in 1.23. Since go.mod is already at Go 1.25.5, `Array.Iter() iter.Seq[float32]` can be implemented today without a version bump. Go 1.26 adds no new iterator features beyond what 1.23-1.25 provide.

### Recommendation

No `go.mod` version bump is needed for the performance wins: they apply automatically once the module is built with the Go 1.26 toolchain. The only code-level Go 1.26 feature that matters is the new `ArtifactDir()` method on `testing.T`, `testing.B`, and `testing.F` for benchmark result storage, which is minor. Focus remains on Phase 1 hardening.

---

## 2026-02-19: go-ai Split Context

Virgil is splitting go-ai into sub-packages, with go-ai becoming a meta/catch-all for ML features. go-mlx was the first extraction. This means:
- More packages will follow the go-mlx pattern (standalone module, own build, own tests)
- go-ai will eventually be a thin layer importing sub-packages
- The `replace` directive approach works for development; published modules for releases

---

## 2026-02-19: Floats()/DataInt32() Unsafe on Non-Contiguous Arrays

**Discovery**: `Array.Floats()` and `Array.DataInt32()` read `Size()` elements from the raw C data pointer (`mlx_array_data_float32`). For non-contiguous arrays (transpose, broadcast, slice views), the physical memory layout doesn't match the logical layout. Reading `Size()` contiguous elements returns incorrect data or reads past the physical buffer.

**Affected operations**: `Transpose()`, `BroadcastTo()`, `SliceAxis()`, `Slice()`, `AsStrided()` — any operation that creates a view rather than a copy.

**Workaround**: `Reshape(arr, totalSize)` forces a contiguous copy before reading flat data. All tests use this pattern for view operations.

**Fix needed (Phase 4)**: Either:
1. Add a `Contiguous()` method that wraps `mlx_contiguous` (if available in mlx-c)
2. Or have `Floats()`/`DataInt32()` automatically force contiguity before reading
3. Document the behaviour clearly if views are intentionally lazy

This is a data correctness issue — silent wrong results, not a crash.

---

## 2026-02-19: Backend Abstraction — API Breaking Change

**Design doc:** `docs/plans/2026-02-19-backend-abstraction-design.md`

### What's changing

The entire public API is being replaced. All CGO code moves to `internal/metal/`. The root package becomes a clean interface layer:

```go
// Assumes the second return value of LoadModel is an error.
m, err := mlx.LoadModel("/path/to/model/")
if err != nil {
    log.Fatal(err)
}
defer m.Close()
for tok := range m.Generate("prompt", mlx.WithMaxTokens(128)) {
    fmt.Print(tok.Text)
}
```

The old API (`Array`, `MatMul`, `model.LoadModel`, `model.Model`, etc.) will no longer be public.

### Impact on go-ai

`backend_mlx.go` currently imports root-level Array, ops, model types. These move to `internal/metal/` and become inaccessible. Migration: replace direct tensor manipulation with `mlx.LoadModel()` + `mlx.TextModel.Generate()`.

### Impact on go-i18n

The API for Gemma3-1B domain classification will be:
```go
m, _ := mlx.LoadModel("/path/to/gemma-3-1b/")
for tok := range m.Generate(sentence, mlx.WithMaxTokens(32)) { ... }
```
Streaming via `iter.Seq[Token]`. No tokenisation or sampling to handle.

### Memory leak fix included

The refactor includes deterministic memory management — `TextModel.Close()` for model weights and per-step intermediate cleanup during generation. This addresses the current production blocker.