Commit graph

15 commits

Author SHA1 Message Date
Claude
6b603ee20b fix: remove Go-side array ref tracking, rely on MLX-C refcounting
The Go wrapper was tracking inter-array references via desc.inputs,
creating chains that kept all intermediate arrays alive across requests.
After 3-4 requests, Metal memory grew to 170GB+ and macOS killed the
process.

Fix: remove desc.inputs/numRefs entirely. MLX-C has its own internal
reference counting — when Go GC finalizes an Array wrapper, it calls
mlx_array_free which decrements the C-side refcount. If the C-side
count reaches 0, Metal memory is freed. Go GC + MLX-C refcounting
together handle all lifecycle management correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
8cdafc8d66 fix: add GC-based memory management for MLX array handles
Go GC cannot see Metal/C memory pressure, so intermediate arrays from
each forward pass accumulated without bound, causing OOM kills after
3-4 requests. Fix: runtime.SetFinalizer on every Array releases C
handles when GC collects them, and runtime.GC() is forced every 4
tokens during generation. Also adds SetMemoryLimit(24GB) as a hard
Metal ceiling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
9688e086ca fix: add Metal cache management to prevent memory growth
- Add ClearCache() wrapping mlx_clear_cache
- Clear Metal allocator cache every 8 tokens during generation
- Set 16GB cache limit on backend init
- Prevents GPU memory from growing unbounded during inference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
098f496364 fix: correct SDPA mask mode and slice logits to last position 2026-02-16 05:53:52 +00:00
Claude
09da05d799 fix: use affine quantization mode and infer head_dim from weights 2026-02-16 05:53:52 +00:00
Claude
d3c31aa5a6 debug: add shape logging and stderr error handler for inference debugging 2026-02-16 05:53:52 +00:00
Claude
56c6e2fa8d feat: support quantized inference (4-bit) for Gemma 3
- Add QuantizedLinear with QuantizedMatmul for packed uint32 weights
- Add quantized Embedding with Dequantize before lookup
- Parse quantization config (group_size, bits) from config.json
- Detect .scales/.biases weight tensors and auto-select quantized path
- Add Dequantize op wrapping mlx_dequantize
- Add safety guard to KVCache.Update for malformed shapes
- Handle tied embeddings with quantization (AsLinear helper)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
a4fde16998 fix: handle both string and array merge formats in tokenizer
Gemma 3 tokenizer.json uses [["a","b"],...] format for merges
instead of the ["a b",...] format. Support both.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
2a67653bf7 feat: handle nested text_config and language_model weight prefix
Supports both multimodal (Gemma3ForConditionalGeneration) and
text-only configs. Resolves weights with language_model. prefix
fallback. Computes head_dim from hidden_size when missing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
9ae86017f4 chore: target macOS 26.0, fix duplicate -lstdc++ linker warning
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
e9d9a3c3a0 fix: remove unused vars in TopP sampler placeholder
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
a0f77960a1 fix: resolve CGo type conflict in error handler
Use pure C callback instead of //export to avoid const char* vs
GoString type mismatch in cgo-generated headers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
5e2d941b4d fix: correct 20 mlx-c API mismatches for v0.4.1
- Use _axis/_axes variants for softmax, argmax, topk, sum, mean, squeeze,
  concatenate, argpartition
- Fix size_t vs int for count parameters throughout
- Fix int64_t strides in as_strided
- Add mlx_optional_int + mode param to quantized_matmul
- Use mlx_array_new() for null arrays (freqs, key, mask, sinks)
- Fix expand_dims to single-axis signature
- Fix compile callback signature (size_t index)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
c6597691bb fix: correct mlx_closure_new_func_payload signature for mlx-c v0.4.1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00
Claude
bc28aad526 feat: add native MLX backend for Apple Silicon inference (pkg/mlx)
CGo wrapper for mlx-c providing zero-Python Metal GPU inference.
Includes Gemma 3 model architecture, BPE tokenizer, KV cache,
composable sampling, and OpenAI-compatible serve command.

Build-tagged (darwin && arm64 && mlx) with stubs for cross-platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 05:53:52 +00:00