core/cli - Lethean Network

Author	SHA1	Message	Date
Claude	6b603ee20b	fix: remove Go-side array ref tracking, rely on MLX-C refcounting. The Go wrapper was tracking inter-array references via desc.inputs, creating chains that kept all intermediate arrays alive across requests. After 3-4 requests, Metal memory grew to 170GB+ and macOS killed the process. Fix: remove desc.inputs/numRefs entirely. MLX-C has its own internal reference counting: when Go GC finalizes an Array wrapper, it calls mlx_array_free, which decrements the C-side refcount. If the C-side count reaches 0, Metal memory is freed. Go GC + MLX-C refcounting together handle all lifecycle management correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	8cdafc8d66	fix: add GC-based memory management for MLX array handles. Go GC cannot see Metal/C memory pressure, so intermediate arrays from each forward pass accumulated without bound, causing OOM kills after 3-4 requests. Fix: runtime.SetFinalizer on every Array releases C handles when GC collects them, and runtime.GC() is forced every 4 tokens during generation. Also adds SetMemoryLimit(24GB) as a hard Metal ceiling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	9688e086ca	fix: add Metal cache management to prevent memory growth. Add ClearCache() wrapping mlx_clear_cache; Clear Metal allocator cache every 8 tokens during generation; Set 16GB cache limit on backend init; Prevents GPU memory from growing unbounded during inference. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	098f496364	fix: correct SDPA mask mode and slice logits to last position	2026-02-16 05:53:52 +00:00
Claude	09da05d799	fix: use affine quantization mode and infer head_dim from weights	2026-02-16 05:53:52 +00:00
Claude	d3c31aa5a6	debug: add shape logging and stderr error handler for inference debugging	2026-02-16 05:53:52 +00:00
Claude	56c6e2fa8d	feat: support quantized inference (4-bit) for Gemma 3. Add QuantizedLinear with QuantizedMatmul for packed uint32 weights; Add quantized Embedding with Dequantize before lookup; Parse quantization config (group_size, bits) from config.json; Detect .scales/.biases weight tensors and auto-select quantized path; Add Dequantize op wrapping mlx_dequantize; Add safety guard to KVCache.Update for malformed shapes; Handle tied embeddings with quantization (AsLinear helper). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	a4fde16998	fix: handle both string and array merge formats in tokenizer. Gemma 3 tokenizer.json uses [["a","b"],...] format for merges instead of the ["a b",...] format. Support both. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	2a67653bf7	feat: handle nested text_config and language_model weight prefix. Supports both multimodal (Gemma3ForConditionalGeneration) and text-only configs. Resolves weights with language_model. prefix fallback. Computes head_dim from hidden_size when missing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	9ae86017f4	chore: target macOS 26.0, fix duplicate -lstdc++ linker warning. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	e9d9a3c3a0	fix: remove unused vars in TopP sampler placeholder. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	a0f77960a1	fix: resolve CGo type conflict in error handler. Use pure C callback instead of //export to avoid const char* vs GoString type mismatch in cgo-generated headers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	5e2d941b4d	fix: correct 20 mlx-c API mismatches for v0.4.1. Use _axis/_axes variants for softmax, argmax, topk, sum, mean, squeeze, concatenate, argpartition; Fix size_t vs int for count parameters throughout; Fix int64_t strides in as_strided; Add mlx_optional_int + mode param to quantized_matmul; Use mlx_array_new() for null arrays (freqs, key, mask, sinks); Fix expand_dims to single-axis signature; Fix compile callback signature (size_t index). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	c6597691bb	fix: correct mlx_closure_new_func_payload signature for mlx-c v0.4.1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00
Claude	bc28aad526	feat: add native MLX backend for Apple Silicon inference (pkg/mlx). CGo wrapper for mlx-c providing zero-Python Metal GPU inference. Includes Gemma 3 model architecture, BPE tokenizer, KV cache, composable sampling, and OpenAI-compatible serve command. Build-tagged (darwin && arm64 && mlx) with stubs for cross-platform. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 05:53:52 +00:00

15 commits