- Fix remaining 187 pkg/ files referencing core/cli → core/go
- Move SDK library code from internal/cmd/sdk/ → pkg/sdk/ (new package)
- Create pkg/rag/helpers.go with convenience functions from internal/cmd/rag/
- Fix pkg/mcp/tools_rag.go to use pkg/rag instead of internal/cmd/rag
- Fix pkg/build/buildcmd/cmd_sdk.go and pkg/release/sdk.go to use pkg/sdk
- Remove all non-library content: main.go, internal/, cmd/, docker/,
scripts/, tasks/, tools/, .core/, .forgejo/, .woodpecker/, Taskfile.yml
- Run go mod tidy to trim unused dependencies
core/go is now a pure Go package suite (library only).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude <developers@lethean.io>
Reviewed-on: #3
The Go wrapper was tracking inter-array references via desc.inputs,
creating chains that kept all intermediate arrays alive across requests.
After 3-4 requests, Metal memory grew to 170GB+ and macOS killed the
process.
Fix: remove desc.inputs/numRefs entirely. MLX-C has its own internal
reference counting — when Go GC finalizes an Array wrapper, it calls
mlx_array_free which decrements the C-side refcount. If the C-side
count reaches 0, Metal memory is freed. Go GC + MLX-C refcounting
together handle all lifecycle management correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Go GC cannot see Metal/C memory pressure, so intermediate arrays from
each forward pass accumulated without bound, causing OOM kills after
3-4 requests. Fix: runtime.SetFinalizer on every Array releases C
handles when GC collects them, and runtime.GC() is forced every 4
tokens during generation. Also adds SetMemoryLimit(24GB) as a hard
Metal ceiling.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ClearCache() wrapping mlx_clear_cache
- Clear Metal allocator cache every 8 tokens during generation
- Set 16GB cache limit on backend init
- Prevents GPU memory from growing unbounded during inference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gemma 3 tokenizer.json uses [["a","b"],...] format for merges
instead of the ["a b",...] format. Support both.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Supports both multimodal (Gemma3ForConditionalGeneration) and
text-only configs. Resolves weights with language_model. prefix
fallback. Computes head_dim from hidden_size when missing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use pure C callback instead of //export to avoid const char* vs
GoString type mismatch in cgo-generated headers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use _axis/_axes variants for softmax, argmax, topk, sum, mean, squeeze,
concatenate, argpartition
- Fix size_t vs int for count parameters throughout
- Fix int64_t strides in as_strided
- Add mlx_optional_int + mode param to quantized_matmul
- Use mlx_array_new() for null arrays (freqs, key, mask, sinks)
- Fix expand_dims to single-axis signature
- Fix compile callback signature (size_t index)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CGo wrapper for mlx-c providing zero-Python Metal GPU inference.
Includes Gemma 3 model architecture, BPE tokenizer, KV cache,
composable sampling, and OpenAI-compatible serve command.
Build-tagged (darwin && arm64 && mlx) with stubs for cross-platform.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>