Add deterministic memory cleanup across inference paths:

- Detach logits after Eval to release graph references
- Free intermediate arrays in attention (gemma3, qwen3)
- Add cache Detach helper for KV cache cleanup after generation
- New detach.cpp/go CGO bindings for mlx_array_detach

Reduces 4B model memory from 78GB to ~17GB (vs 2.4GB mlx-lm baseline).
Native Metal memory management still trails Python refcounting but is
now viable for 1B models.

Co-Authored-By: Virgil <virgil@lethean.io>
8 lines
188 B
C++
#include "mlx/mlx.h"
#include "mlx/c/array.h"

extern "C" void mlx_array_detach_impl(mlx_array arr) {
    if (arr.ctx) {
        static_cast<mlx::core::array*>(arr.ctx)->detach();
    }
}
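
The commit message says that detaching an evaluated array releases its graph references. The sketch below models that ownership pattern in isolation. It is not MLX's implementation: `Node`, `Counts`, and `demo_detach` are hypothetical names invented for illustration, and the sketch assumes that a graph node keeps its producers alive through shared pointers.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a lazily evaluated array node (not MLX's class).
struct Node {
    std::vector<std::shared_ptr<Node>> inputs;  // graph edges keep producers alive
    std::vector<float> data;                    // materialized result after "eval"

    // Drop references to producers once the value has been materialized.
    void detach() { inputs.clear(); }
};

// Owner counts of the upstream node before and after detaching its consumer.
struct Counts {
    long before;
    long after;
};

Counts demo_detach() {
    auto upstream = std::make_shared<Node>();
    upstream->data.assign(1024, 1.0f);  // pretend this is a large intermediate

    auto logits = std::make_shared<Node>();
    logits->inputs.push_back(upstream);  // logits depends on upstream
    logits->data.assign(16, 0.5f);       // pretend logits has been evaluated

    Counts c{};
    c.before = upstream.use_count();  // local handle plus the edge in logits->inputs
    logits->detach();                 // sever the graph edge
    c.after = upstream.use_count();   // only the local handle remains
    return c;
}
```

In this model, once the local `upstream` handle also goes out of scope, the intermediate buffer is freed immediately rather than living as long as `logits`. That is the kind of deterministic release the commit aims for.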