go-mlx/internal/metal/detach.cpp at 2edb45e2c5ab3744b6808f4971bc8971f11cfc6d - core/go-mlx - Lethean Network

Snider 71fe4bb5ac fix: add Detach/Free calls to reduce Metal GPU memory retention

Add deterministic memory cleanup across inference paths:
- Detach logits after Eval to release graph references
- Free intermediate arrays in attention (gemma3, qwen3)
- Add cache Detach helper for KV cache cleanup after generation
- New detach.cpp/go CGO bindings for mlx_array_detach

Reduces 4B model memory from 78GB to ~17GB (vs 2.4GB mlx-lm baseline).
Native Metal memory management still trails Python refcounting but is
now viable for 1B models.

Co-Authored-By: Virgil <virgil@lethean.io>

2026-02-26 05:14:09 +00:00

8 lines

188 B

C++

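#include "mlx/mlx.h"
#include "mlx/c/array.h"

// C-linkage shim so the Go side can call it through CGO.
// Detaches the array from its computation graph so the graph's
// intermediate buffers are no longer kept alive through it.
extern "C" void mlx_array_detach_impl(mlx_array arr) {
    // arr.ctx holds the underlying mlx::core::array; skip null handles.
    if (arr.ctx) {
        static_cast<mlx::core::array*>(arr.ctx)->detach();
    }
}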
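
The commit message says detaching logits after Eval releases graph references. The sketch below is a toy model of that idea, not MLX's implementation: `Node`, `build_chain`, and `detach_frees_intermediate` are hypothetical names, and `std::shared_ptr` reference counts stand in for whatever ownership MLX uses internally. It only illustrates why an output that still points at its inputs keeps intermediate buffers alive, and why dropping those edges lets them be freed.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-in for a lazily evaluated array node. NOT the real mlx::core::array.
struct Node {
    std::vector<std::shared_ptr<Node>> inputs;  // edges to the nodes this one was computed from
    std::vector<float> data;                    // evaluated buffer (stands in for GPU memory)

    // Drop the edges to the inputs; the node keeps its own evaluated data.
    void detach() { inputs.clear(); }
};

// Builds input -> hidden -> logits and returns logits.
// hidden_out receives a weak handle so callers can observe when hidden is freed.
std::shared_ptr<Node> build_chain(std::weak_ptr<Node>& hidden_out) {
    auto input = std::make_shared<Node>();
    input->data.assign(4, 1.0f);

    auto hidden = std::make_shared<Node>();
    hidden->inputs.push_back(input);
    hidden->data.assign(1024, 0.5f);  // the large intermediate we would like released

    auto logits = std::make_shared<Node>();
    logits->inputs.push_back(hidden);
    logits->data.assign(8, 0.25f);

    hidden_out = hidden;
    return logits;  // the only owner of hidden is now logits->inputs
}

// Returns true when detaching the output frees the intermediate node
// while the output keeps its own data.
bool detach_frees_intermediate() {
    std::weak_ptr<Node> hidden;
    auto logits = build_chain(hidden);
    assert(!hidden.expired());  // still reachable through logits->inputs
    logits->detach();
    return hidden.expired() && logits->data.size() == 8;
}
```

Calling `detach_frees_intermediate()` from a `main` exercises the whole path. In this toy model the intermediate is released as soon as the last edge to it is cleared. How promptly the real Metal allocator returns memory is a separate matter that the sketch does not cover.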