# Batch Inference — Completion Summary

**Completed:** 19 February 2026

**Module:** `forge.lthn.ai/core/go-mlx`

**Status:** Complete — 5K sentences/sec classification, integrated with go-i18n

## What Was Built

Added batch inference capabilities to the MLX backend for high-throughput
classification and generation.

### Components

- **`Classify()`** — prefill-only mode for single-token classification
  (domain labelling). No autoregressive generation needed.
- **`BatchGenerate()`** — autoregressive batch generation with attention
  masking for padded sequences in variable-length batches.
- **Attention masking** — correct handling of padded batches so shorter
  sequences don't attend to padding tokens.

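The padding-mask idea can be sketched as follows. This is an illustrative helper, not the go-mlx API: `buildMask` is a hypothetical name, and the real implementation builds the mask on-device as an MLX array rather than a Go slice.

```go
package main

import "fmt"

// buildMask returns a [batch][maxLen] additive attention mask for
// right-padded sequences: 0 for real tokens, a large negative value for
// padding positions, so softmax gives padding ~zero attention weight.
// Hypothetical sketch; go-mlx constructs the equivalent mask on-device.
func buildMask(lengths []int, maxLen int) [][]float32 {
	const negInf = float32(-1e9)
	mask := make([][]float32, len(lengths))
	for i, n := range lengths {
		row := make([]float32, maxLen)
		for j := n; j < maxLen; j++ {
			row[j] = negInf
		}
		mask[i] = row
	}
	return mask
}

func main() {
	m := buildMask([]int{3, 5}, 5)
	fmt.Println(m[0]) // length-3 sequence: last two positions masked
	fmt.Println(m[1]) // length-5 sequence: nothing masked
}
```

An additive mask (added to attention logits before softmax) is the standard way to stop shorter sequences attending to padding in a variable-length batch.
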
### Performance

- 5,000 sentences/sec for classification on M3 Ultra (prefill-only)
- Native Metal execution via Go → cgo → mlx-c pipeline

### Integration

Used by the go-i18n 1B Pre-Sort Pipeline (Phase 2a) to batch-classify 88K
seeds through Gemma3-1B at 80 prompts/sec (constrained by prompt
construction, not inference).
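
The batching pattern the pipeline relies on can be sketched as below. The helper name and batch size are illustrative assumptions, not the pipeline's actual settings:

```go
package main

import "fmt"

// chunk splits items into batches of at most n — the pattern used to
// feed large seed sets to a batch classifier in fixed-size groups.
// Batch size 4 in main() is illustrative only.
func chunk(items []string, n int) [][]string {
	var batches [][]string
	for len(items) > 0 {
		k := n
		if len(items) < k {
			k = len(items)
		}
		batches = append(batches, items[:k])
		items = items[k:]
	}
	return batches
}

func main() {
	seeds := make([]string, 10)
	for i := range seeds {
		seeds[i] = fmt.Sprintf("seed-%d", i)
	}
	fmt.Println(len(chunk(seeds, 4))) // 3 batches: 4 + 4 + 2
}
```

At 80 prompts/sec, 88K seeds finish in roughly 18 minutes, consistent with the pipeline being bound by prompt construction rather than inference.
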