# Batch Inference — Completion Summary

**Completed:** 19 February 2026

**Module:** `forge.lthn.ai/core/go-mlx`

**Status:** Complete — 5K sentences/sec classification, integrated with go-i18n

## What Was Built

Added batch inference capabilities to the MLX backend for high-throughput
classification and generation.

### Components

- **`Classify()`** — prefill-only mode for single-token classification
  (domain labelling). No autoregressive generation needed.
- **`BatchGenerate()`** — autoregressive batch generation with attention
  masking for padded sequences in variable-length batches.
- **Attention masking** — correct handling of padded batches so shorter
  sequences don't attend to padding tokens.

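The padding-mask idea can be sketched as follows. This is an illustrative helper, not the go-mlx API: `buildMask` is a hypothetical name, and the real implementation builds the mask on-device as an MLX array rather than a Go slice.

```go
package main

import "fmt"

// buildMask returns a [batch][maxLen] additive attention mask for
// right-padded sequences: 0 for real tokens, a large negative value for
// padding positions, so softmax gives padding ~zero attention weight.
// Hypothetical sketch; go-mlx constructs the equivalent mask on-device.
func buildMask(lengths []int, maxLen int) [][]float32 {
	const negInf = float32(-1e9)
	mask := make([][]float32, len(lengths))
	for i, n := range lengths {
		row := make([]float32, maxLen)
		for j := n; j < maxLen; j++ {
			row[j] = negInf
		}
		mask[i] = row
	}
	return mask
}

func main() {
	m := buildMask([]int{3, 5}, 5)
	fmt.Println(m[0]) // length-3 sequence: last two positions masked
	fmt.Println(m[1]) // length-5 sequence: nothing masked
}
```

An additive mask (added to attention logits before softmax) is the standard way to stop shorter sequences attending to padding in a variable-length batch.
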
### Performance

- 5,000 sentences/sec for classification on M3 Ultra (prefill-only)
- Native Metal execution via Go → cgo → mlx-c pipeline

### Integration

Used by the go-i18n 1B Pre-Sort Pipeline (Phase 2a) to batch-classify 88K
seeds through Gemma3-1B at 80 prompts/sec (constrained by prompt
construction, not inference).
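
The batching pattern the pipeline relies on can be sketched as below. The helper name and batch size are illustrative assumptions, not the pipeline's actual settings:

```go
package main

import "fmt"

// chunk splits items into batches of at most n — the pattern used to
// feed large seed sets to a batch classifier in fixed-size groups.
// Batch size 4 in main() is illustrative only.
func chunk(items []string, n int) [][]string {
	var batches [][]string
	for len(items) > 0 {
		k := n
		if len(items) < k {
			k = len(items)
		}
		batches = append(batches, items[:k])
		items = items[k:]
	}
	return batches
}

func main() {
	seeds := make([]string, 10)
	for i := range seeds {
		seeds[i] = fmt.Sprintf("seed-%d", i)
	}
	fmt.Println(len(chunk(seeds, 4))) // 3 batches: 4 + 4 + 2
}
```

At 80 prompts/sec, 88K seeds finish in roughly 18 minutes, consistent with the pipeline being bound by prompt construction rather than inference.
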