docs: archive completed backend-abstraction and batch-inference plans

Move both plans to docs/plans/completed/ with summaries. MLX backend
implements shared interfaces and batch inference at 5K sentences/sec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-24 13:51:21 +00:00

Batch Inference — Completion Summary

Completed: 19 February 2026
Module: forge.lthn.ai/core/go-mlx
Status: Complete (5K sentences/sec classification, integrated with go-i18n)

What Was Built

Added batch inference capabilities to the MLX backend for high-throughput classification and generation.

Components

Classify() — prefill-only mode for single-token classification (domain labelling). No autoregressive generation needed.
BatchGenerate() — autoregressive batch generation with attention masking for padded sequences in variable-length batches.
Attention masking — correct handling of padded batches so shorter sequences don't attend to padding tokens.
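The padding and masking described above can be sketched in plain Go. This is a minimal illustration, not the go-mlx implementation: `PadID`, `PadBatch`, `AttentionBias`, and `LastRealIndex` are hypothetical names, right-padding is an assumption, and the real backend presumably builds the mask as an MLX array rather than Go slices.

```go
package main

import "math"

// PadID is a hypothetical padding token ID; real tokenizers define their own.
const PadID = 0

// PadBatch right-pads each sequence to the longest length in the batch and
// returns the padded token IDs plus a mask that is true for real tokens.
func PadBatch(seqs [][]int32) (tokens [][]int32, mask [][]bool) {
	maxLen := 0
	for _, s := range seqs {
		if len(s) > maxLen {
			maxLen = len(s)
		}
	}
	tokens = make([][]int32, len(seqs))
	mask = make([][]bool, len(seqs))
	for i, s := range seqs {
		tokens[i] = make([]int32, maxLen)
		mask[i] = make([]bool, maxLen)
		for j := range tokens[i] {
			if j < len(s) {
				tokens[i][j] = s[j]
				mask[i][j] = true
			} else {
				tokens[i][j] = PadID
			}
		}
	}
	return tokens, mask
}

// AttentionBias builds the additive attention bias for one sequence:
// 0 where query position q may attend to key position k (k <= q and k is a
// real token), and -Inf elsewhere so softmax assigns zero weight.
func AttentionBias(mask []bool) [][]float32 {
	negInf := float32(math.Inf(-1))
	n := len(mask)
	bias := make([][]float32, n)
	for q := 0; q < n; q++ {
		bias[q] = make([]float32, n)
		for k := 0; k < n; k++ {
			if k > q || !mask[k] {
				bias[q][k] = negInf
			}
		}
	}
	return bias
}

// LastRealIndex returns the index of the final real token, or -1 if the
// sequence is all padding. A prefill-only classifier would read the logits
// at this position instead of generating further tokens.
func LastRealIndex(mask []bool) int {
	for i := len(mask) - 1; i >= 0; i-- {
		if mask[i] {
			return i
		}
	}
	return -1
}
```

For a batch such as `[][]int32{{5, 6, 7}, {8}}`, the second row is padded to length 3, its mask marks only position 0 as real, and `LastRealIndex` points a classifier at that position rather than at the padding.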

Performance

5,000 sentences/sec for classification on M3 Ultra (prefill-only)
Native Metal execution via Go→CGo→mlx-c pipeline

Integration

Used by the go-i18n 1B Pre-Sort Pipeline (Phase 2a) to batch-classify 88K seeds through Gemma3-1B at 80 prompts/sec. That rate is constrained by prompt construction, not inference.
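To put the two throughput figures side by side, the arithmetic can be sketched as below. This is a rough estimate only: `BatchDuration` is a hypothetical helper, 88K is treated as exactly 88,000, and the 5,000/sec figure is quoted in sentences rather than prompts, so the second number is an idealised bound and not a measured result.

```go
package main

import "time"

// BatchDuration estimates the wall-clock time needed to process n items
// at a steady rate of perSec items per second.
func BatchDuration(n int, perSec float64) time.Duration {
	return time.Duration(float64(n) / perSec * float64(time.Second))
}
```

Under these assumptions, `BatchDuration(88000, 80)` comes to 1,100 seconds (a little over 18 minutes), whereas `BatchDuration(88000, 5000)` is roughly 17.6 seconds, which is consistent with prompt construction being the limiting step in the pipeline.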
