docs: archive completed backend-abstraction and batch-inference plans

Move both plans to docs/plans/completed/ with summaries. MLX backend
implements shared interfaces and batch inference at 5K sentences/sec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
c0f07478c8
commit
421d0c42ff
5 changed files with 57 additions and 0 deletions
docs/plans/completed/backend-abstraction.md (new file, 27 lines)
@@ -0,0 +1,27 @@
# Backend Abstraction — Completion Summary

**Completed:** 19 February 2026
**Module:** `forge.lthn.ai/core/go-mlx`
**Status:** Complete — shared go-inference interfaces, Metal auto-registration
## What Was Built

Migrated go-mlx to implement the shared `go-inference` interfaces so it
plugs into the unified ML backend system alongside the HTTP and Llama backends.
### Key changes

- `InferenceAdapter` implements the `inference.Backend` interface
- Metal backend auto-registers via `init()` when CGo is available
- `Result` struct carries text plus `Metrics` (tokens, latency, tokens/sec)
- Model loading, tokenization, and generation all sit behind interface methods
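The key changes follow a standard Go registry pattern: the adapter satisfies the shared interface and adds itself at package-init time. A minimal, self-contained sketch of that pattern (the names beyond `InferenceAdapter`, `Result`, `Metrics`, and `Backend` are assumptions; the real interface lives in go-inference and also covers model loading and tokenization):

```go
package main

import "fmt"

// Metrics mirrors the fields named in the summary (tokens, latency, tokens/sec).
type Metrics struct {
	Tokens       int
	LatencySec   float64
	TokensPerSec float64
}

// Result carries generated text plus its Metrics.
type Result struct {
	Text    string
	Metrics Metrics
}

// Backend is a stand-in for the shared inference.Backend interface.
type Backend interface {
	Name() string
	Generate(prompt string) (Result, error)
}

// registry holds all available backends; each backend registers itself in init().
var registry = map[string]Backend{}

func Register(b Backend) { registry[b.Name()] = b }

// InferenceAdapter stands in for the go-mlx adapter. The real adapter only
// registers when CGo (and therefore Metal) is available, via build tags.
type InferenceAdapter struct{}

func (InferenceAdapter) Name() string { return "mlx" }

func (InferenceAdapter) Generate(prompt string) (Result, error) {
	// The real implementation calls into Metal through CGo; stubbed here.
	return Result{Text: "ok", Metrics: Metrics{Tokens: 1}}, nil
}

// init runs at package load, so importing the backend is enough to register it.
func init() { Register(InferenceAdapter{}) }

func main() {
	r, err := registry["mlx"].Generate("hello")
	fmt.Println(registry["mlx"].Name(), r.Text, err)
}
```

Registering in `init()` means callers select a backend by name without importing its package directly anywhere but a blank import.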
### Architecture

```
go-ml (orchestrator)
  → go-inference (interfaces)
      → go-mlx (Metal/MLX backend, auto-registered)
      → llama (llama.cpp backend)
      → http (Ollama/OpenAI backend)
```
docs/plans/completed/batch-inference.md (new file, 30 lines)
@@ -0,0 +1,30 @@
# Batch Inference — Completion Summary

**Completed:** 19 February 2026
**Module:** `forge.lthn.ai/core/go-mlx`
**Status:** Complete — 5K sentences/sec classification, integrated with go-i18n
## What Was Built

Added batch inference capabilities to the MLX backend for high-throughput
classification and generation.
### Components

- **`Classify()`** — prefill-only mode for single-token classification
  (domain labelling). No autoregressive generation needed.
- **`BatchGenerate()`** — autoregressive batch generation with attention
  masking for padded sequences in variable-length batches.
- **Attention masking** — correct handling of padded batches so shorter
  sequences don't attend to padding tokens.
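The padding rule in the last bullet can be sketched without any MLX calls: right-pad every sequence to the batch maximum, then build a per-row mask that marks real tokens `true` and padding `false`. The function name and boolean layout are illustrative; the real mask is a tensor handed to mlx-c, not a Go slice.

```go
package main

import "fmt"

// buildMask returns a [batch][maxLen] mask for right-padded sequences.
// lengths holds each sequence's true (unpadded) token count.
func buildMask(lengths []int) [][]bool {
	maxLen := 0
	for _, n := range lengths {
		if n > maxLen {
			maxLen = n
		}
	}
	mask := make([][]bool, len(lengths))
	for i, n := range lengths {
		row := make([]bool, maxLen)
		for j := 0; j < n; j++ {
			row[j] = true // real token: may be attended to
		}
		// trailing false entries are padding: never attended to
		mask[i] = row
	}
	return mask
}

func main() {
	// Three sequences of lengths 3, 1, 2 padded to length 3.
	fmt.Println(buildMask([]int{3, 1, 2}))
	// → [[true true true] [true false false] [true true false]]
}
```

In practice the boolean rows become additive mask values (0 for real tokens, a large negative number for padding) applied to attention scores before the softmax.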
### Performance

- 5,000 sentences/sec for classification on M3 Ultra (prefill-only)
- Native Metal execution via the Go→CGo→mlx-c pipeline
### Integration

Used by the go-i18n 1B Pre-Sort Pipeline (Phase 2a) to batch-classify 88K
seeds through Gemma3-1B at 80 prompts/sec (constrained by prompt
construction, not inference).
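At that prompt-bound rate the wall-clock time for the Phase 2a pass follows directly. A back-of-envelope check, assuming 88,000 seeds exactly:

```go
package main

import "fmt"

func main() {
	const seeds = 88_000.0 // approximate seed count from the pipeline
	const rate = 80.0      // prompts/sec, limited by prompt construction
	secs := seeds / rate
	fmt.Printf("%.0f s (~%.0f min)\n", secs, secs/60)
	// → 1100 s (~18 min)
}
```

If prompt construction were removed as the bottleneck, the same batch at the 5K sentences/sec classification rate would finish in well under a minute.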