docs: archive completed backend-abstraction and batch-inference plans

Move both plans to docs/plans/completed/ with summaries. MLX backend
implements shared interfaces and batch inference at 5K sentences/sec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude 2026-02-24 13:51:21 +00:00
parent c0f07478c8
commit 421d0c42ff
5 changed files with 57 additions and 0 deletions


@@ -0,0 +1,27 @@
# Backend Abstraction — Completion Summary
**Completed:** 19 February 2026
**Module:** `forge.lthn.ai/core/go-mlx`
**Status:** Complete — shared go-inference interfaces, Metal auto-registration
## What Was Built
Migrated go-mlx to implement shared `go-inference` interfaces so it
plugs into the unified ML backend system alongside HTTP and Llama backends.
### Key changes
- `InferenceAdapter` implements `inference.Backend` interface
- Metal backend auto-registers via `init()` when CGo is available
- `Result` struct carries text + `Metrics` (tokens, latency, tokens/sec)
- Model loading, tokenization, and generation all behind interface methods
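A minimal sketch of the interface shape the list above describes, assuming hypothetical names; the actual types in `forge.lthn.ai/core/go-inference` may differ:

```go
package main

import "fmt"

// Metrics mirrors the per-request numbers the Result carries.
type Metrics struct {
	PromptTokens    int
	GeneratedTokens int
	LatencyMS       float64
	TokensPerSec    float64
}

// Result carries generated text plus its metrics.
type Result struct {
	Text    string
	Metrics Metrics
}

// Backend is a stand-in for the shared inference.Backend interface.
type Backend interface {
	LoadModel(path string) error
	Generate(prompt string) (Result, error)
}

// InferenceAdapter would wrap the CGo MLX calls behind the interface;
// here it is a stub that only demonstrates the shape.
type InferenceAdapter struct{ model string }

func (a *InferenceAdapter) LoadModel(path string) error {
	a.model = path
	return nil
}

func (a *InferenceAdapter) Generate(prompt string) (Result, error) {
	// Stub: byte length stands in for a real token count.
	return Result{Text: "stub", Metrics: Metrics{PromptTokens: len(prompt)}}, nil
}

func main() {
	var b Backend = &InferenceAdapter{}
	_ = b.LoadModel("gemma3-1b")
	r, _ := b.Generate("hello")
	fmt.Println(r.Metrics.PromptTokens)
}
```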
### Architecture
```
go-ml (orchestrator)
→ go-inference (interfaces)
→ go-mlx (Metal/MLX backend, auto-registered)
→ llama (llama.cpp backend)
→ http (Ollama/OpenAI backend)
```
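The `init()`-based auto-registration mentioned above can be sketched with a simple registry; the registry names here are assumptions, and in go-mlx the `init()` would sit behind a CGo/darwin build tag so the Metal backend registers only when it can actually run:

```go
package main

import "fmt"

// backends is a stand-in for the shared backend registry.
var backends = map[string]func() string{}

// Register adds a named backend factory to the registry.
func Register(name string, factory func() string) {
	backends[name] = factory
}

// init runs at package load, so merely importing the backend
// package makes "metal" available to the orchestrator.
func init() {
	Register("metal", func() string { return "mlx" })
}

func main() {
	f, ok := backends["metal"]
	fmt.Println(ok, f())
}
```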


@@ -0,0 +1,30 @@
# Batch Inference — Completion Summary
**Completed:** 19 February 2026
**Module:** `forge.lthn.ai/core/go-mlx`
**Status:** Complete — 5K sentences/sec classification, integrated with go-i18n
## What Was Built
Added batch inference capabilities to the MLX backend for high-throughput
classification and generation.
### Components
- **`Classify()`** — prefill-only mode for single-token classification
(domain labelling). No autoregressive generation needed.
- **`BatchGenerate()`** — autoregressive batch generation with attention
masking for padded sequences in variable-length batches.
- **Attention masking** — correct handling of padded batches so shorter
sequences don't attend to padding tokens.
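The padding-aware masking described above can be sketched as follows: for each right-padded sequence in a batch, positions beyond its true length are marked false so attention ignores them. Shapes and names are illustrative, not the go-mlx API:

```go
package main

import "fmt"

// paddingMask builds a per-sequence boolean mask for a batch padded
// to maxLen: true marks a real token, false marks padding that
// attention must ignore.
func paddingMask(lengths []int, maxLen int) [][]bool {
	mask := make([][]bool, len(lengths))
	for i, n := range lengths {
		row := make([]bool, maxLen)
		for j := 0; j < n && j < maxLen; j++ {
			row[j] = true // real token: attend
		}
		mask[i] = row // trailing entries stay false: padding
	}
	return mask
}

func main() {
	// A batch of two sequences, lengths 3 and 5, padded to 5.
	m := paddingMask([]int{3, 5}, 5)
	fmt.Println(m[0]) // [true true true false false]
}
```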
### Performance
- 5,000 sentences/sec for classification on M3 Ultra (prefill-only)
- Native Metal execution via Go→CGo→mlx-c pipeline
### Integration
Used by the go-i18n 1B Pre-Sort Pipeline (Phase 2a) to batch-classify 88K
seeds through Gemma3-1B at 80 prompts/sec (throughput limited by prompt
construction, not inference).
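A chunked batch-classification driver in the spirit of that pipeline can be sketched like this; the batch size and the `classify` callback are illustrative assumptions, not the go-i18n API:

```go
package main

import "fmt"

// classifyAll feeds items through classify in fixed-size chunks, so
// each chunk becomes one batched call into the backend.
func classifyAll(items []string, batchSize int, classify func([]string) []string) []string {
	out := make([]string, 0, len(items))
	for i := 0; i < len(items); i += batchSize {
		end := i + batchSize
		if end > len(items) {
			end = len(items)
		}
		out = append(out, classify(items[i:end])...)
	}
	return out
}

func main() {
	// Stub classifier: labels everything "general".
	labels := classifyAll([]string{"a", "b", "c"}, 2, func(batch []string) []string {
		res := make([]string, len(batch))
		for i := range batch {
			res[i] = "general"
		}
		return res
	})
	fmt.Println(len(labels))
}
```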