From 421d0c42ffe68c6ebad720e86aa3f3af61d03c46 Mon Sep 17 00:00:00 2001
From: Claude
Date: Tue, 24 Feb 2026 13:51:21 +0000
Subject: [PATCH] docs: archive completed backend-abstraction and
 batch-inference plans

Move both plans to docs/plans/completed/ with summaries. MLX backend
implements shared interfaces and batch inference at 5K sentences/sec.

Co-Authored-By: Claude Opus 4.6
---
 ...19-backend-abstraction-design-original.md} |  0
 ...2-19-backend-abstraction-plan-original.md} |  0
 ...-02-19-batch-inference-design-original.md} |  0
 docs/plans/completed/backend-abstraction.md   | 27 +++++++++++++++++
 docs/plans/completed/batch-inference.md       | 30 +++++++++++++++++++
 5 files changed, 57 insertions(+)
 rename docs/plans/{2026-02-19-backend-abstraction-design.md => completed/2026-02-19-backend-abstraction-design-original.md} (100%)
 rename docs/plans/{2026-02-19-backend-abstraction-plan.md => completed/2026-02-19-backend-abstraction-plan-original.md} (100%)
 rename docs/plans/{2026-02-19-batch-inference-design.md => completed/2026-02-19-batch-inference-design-original.md} (100%)
 create mode 100644 docs/plans/completed/backend-abstraction.md
 create mode 100644 docs/plans/completed/batch-inference.md

diff --git a/docs/plans/2026-02-19-backend-abstraction-design.md b/docs/plans/completed/2026-02-19-backend-abstraction-design-original.md
similarity index 100%
rename from docs/plans/2026-02-19-backend-abstraction-design.md
rename to docs/plans/completed/2026-02-19-backend-abstraction-design-original.md
diff --git a/docs/plans/2026-02-19-backend-abstraction-plan.md b/docs/plans/completed/2026-02-19-backend-abstraction-plan-original.md
similarity index 100%
rename from docs/plans/2026-02-19-backend-abstraction-plan.md
rename to docs/plans/completed/2026-02-19-backend-abstraction-plan-original.md
diff --git a/docs/plans/2026-02-19-batch-inference-design.md b/docs/plans/completed/2026-02-19-batch-inference-design-original.md
similarity index 100%
rename from docs/plans/2026-02-19-batch-inference-design.md
rename to docs/plans/completed/2026-02-19-batch-inference-design-original.md
diff --git a/docs/plans/completed/backend-abstraction.md b/docs/plans/completed/backend-abstraction.md
new file mode 100644
index 0000000..a5228a7
--- /dev/null
+++ b/docs/plans/completed/backend-abstraction.md
@@ -0,0 +1,27 @@
+# Backend Abstraction — Completion Summary
+
+**Completed:** 19 February 2026
+**Module:** `forge.lthn.ai/core/go-mlx`
+**Status:** Complete — shared go-inference interfaces, Metal auto-registration
+
+## What Was Built
+
+Migrated go-mlx to implement shared `go-inference` interfaces so it
+plugs into the unified ML backend system alongside HTTP and Llama backends.
+
+### Key changes
+
+- `InferenceAdapter` implements `inference.Backend` interface
+- Metal backend auto-registers via `init()` when CGo is available
+- `Result` struct carries text + `Metrics` (tokens, latency, tokens/sec)
+- Model loading, tokenization, and generation all behind interface methods
+
+### Architecture
+
+```
+go-ml (orchestrator)
+  → go-inference (interfaces)
+    → go-mlx (Metal/MLX backend, auto-registered)
+    → llama (llama.cpp backend)
+    → http (Ollama/OpenAI backend)
+```
diff --git a/docs/plans/completed/batch-inference.md b/docs/plans/completed/batch-inference.md
new file mode 100644
index 0000000..8b5989e
--- /dev/null
+++ b/docs/plans/completed/batch-inference.md
@@ -0,0 +1,30 @@
+# Batch Inference — Completion Summary
+
+**Completed:** 19 February 2026
+**Module:** `forge.lthn.ai/core/go-mlx`
+**Status:** Complete — 5K sentences/sec classification, integrated with go-i18n
+
+## What Was Built
+
+Added batch inference capabilities to the MLX backend for high-throughput
+classification and generation.
+
+### Components
+
+- **`Classify()`** — prefill-only mode for single-token classification
+  (domain labelling). No autoregressive generation needed.
+- **`BatchGenerate()`** — autoregressive batch generation with attention
+  masking for padded sequences in variable-length batches.
+- **Attention masking** — correct handling of padded batches so shorter
+  sequences don't attend to padding tokens.
+
+### Performance
+
+- 5,000 sentences/sec for classification on M3 Ultra (prefill-only)
+- Native Metal execution via Go→CGo→mlx-c pipeline
+
+### Integration
+
+Used by go-i18n 1B Pre-Sort Pipeline (Phase 2a) to batch-classify 88K
+seeds through Gemma3-1B at 80 prompts/sec (constrained by prompt
+construction, not inference).
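
The attention-masking component that batch-inference.md describes can be sketched in Go as follows. This is a minimal illustration of the padding-mask idea only; `buildPadMask` and its signature are assumptions made for this sketch, not the go-mlx API, which is not shown in this patch.

```go
package main

import "fmt"

// buildPadMask returns, for each sequence in a padded batch, a row with
// 1 for real tokens and 0 for padding positions. During attention, the
// zero positions would be set to -inf before the softmax so that shorter
// sequences never attend to padding tokens.
// (Illustrative sketch; not the actual go-mlx implementation.)
func buildPadMask(lengths []int, maxLen int) [][]int {
	masks := make([][]int, len(lengths))
	for i, n := range lengths {
		row := make([]int, maxLen)
		for j := 0; j < n && j < maxLen; j++ {
			row[j] = 1
		}
		masks[i] = row
	}
	return masks
}

func main() {
	// Two sequences of lengths 3 and 5, padded to a batch width of 5.
	fmt.Println(buildPadMask([]int{3, 5}, 5))
	// → [[1 1 1 0 0] [1 1 1 1 1]]
}
```

For autoregressive batch generation this padding mask would be combined with the usual causal mask, so each position attends only to earlier, non-padding tokens in its own sequence.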