From 5fb98dcedd9cc4039255e857354bce7e76ff9253 Mon Sep 17 00:00:00 2001 From: Snider Date: Fri, 20 Feb 2026 15:01:55 +0000 Subject: [PATCH] docs: graduate TODO/FINDINGS into production documentation Replace internal task tracking (TODO.md, FINDINGS.md) with structured documentation in docs/. Trim CLAUDE.md to agent instructions only. Co-Authored-By: Virgil --- CLAUDE.md | 12 +- FINDINGS.md | 316 --------------------------------------- TODO.md | 58 -------- docs/architecture.md | 342 +++++++++++++++++++++++++++++++++++++++++++ docs/development.md | 192 ++++++++++++++++++++++++ docs/history.md | 238 ++++++++++++++++++++++++++++++ 6 files changed, 776 insertions(+), 382 deletions(-) delete mode 100644 FINDINGS.md delete mode 100644 TODO.md create mode 100644 docs/architecture.md create mode 100644 docs/development.md create mode 100644 docs/history.md diff --git a/CLAUDE.md b/CLAUDE.md index 78def92..fb3412d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -11,7 +11,7 @@ This is a **grammar engine** — it provides primitives for composing and revers ```bash go test ./... # Run all tests go test -v ./reversal/ # Reversal engine tests -go test -bench=. ./... # Benchmarks (when added) +go test -bench=. ./... # Benchmarks ``` ## Critical Rules @@ -43,7 +43,9 @@ go-i18n provides grammar primitives. Apps using it (core/cli, etc.) manage their | Root | Forward composition: T(), grammar primitives, handlers, service | | `reversal/` | Reverse grammar: tokeniser, imprint, multiplier | | `locales/` | Grammar tables (JSON) — only `gram.*` data | -| `docs/plans/` | Design documents | +| `docs/` | Architecture, development, history, grammar-table-spec, plans | + +See `docs/architecture.md` for full technical detail. ## Key API @@ -59,9 +61,3 @@ go-i18n provides grammar primitives. Apps using it (core/cli, etc.) 
manage their - `go test ./...` must pass before commit - Conventional commits: `type(scope): description` - Co-Author: `Co-Authored-By: Virgil ` - -## Task Queue - -See `TODO.md` for dispatched tasks from core/go orchestration. -See `FINDINGS.md` for research notes and architectural decisions. -See the [wiki](https://forge.lthn.ai/core/go-i18n/wiki) for full architecture docs. diff --git a/FINDINGS.md b/FINDINGS.md deleted file mode 100644 index 4d9a937..0000000 --- a/FINDINGS.md +++ /dev/null @@ -1,316 +0,0 @@ -# FINDINGS.md — go-i18n Research & Discovery - -Record findings, gaps, and architectural decisions here as work progresses. - ---- - -## 2026-02-19: Initial Assessment (Virgil) - -### Current State - -- 5,800 lines across 32 files (14 test files) -- All tests pass -- Only dependency: `golang.org/x/text` -- Grammar engine is solid: forward composition + reversal + imprint + multiplier - -### Architecture - -go-i18n is a **grammar engine**, not a translation file manager. Consumers bring their own translations. The library provides: - -1. **Forward composition** — Grammar primitives that compose grammatically correct text -2. **Reverse grammar** — Tokeniser reads grammar tables backwards to extract structure -3. **GrammarImprint** — Lossy feature vector projection (content to grammar fingerprint) -4. 
**Multiplier** — Deterministic training data augmentation (no LLM calls) - -### Key Gaps Identified - -| Gap | Impact | Notes | -|-----|--------|-------| -| No CLAUDE.md | High | Agents don't know the rules, will flatten locale files | -| Dual-class word ambiguity | Medium | "file" as verb vs noun, "run" as verb vs noun | -| No benchmarks | Medium | No perf baselines for hot-path usage (TIM, Poindexter) | -| No reference distributions | High | Can't calibrate imprints without scored seed data | -| Only English grammar tables | Medium | Reversal only works with loaded GrammarData | - -### Sacred Rules - -- `gram.*` keys in locale JSON MUST remain nested — flattening breaks the grammar engine -- Irregular forms in grammar tables override regular morphological rules -- Round-trip property must hold: forward(base) then reverse must recover base -- go-i18n does NOT ship or manage consumer translation files - ---- - -## 2026-02-19: LEK-Gemma3-1B-v2 Benchmark (Virgil) - -Tested the fine-tuned 1B model (`/Volumes/Data/lem/LEM-Gemma3-1B-layered-v2`) across three progressively tighter evaluation rounds to find where it provides real value for the grammar pipeline. - -### Round 1: Practical Dev Tasks (5 tasks, temp=0.3, max_tokens=512) - -Open-ended dev work — bug spotting, commit messages, Go functions, grammar tables, code review. Results: mostly misses. The model hallucinates APIs, generates pad-token degeneration on longer output, and can't reliably write code. Not useful for generative tasks. - -### Round 2: Narrow Constrained Tasks (24 tasks, temp=0.1, avg 0.19s/task) - -Tighter format — one-word answers, forced categories, fill-in-the-blank. 
- -| Category | Score | Notes | -|----------|-------|-------| -| Domain classification | 3/4 (75%) | Promising — called technical, creative, ethical, casual correctly | -| Article selection | 2/3 (67%) | Got "an API" and "an SSH" right, missed "a URL" | -| Tense detection | 2/4 (50%) | Weak on gerund vs base form | -| Plural detection | 2/3 (67%) | Got "matrices" and "datum", confused on "sheep" | -| Conjugation | Mixed | Some correct, many hallucinated forms | - -### Round 3: Tightest Format (27 tasks, temp=0.05, avg 0.18s/task) - -Binary T/F, forced A/B choice, single-word domain/tone classification. - -| Category | Score | Notes | -|----------|-------|-------| -| Domain classification | 6/8 (75%) | Consistent with Round 2 — this is the sweet spot | -| Article correctness T/F | 3/3 (100%) | Perfect on "an SSH", "a API"→false, "an URL"→false | -| Tone/sentiment | 2/3 (67%) | Got positive + negative, neutral confused | -| Irregular base forms A/B | 2/2 (100%) | "went"→go, "mice"→mouse — strong | -| True/False grammar | 4/8 (50%) | Strong false-bias — says "false" when unsure | -| Pattern fill | 0/4 (0%) | Echoes prompt or hallucinates — dead zone | - -### Key Finding: Domain Classification at Scale - -**Domain classification at 75% accuracy in 0.17s is genuinely useful.** At that speed, one M3 can pre-sort ~5,000 sentences/second across {technical, creative, ethical, casual}. For the 88K Phase 0 seed corpus, that's ~18 seconds to pre-tag everything. - -The technical↔creative confusion (calls some technical text "creative") is the main error pattern — likely fixable with targeted fine-tuning examples showing code/CLI commands vs literary prose. - -### Implications for Phase 2 - -1. **Pre-sort pipeline**: Run 1B domain classification before heavier GrammarImprint analysis. Pre-tagged seeds reduce imprint compute by letting us batch by domain. -2. **Calibration target**: 1B classifications can be spot-checked against Gemma3-27B classifications to measure drift. -3. 
**Article/irregular strength**: The 100% article correctness and irregular base form accuracy suggest these grammar features are well-learned. Worth testing as lightweight validators in the forward composition path. -4. **Dead zones to avoid**: Don't use 1B for pattern fill, tense detection, or generative tasks. These need the full 27B or rule-based approaches. - ---- - -## 2026-02-19: Dual-Class Word Disambiguation (Implementation) - -### What Was Built - -Two-pass probabilistic disambiguation for words that exist as both verbs and nouns in the grammar tables. Replaces the previous verb-first hard classification with context-aware scoring. - -### Dual-Class Set - -{commit, run, test, check, file, build} — all 6 words now appear in both `gram.verb` and `gram.noun` in en.json. - -### Algorithm: Two-Pass Tokenise - -**Pass 1** classifies unambiguous tokens. Inflected forms self-resolve (e.g. "committed" → verb, "commits" → noun). Base forms of dual-class words are marked as ambiguous. - -**Pass 2** evaluates 7 weighted signals to resolve ambiguous tokens: - -| Signal | Weight | Description | -|--------|--------|-------------| -| noun_determiner | 0.35 | Preceded by "the", "a", "my", etc. → noun | -| verb_auxiliary | 0.25 | Preceded by "will", "can", "don't", etc. → verb | -| following_class | 0.15 | Followed by noun → verb; followed by verb → noun | -| sentence_position | 0.10 | Sentence-initial → verb (imperative) | -| verb_saturation | 0.10 | Clause already has a confident verb → noun | -| inflection_echo | 0.03 | Inflected form of same word elsewhere → other role | -| default_prior | 0.02 | Verb-first tiebreaker | - -### Key Design Decisions - -- **Confidence scores** flow into imprints: dual-class tokens contribute to both verb and noun distributions weighted by Confidence and AltConf. This preserves uncertainty for scoring rather than forcing a hard classification. 
-- **Clause boundaries** for verb saturation: scans only within clause (delimited by punctuation and coordinating conjunctions "and", "or", "but"). Prevents multi-clause sentences from incorrectly pushing second verbs toward noun. -- **Confidence floor** (B3): when only the default prior fires (total < 0.10), confidence is capped at 0.55/0.45 rather than deriving a misleading 1.0 from `0.02/0.02`. -- **Contractions** (D1): 15 contractions added to verb_auxiliary signal list (don't, can't, won't, etc.). -- **Configurable weights** (F3): `WithWeights()` option allows overriding signal weights without code changes. -- **DisambiguationStats** (F1): `DisambiguationStatsFromTokens()` provides aggregate stats for Phase 2 calibration. -- **SignalBreakdown** opt-in: `WithSignals()` populates detailed per-token signal diagnostics. - -### Post-Implementation Cleanup (R1-R3) - -- **R1**: Removed "passed", "failed", "skipped" from `gram.noun` and `gram.word` — these are past participles, not nouns. Prevents future dual-class false positives when verb coverage expands. -- **R3**: `buildSignalIndex` now guards each signal list independently. Partial locale data falls back per-field rather than silently disabling signals for locales with incomplete `gram.signal` blocks. - -### Test Coverage - -- 9 disambiguation scenario tests (noun after determiner, verb imperative, verb saturation, clause boundary, contraction aux, etc.) 
-- 12 dual-class round-trip tests covering all 6 words in both roles -- Imprint convergence test (same-role similar, different-role divergent) -- DisambiguationStats tests in tokeniser_test.go (ambiguous and non-ambiguous inputs) -- WithWeights override test (zeroing noun_determiner flips classification) -- Race detector: clean - -### Expanded Dual-Class Candidates (Phase 2) - -Per REVIEW.md F4, additional candidates for future expansion: patch, release, update, change, merge, push, pull, tag, log, watch, link, host, import, export, process, function, handle, trigger, stream, queue. Measure which ones cause imprint drift in the 88K seeds before adding. - ---- - -## 2026-02-19: Irregular Verb Coverage Extension - -Added 44 irregular verbs to `irregularVerbs` map in `types.go`: - -- **17 compound irregular** (prefix + base): undo, redo, rerun, rewrite, rebuild, resend, override, rethink, remake, undergo, overcome, withdraw, uphold, withhold, outgrow, outrun, overshoot -- **22 simple irregular** (dev/ops): become, come, give, fall, understand, arise, bind, spin, quit, cast, broadcast, burst, cost, shed, rid, shrink, shoot, forbid, offset, upset, input, output -- **5 CVC doubling overrides**: debug, embed, unzip, remap, unpin, unwrap — these have stressed final syllable but `shouldDoubleConsonant()` returns false for words >4 chars - -Total irregular verb count: ~140 entries (from ~96). 
- ---- - -## 2026-02-19: Benchmark Baselines (M3 Ultra, arm64) - -### Forward Composition (`grammar_test.go`) - -| Benchmark | ns/op | allocs/op | B/op | -|-----------|-------|-----------|------| -| PastTense (irregular) | 25.67 | 0 | 0 | -| PastTense (regular) | 48.52 | 1 | 8 | -| PastTense (compound) | 26.15 | 0 | 0 | -| Gerund | 25.87 | 0 | 0 | -| Pluralize | 67.97 | 1 | 16 | -| Article | 177.4 | 0 | 0 | -| Progress | 107.1 | 2 | 24 | -| ActionResult | 115.3 | 3 | 48 | - -### Reversal Engine (`reversal/tokeniser_test.go`) - -| Benchmark | ns/op | allocs/op | B/op | -|-----------|-------|-----------|------| -| Tokenise (3 words) | 639 | 8 | 1600 | -| Tokenise (12 words) | 2859 | 14 | 7072 | -| Tokenise (dual-class) | 1657 | 9 | 3472 | -| Tokenise (WithSignals) | 2255 | 28 | 4680 | -| NewImprint | 648 | 10 | 1120 | -| Imprint.Similar | 516 | 0 | 0 | -| Multiplier.Expand | 3609 | 63 | 11400 | - -### Key Observations - -- **Forward composition is fast**: irregular verb lookup is ~26ns (map lookup), regular ~49ns (string manipulation). Both are hot-path safe. -- **Tokenise scales linearly**: ~200-240ns/word. 12-word sentence at 2.9µs means ~350K sentences/sec single-threaded. -- **Similar is zero-alloc**: 516ns with no heap allocation makes it suitable for high-volume imprint comparison. -- **Multiplier is allocation-heavy**: 63 allocs for a 4-word sentence. If this becomes a bottleneck, pool the Token slices. -- **WithSignals adds overhead**: ~36% more time and 3x allocs vs plain tokenise. Keep it opt-in for diagnostics only. - ---- - -## 2026-02-19: Classification Benchmark Results - -220 domain-tagged sentences (55/domain) classified via leave-one-out imprint similarity. 
- -| Domain | Accuracy | Token Coverage | Tense Signature | -|--------|----------|---------------|-----------------| -| Technical | 78.2% | 69.4% | base=46%, gerund=30%, past=24% | -| Creative | 81.8% | 46.5% | past=80%, gerund=16%, base=4% | -| Ethical | 45.5% | 34.0% | base=55%, past=25%, gerund=20% | -| Casual | 10.9% | 39.1% | past=70%, base=17%, gerund=14% | - -**Overall: 54.1%** (vs 25% random chance) - -### Confusion Axes - -1. **Ethical → Technical** (16/55 misclassified): Both domains use base-form verbs heavily (imperative vs prescriptive). Grammar features alone cannot distinguish "Delete the file" from "We should find a fair solution" — both register as base-form verb + noun patterns. - -2. **Casual → Creative** (39/55 misclassified): Both domains use past tense heavily (narrative vs anecdotal). "She wrote the story by candlelight" and "She made dinner for everyone" have identical grammar profiles. - -### Implication for Phase 2a - -Grammar-based classification is a strong first pass for technical (78%) and creative (82%). The 1B model is specifically needed to resolve: -- ethical vs technical — likely needs semantic understanding of modal/prescriptive framing -- casual vs creative — likely needs vocabulary complexity or formality signals - -### Dependency: go-inference + go-mlx (RESOLVED) - -~~Phase 2a tasks blocked on go-ai~~ — resolved via direct go-inference + go-mlx imports. Gemma3-1B inference validated. - ---- - -## 2026-02-20: 1B Pre-Sort Pipeline - -`ClassifyCorpus()` added to `classify.go`. Streaming JSONL → batch Classify → augmented JSONL with `domain_1b` field. - -### Integration Test Results (Gemma3-1B, M3 Ultra, 4-bit quantised) - -- 50 prompts classified in 625ms (80 prompts/sec) -- All 50 technical prompts correctly classified as "technical" -- Model load time: ~1s -- Batch size 8, single-token generation (WithMaxTokens(1)) - -### Throughput vs Target - -Target was 152 prompts/sec (from go-mlx benchmarks). 
Observed 80 prompts/sec with 50-prompt run. The difference is likely startup overhead amortisation — with 88K prompts the throughput should approach the benchmark figure as batch pipeline reaches steady state. Estimated 88K corpus processing time: ~15 minutes (vs 10 minute target). - -### Architecture - -- `ClassifyCorpus(ctx, model, input, output, opts...)` — caller manages model lifecycle -- `mapTokenToDomain(token)` — prefix-match on model output: tech→technical, cre→creative, eth→ethical, cas→casual -- Configurable: `WithBatchSize(n)`, `WithPromptField(field)`, `WithPromptTemplate(tmpl)` -- Mock-friendly via `inference.TextModel` interface — 3 unit tests with mock, 1 integration test with real model - ---- - -## 2026-02-20: 1B vs 27B Calibration Tool - -`CalibrateDomains()` added to `calibrate.go`. Accepts two `inference.TextModel` instances (typically 1B and 27B) plus a corpus of `CalibrationSample` entries, classifies all with both models, and computes agreement metrics. - -### Design - -- **Sequential batch classification**: classifies entire corpus with model A first, then model B, via `classifyAll()` — manages memory safely for large models -- **Ground truth support**: samples with non-empty `TrueDomain` contribute to per-model accuracy metrics -- **Agreement metrics**: total/agreed count, agreement rate, per-domain distribution for each model, confusion pairs (A→B direction), accuracy vs ground truth -- **Reuses existing infrastructure**: same `mapTokenToDomain()`, same `ClassifyOption` config, same batch pattern as `ClassifyCorpus()` - -### Integration Test Corpus (500 samples) - -- 220 ground-truth sentences (55/domain × 4 domains) — expanded from benchmark corpus with additional realistic examples -- 280 unlabelled diverse sentences — mixed register (technical/creative/ethical/casual interleaved) for agreement-only measurement -- Soft assertion: agreement rate > 50% (models should share classification semantics) - ---- - -## 2026-02-20: Reference 
Distributions & Anomaly Detection - -### Reference Distribution Builder (`reversal/reference.go`) - -`BuildReferences()` tokenises classified samples, builds `GrammarImprint` for each, groups by domain, and computes: - -- **Centroid imprint**: per-map-field accumulation then L1 normalisation, scalar fields averaged -- **Per-key variance**: sample variance across all imprints per domain, prefixed by component ("verb:", "tense:", "noun:", "article:", "punct:") -- **ReferenceSet**: holds all domain `ReferenceDistribution` entries, provides `Compare()`, `Classify()`, `DetectAnomalies()`, `DomainNames()` - -### Imprint Comparator - -Three distance metrics between an imprint and each reference centroid: - -| Metric | Method | Notes | -|--------|--------|-------| -| Cosine similarity | Reuses `Similar()` | Weighted by component (verb 0.30, tense 0.20, noun 0.25, article 0.15, punct 0.10) | -| KL divergence | Symmetric (Jensen-Shannon style) | Epsilon-smoothed (1e-10), same component weights | -| Mahalanobis | Variance-normalised Euclidean | Uses per-key variance from reference set, falls back to unit variance when missing | - -`Classify()` ranks domains by cosine similarity, returns best match + confidence margin (gap between 1st and 2nd). 
- -### Cross-Domain Anomaly Detection (`reversal/anomaly.go`) - -`DetectAnomalies()` compares model-assigned domain labels against imprint-based classification: - -- Tokenises each sample, builds imprint, classifies against references -- Flags samples where model domain ≠ imprint domain -- Returns per-sample `AnomalyResult` (text, both domains, confidence, anomaly flag) + `AnomalyStats` (rate, by-pair counts) -- **Validated**: "She painted the sunset over the mountains" tagged as technical correctly flagged as creative anomaly (confidence 0.37) - -### Test Results - -| Test | Result | Key observation | -|------|--------|----------------| -| Technical sentence vs tech/creative centroids | tech_sim=0.38, creative_sim=0.17 | Grammar engine correctly distinguishes imperative from narrative | -| Creative sentence mislabelled as technical | Flagged as anomaly | Anomaly detector works on real grammar profile differences | -| Single-domain reference set | No false anomalies | Pipeline handles degenerate case cleanly | -| KL of identical distributions | ~0.0 | Symmetric KL numerically stable | -| Mahalanobis without variance | Falls back to Euclidean | Graceful degradation | - -### Implications - -1. **Grammar alone separates technical from creative** (sim gap 0.21) but ethical↔technical and casual↔creative overlap persists — confirms 1B model is needed for those axes -2. **Anomaly detection is a training signal**: mislabelled samples from 1B classification can be flagged for human review or 27B verification -3. **Reference distributions enable Poindexter integration**: the `ReferenceSet` API is the bridge between grammar analysis and the broader trust verification pipeline diff --git a/TODO.md b/TODO.md deleted file mode 100644 index e8828ac..0000000 --- a/TODO.md +++ /dev/null @@ -1,58 +0,0 @@ -# TODO.md — go-i18n Task Queue - -Dispatched from core/go orchestration. All phases complete as of 20 Feb 2026. 
- ---- - -## Phase 1: Harden the Engine — COMPLETE - -- [x] **Add CLAUDE.md** *(d5b3eac)* -- [x] **Ambiguity resolution for dual-class words** — Two-pass probabilistic disambiguation, 7 weighted signals, 6 dual-class words *(3848297)* -- [x] **Extend irregular verb coverage** — 44 new irregular verbs (17 compound, 22 simple, 5 CVC overrides), ~140 total -- [x] **Add benchmarks** — 8 forward + 7 reversal. Baselines: PastTense 26ns, Tokenise 639ns, Imprint 648ns, Similar 516ns/0alloc - -## Phase 2a: 1B Pre-Classification — COMPLETE - -- [x] **Classification benchmark suite** — 220 domain-tagged sentences, grammar engine: 54.1% overall (tech 78%, creative 82%, ethical 46%, casual 11%) -- [x] **1B pre-sort pipeline** — `ClassifyCorpus()` in `classify.go`, 80 prompts/sec on M3 Ultra, mock + integration tested -- [x] **Virgil review fixes** — go.mod cleanup, prefix collision fix, short-mode skip, accuracy assertion (5 items) -- [x] **1B vs 27B calibration** — `CalibrateDomains()` in `calibrate.go`, 500-sample corpus (220 ground-truth + 280 unlabelled), 7 mock tests *(3b7ef9d)* -- [x] **Article/irregular validator** — Single-token classification via `m.Generate()` with temp=0.05 - -## Phase 2b: Reference Distributions — COMPLETE - -- [x] **Reference distribution builder** — `BuildReferences()` in `reversal/reference.go`, per-domain centroid + variance *(c3e9153)* -- [x] **Imprint comparator** — `Compare()` + `Classify()`, cosine/KL/Mahalanobis distance metrics *(c3e9153)* -- [x] **Cross-domain anomaly detection** — `DetectAnomalies()` in `reversal/anomaly.go`, flags model vs imprint disagreements *(c3e9153)* - -## Phase 3: Multi-Language — COMPLETE - -- [x] **Grammar table format spec** — JSON schema in `docs/grammar-table-spec.md` -- [x] **French grammar tables** — 50 verbs, 24 gendered nouns, gendered articles, punctuation spacing - ---- - -## Integration Tests (require real models on `/Volumes/Data/lem/`) - -```bash -# 1B classification pipeline (50 prompts, ~1s) -cd 
integration && go test -v -run TestClassifyCorpus_Integration - -# 1B vs 27B calibration (500 sentences, ~2-5min with 27B) -cd integration && go test -v -run TestCalibrateDomains_1Bvs27B -``` - -## Future Work (not yet tasked) - -- **Expanded dual-class words** — 20 candidates: patch, release, update, change, merge, push, pull, tag, log, watch, link, host, import, export, process, function, handle, trigger, stream, queue. Measure imprint drift in 88K seeds first. -- **French reversal** — Elision (l') and plural articles (les/des) need `Article()` extension. -- **88K seed corpus processing** — Run full pre-sort + reference distribution build against LEM Phase 0 seeds. - ---- - -## Workflow - -1. Virgil in core/go writes tasks here after research -2. This repo's session picks up tasks in phase order -3. Mark `[x]` when done, note commit hash -4. New discoveries → add tasks, flag in FINDINGS.md diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..875e8e2 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,342 @@ +# Architecture + +go-i18n is a grammar engine for Go. It is not a translation file manager. Consumers bring their own translation keys; the library provides grammatical primitives for composing and reversing grammatically correct text across multiple languages. + +Module: `forge.lthn.ai/core/go-i18n` + +--- + +## Package Overview + +| Package | Purpose | +|---------|---------| +| Root (`i18n`) | Forward composition: grammar primitives, T(), handlers, service, loader | +| `reversal/` | Reverse grammar: tokeniser, imprint, reference distributions, anomaly detection, multiplier | +| `locales/` | Grammar tables (JSON) — only `gram.*` data | +| `docs/` | Specifications and design documents | + +--- + +## Forward Composition + +The root package composes grammatically correct text from base forms. Every public function follows a three-tier lookup: + +1. 
JSON grammar tables loaded for the current language (`gram.verb`, `gram.noun`) +2. Go built-in irregular maps (`irregularVerbs`, `irregularNouns`) +3. Regular morphological rules (algorithmic) + +### Grammar Primitives + +**`PastTense(verb string) string`** + +Applies the three-tier fallback to produce a simple past form. Irregular forms (e.g. `run` → `ran`, `build` → `built`) are resolved at tier 1 or 2; regular forms apply consonant-doubling, e/y-ending rules at tier 3. Benchmark: 26 ns/op for irregular (map lookup, zero allocs), 49 ns/op for regular (one string allocation). + +**`Gerund(verb string) string`** + +Produces the present participle. Handles `-ie` → `-ying` (e.g. `die` → `dying`), silent-e drop (e.g. `delete` → `deleting`), and consonant doubling (e.g. `run` → `running`). + +**`Pluralize(noun string, count int) string`** + +Returns singular when `count == 1`, delegates to `PluralForm()` otherwise. + +**`PluralForm(noun string) string`** + +Three-tier noun plural lookup. Regular rules handle sibilant (`+es`), consonant+y → ies, f/fe → ves, and default (`+s`). + +**`Article(word string) string`** + +Returns `"a"` or `"an"` based on phonetic rules. Handles exceptions in both directions: consonant-sound words starting with a vowel letter (e.g. `user` → `"a"`) and vowel-sound words starting with a consonant letter (e.g. `hour` → `"an"`). Implemented as prefix lookup tables for the known exceptions, falling back to first-letter vowel test. + +**Composite functions** + +`Progress(verb)` → `"Building..."`, `ProgressSubject(verb, subject)` → `"Building project..."`, `ActionResult(verb, subject)` → `"File deleted"`, `ActionFailed(verb, subject)` → `"Failed to delete file"`, `Label(word)` → `"Status:"`. + +All composite functions look up language-specific punctuation rules from `gram.punct` (`LabelSuffix`, `ProgressSuffix`) to handle differences such as the French space-before-colon convention. 
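The three-tier fallback can be sketched as follows. This is an illustrative reduction, not the library's implementation: the real tier 1 consults the loaded `gram.verb` tables (here stood in by a plain map), tier 2 the full `irregularVerbs` map (~140 entries, a tiny subset shown), and tier 3 the complete regular rules including consonant doubling (omitted here).

```go
package main

import (
	"fmt"
	"strings"
)

// Tier 2: built-in irregular forms (illustrative subset only).
var irregularVerbs = map[string]string{
	"run": "ran", "build": "built", "undo": "undid",
}

func isVowel(c byte) bool { return strings.ContainsRune("aeiou", rune(c)) }

// pastTense sketches the three-tier lookup: grammar table first,
// then the irregular map, then regular morphological rules.
func pastTense(verb string, table map[string]string) string {
	if past, ok := table[verb]; ok { // tier 1: JSON grammar table
		return past
	}
	if past, ok := irregularVerbs[verb]; ok { // tier 2: built-in irregulars
		return past
	}
	// Tier 3: regular rules (simplified — no consonant doubling).
	switch {
	case strings.HasSuffix(verb, "e"):
		return verb + "d" // silent-e: delete -> deleted
	case len(verb) > 1 && strings.HasSuffix(verb, "y") && !isVowel(verb[len(verb)-2]):
		return verb[:len(verb)-1] + "ied" // consonant+y: copy -> copied
	default:
		return verb + "ed"
	}
}

func main() {
	table := map[string]string{"deploy": "deployed"} // stand-in for gram.verb data
	fmt.Println(pastTense("run", table))    // ran (tier 2)
	fmt.Println(pastTense("delete", table)) // deleted (tier 3, silent-e)
	fmt.Println(pastTense("copy", table))   // copied (tier 3, y -> ied)
}
```

The ordering matters: because tier 1 wins, a locale's grammar table can override both the built-in irregulars and the algorithmic rules.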
+ +### T() and Key Handlers + +`T(key string, args ...any) string` is the entry point for translation. It passes the key through a chain of `KeyHandler` implementations before falling back to the message store. The `i18n.*` namespace is handled by built-in handlers that auto-compose output: + +| Key pattern | Output | +|-------------|--------| +| `i18n.label.` | `Label(word)` | +| `i18n.progress.` | `Progress(verb)` | +| `i18n.count.` | `Pluralize(noun, n)` with count arg | +| `i18n.done.` | `ActionResult(verb, subject)` | +| `i18n.fail.` | `ActionFailed(verb, subject)` | + +### Grammar Data Loading + +`GrammarData` holds the parsed grammar tables for one language: + +```go +type GrammarData struct { + Verbs map[string]VerbForms + Nouns map[string]NounForms + Articles ArticleForms + Words map[string]string + Punct PunctuationRules + Signals SignalData +} +``` + +`FSLoader` reads `locales/.json`, calls `flattenWithGrammar()`, which walks the raw JSON tree and extracts `gram.*` blocks into typed Go structs before flattening all other keys into a `map[string]Message`. This is why the `gram.*` structure must remain nested in JSON — the extractor relies on path prefixes (`gram.verb.*`, `gram.noun.*`, etc.) to route objects correctly. Flattening to dot-separated keys bypasses this routing and causes silent data loss. + +Grammar data is held in a package-level `grammarCache` (map protected by `sync.RWMutex`). `SetGrammarData(lang, data)` stores a loaded instance; `GetGrammarData(lang)` retrieves it. + +--- + +## Reversal Engine + +The reversal package reads grammar tables backwards: given an inflected form, recover the base form and the grammatical role. 
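The core reversal idea — generate base-form candidates from an inflected form, then round-trip verify each through the forward rules — can be sketched as below. The function names and the simplified forward rule are illustrative, not the package's API; the real `bestRoundTrip()` applies further preferences (VCe endings, non-`e` endings) beyond the known-base check shown here.

```go
package main

import "fmt"

// pastTense is a simplified forward rule: base -> past.
func pastTense(base string) string {
	if base[len(base)-1] == 'e' {
		return base + "d"
	}
	return base + "ed"
}

// reversePast generates base-form candidates for an inflected past
// form and round-trip verifies each through the forward rule,
// preferring candidates already known as base verbs.
func reversePast(past string, knownBases map[string]bool) (string, bool) {
	var candidates []string
	if len(past) > 1 && past[len(past)-1] == 'd' {
		candidates = append(candidates, past[:len(past)-1]) // "deleted" -> "delete"
	}
	if len(past) > 2 && past[len(past)-2:] == "ed" {
		candidates = append(candidates, past[:len(past)-2]) // "checked" -> "check"
	}
	for _, c := range candidates { // first pass: round-trip + known base
		if pastTense(c) == past && knownBases[c] {
			return c, true
		}
	}
	for _, c := range candidates { // fallback: round-trip only
		if pastTense(c) == past {
			return c, true
		}
	}
	return "", false
}

func main() {
	bases := map[string]bool{"delete": true, "check": true}
	fmt.Println(reversePast("deleted", bases)) // delete true
	fmt.Println(reversePast("checked", bases)) // check true
}
```

Note that "checked" yields two round-trip-valid candidates ("checke" and "check"); the known-base preference is what resolves the ambiguity, which is exactly the role `bestRoundTrip()` plays in the real tokeniser.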
+ +### Tokeniser + +`Tokeniser` maintains inverse indexes built at construction time: + +| Index | Direction | Example | +|-------|-----------|---------| +| `pastToBase` | inflected past → base | `"deleted"` → `"delete"` | +| `gerundToBase` | inflected gerund → base | `"deleting"` → `"delete"` | +| `baseVerbs` | base verb set | `"delete"` → true | +| `pluralToBase` | plural → singular | `"files"` → `"file"` | +| `baseNouns` | base noun set | `"file"` → true | +| `words` | word map (case-insensitive) | `"url"` → `"url"` | +| `dualClass` | words in both verb and noun tables | `"commit"` → true | + +`buildVerbIndex()` and `buildNounIndex()` populate these indexes in two sub-tiers: first from the loaded `GrammarData` (JSON tables), then from the exported `IrregularVerbs()` and `IrregularNouns()` maps. JSON takes precedence; existing entries are not overwritten. + +**`MatchVerb(word string) (VerbMatch, bool)`** + +Three-tier lookup: + +1. Is the word in `baseVerbs`? Return tense `"base"`. +2. Is the word in `pastToBase` or `gerundToBase`? Return the appropriate tense. +3. Generate candidates via `reverseRegularPast()` or `reverseRegularGerund()` and round-trip verify each through the forward functions `PastTense()` / `Gerund()`. `bestRoundTrip()` resolves ambiguous candidates by preferring known base verbs, then VCe-ending words (the "magic e" pattern: `delete`, `create`, `use`), then words not ending in `e`. + +**`MatchNoun(word string) (NounMatch, bool)`** + +Same three-tier structure, using `reverseRegularPlural()` for tier 3. + +### Tokenisation Pipeline + +`Tokenise(text string) []Token` is a two-pass algorithm: + +**Pass 1 — Classify and mark** + +Each whitespace-separated token is first stripped of trailing punctuation (which becomes a separate `TokenPunctuation` token). The word portion is then checked in order: article → verb + noun combined check → word map. 
For dual-class base forms (a word that appears in both `baseVerbs` and `baseNouns` and has no inflection to self-resolve), the token is marked as `tokenAmbiguous` (an internal sentinel) with both `VerbInfo` and `NounInfo` stashed for Pass 2. + +Inflected forms self-resolve: `"committed"` always resolves as verb (past tense), `"commits"` as noun (plural), regardless of dual-class membership. + +**Pass 2 — Disambiguate** + +For each `tokenAmbiguous` token, `scoreAmbiguous()` evaluates seven weighted signals and `resolveToken()` converts scores to a classification with confidence values. + +| Signal | Weight | Description | +|--------|--------|-------------| +| `noun_determiner` | 0.35 | Preceding token is in the noun determiner list (articles, possessives, quantifiers) | +| `verb_auxiliary` | 0.25 | Preceding token is a modal, auxiliary, or infinitive marker | +| `following_class` | 0.15 | Next token is article/noun (→ verb signal) or verb (→ noun signal) | +| `sentence_position` | 0.10 | Sentence-initial position signals imperative verb | +| `verb_saturation` | 0.10 | A confident verb already exists in the same clause | +| `inflection_echo` | 0.03 | Another token uses the same base in an inflected form | +| `default_prior` | 0.02 | Always fires as verb signal (tiebreaker) | + +When total signal weight is below 0.10 (only the default prior fired), confidence is floored at 0.55/0.45 rather than deriving a misleading 1.0. + +The `verb_saturation` signal scans within clause boundaries only. Clause boundaries are defined as punctuation tokens and coordinating/subordinating conjunctions (`and`, `or`, `but`, `because`, `when`, `while`, `if`, `then`, `so`). + +Confidence values flow into imprints: dual-class tokens contribute to both verb and noun distributions weighted by `Confidence` and `AltConf`, preserving uncertainty for downstream scoring. 
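The score-to-confidence conversion, including the confidence floor, can be sketched as follows. The weights are taken from the table above (a subset for brevity); the function shape is illustrative, not the package's `resolveToken()` signature.

```go
package main

import "fmt"

// Signal weights from the disambiguation table (subset).
var weights = map[string]float64{
	"noun_determiner": 0.35,
	"verb_auxiliary":  0.25,
	"default_prior":   0.02,
}

// resolve converts fired signals into a (class, confidence) pair.
func resolve(nounSignals, verbSignals []string) (string, float64) {
	var nounScore, verbScore float64
	for _, s := range nounSignals {
		nounScore += weights[s]
	}
	for _, s := range verbSignals {
		verbScore += weights[s]
	}
	total := nounScore + verbScore
	// Confidence floor: when only the default prior fired (total < 0.10),
	// cap at 0.55 rather than deriving a misleading 1.0 from 0.02/0.02.
	if total < 0.10 {
		if verbScore >= nounScore {
			return "verb", 0.55
		}
		return "noun", 0.55
	}
	if nounScore > verbScore {
		return "noun", nounScore / total
	}
	return "verb", verbScore / total
}

func main() {
	// "the commit" — a determiner precedes the dual-class word.
	fmt.Println(resolve([]string{"noun_determiner"}, []string{"default_prior"}))
	// Bare "commit" — only the default prior fires: floored at 0.55.
	fmt.Println(resolve(nil, []string{"default_prior"})) // verb 0.55
}
```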
+ +**Token type** + +```go +type Token struct { + Raw string + Lower string + Type TokenType // TokenVerb, TokenNoun, TokenArticle, TokenWord, TokenPunctuation, TokenUnknown + Confidence float64 // 0.0–1.0 + AltType TokenType // Runner-up (dual-class only) + AltConf float64 + VerbInfo VerbMatch + NounInfo NounMatch + WordCat string + ArtType string + PunctType string + Signals *SignalBreakdown // Non-nil only with WithSignals() option +} +``` + +**Options** + +`WithSignals()` allocates `SignalBreakdown` on ambiguous tokens, providing per-component scoring for diagnostic use. It adds approximately 36% latency and 3x allocations versus plain tokenise; keep it off in production paths. + +`WithWeights(map[string]float64)` overrides signal weights without code changes, useful for calibration experiments. + +**Benchmark baselines (M3 Ultra, arm64)** + +| Operation | ns/op | allocs | +|-----------|-------|--------| +| `Tokenise` (3 words) | 639 | 8 | +| `Tokenise` (12 words) | 2859 | 14 | +| `Tokenise` (dual-class) | 1657 | 9 | +| `Tokenise` + `WithSignals` | 2255 | 28 | +| `NewImprint` | 648 | 10 | +| `Imprint.Similar` | 516 | 0 | +| `Multiplier.Expand` | 3609 | 63 | + +Tokenise scales approximately linearly at 200–240 ns/word, giving approximately 350K sentences/second single-threaded. + +--- + +## GrammarImprint + +`GrammarImprint` is a low-dimensional grammar feature vector computed from a token slice. It is a lossy projection: content is discarded, grammatical structure is preserved. 
+ +```go +type GrammarImprint struct { + VerbDistribution map[string]float64 // verb base → normalised frequency + TenseDistribution map[string]float64 // "past"/"gerund"/"base" → ratio + NounDistribution map[string]float64 // noun base → normalised frequency + PluralRatio float64 // proportion of plural nouns + DomainVocabulary map[string]int // gram.word category → hit count + ArticleUsage map[string]float64 // "definite"/"indefinite" → ratio + PunctuationPattern map[string]float64 // "label"/"progress"/"question" → ratio + TokenCount int + UniqueVerbs int + UniqueNouns int +} +``` + +`NewImprint(tokens []Token) GrammarImprint` accumulates counts weighted by token confidence, then normalises all frequency maps to sum to 1.0 via L1 normalisation. + +**`Similar(b GrammarImprint) float64`** + +Returns weighted cosine similarity (0.0–1.0) between two imprints: + +| Component | Weight | +|-----------|--------| +| `VerbDistribution` | 0.30 | +| `NounDistribution` | 0.25 | +| `TenseDistribution` | 0.20 | +| `ArticleUsage` | 0.15 | +| `PunctuationPattern` | 0.10 | + +Components where both maps are empty are excluded from the weighted average (no signal contributed). `Similar` is zero-alloc (516 ns/op on M3 Ultra), making it suitable for high-volume comparison. + +--- + +## Reference Distributions + +`BuildReferences(tokeniser, samples) (*ReferenceSet, error)` takes a slice of `ClassifiedText` (text + domain label), tokenises each, builds an imprint, and aggregates by domain. + +**Centroid computation** + +For each domain, `computeCentroid()` accumulates all imprint map fields using `addMap()`, then: +- Normalises the accumulated maps via `normaliseMap()` (L1 norm, sums to 1.0) +- Averages scalar fields (`PluralRatio`, `TokenCount`, `UniqueVerbs`, `UniqueNouns`) + +The result is a single centroid `GrammarImprint` representing the grammatical centre of mass for the domain. 
+ +**Variance computation** + +`computeVariance()` computes sample variance for each key across all imprints in the domain. Keys are prefixed by component (`"verb:"`, `"tense:"`, `"noun:"`, `"article:"`, `"punct:"`) to form a flat variance map. Requires at least two samples; returns nil otherwise. + +**Distance metrics** + +`ReferenceSet.Compare(imprint)` computes three distance metrics between an imprint and each domain centroid: + +| Metric | Implementation | Notes | +|--------|---------------|-------| +| Cosine similarity | `Similar()` | 0.0–1.0, higher is closer | +| KL divergence | Symmetric (Jensen-Shannon style) | 0.0+, lower is closer; epsilon-smoothed at 1e-10 | +| Mahalanobis | Variance-normalised squared distance | Falls back to Euclidean (unit variance) when variance is unavailable | + +The same component weights (verb 0.30, noun 0.25, tense 0.20, article 0.15, punct 0.10) are applied to KL divergence and Mahalanobis computations. + +**Classification** + +`ReferenceSet.Classify(imprint) ImprintClassification` ranks domains by cosine similarity and returns the best match. Confidence is the margin between the best and second-best similarity scores (0.0 when there is only one domain). + +--- + +## Anomaly Detection + +`ReferenceSet.DetectAnomalies(tokeniser, samples) ([]AnomalyResult, *AnomalyStats)` compares domain labels from an external classifier (e.g. a 1B language model) against imprint-based classification. + +For each sample: +1. Tokenise the text +2. Build a `GrammarImprint` +3. Classify against reference centroids via `Classify()` +4. Compare the model's domain label against the imprint's domain + +A sample is flagged as an anomaly when the two domains disagree. The aggregate `AnomalyStats` reports total count, anomaly count, rate, and a per-pair breakdown (`"technical->creative": 4`). + +Anomaly detection serves as a training signal: mislabelled samples from a 1B model can be flagged for human review or 27B verification. 
Validated behaviour: a creative sentence labelled as technical is correctly identified as an anomaly with a measurable confidence margin. + +--- + +## 1B Pre-Sort Pipeline + +`ClassifyCorpus(ctx, model, input, output, opts...) (*ClassifyStats, error)` reads JSONL from `input`, batch-classifies each record through a `go-inference` `TextModel`, and writes JSONL with a `domain_1b` field added to `output`. + +Architecture: +- Configurable via `WithBatchSize(n)`, `WithPromptField(field)`, `WithPromptTemplate(tmpl)` +- Single-token generation (`WithMaxTokens(1)`) at temperature 0.05 for classification +- `mapTokenToDomain(token)` maps model output to one of `{technical, creative, ethical, casual, unknown}` via exact match and known BPE fragment prefixes +- Mock-friendly via the `inference.TextModel` interface + +Observed throughput: 80 prompts/second on M3 Ultra with Gemma3-1B (4-bit quantised), steady-state approaching 152 prompts/second as batch pipeline warms up. + +--- + +## 1B vs 27B Calibration + +`CalibrateDomains(ctx, modelA, modelB, samples, opts...) (*CalibrationStats, error)` classifies a corpus with two models sequentially (A first, then B), then computes agreement and accuracy metrics. + +`CalibrationStats` includes: +- Total and agreed counts, agreement rate +- Per-model domain distribution +- Confusion pairs in `"domainA->domainB"` format +- Per-model accuracy against ground-truth labels (when provided) +- Per-model classification duration + +The 500-sample integration corpus mixes 220 ground-truth sentences (55 per domain) with 280 unlabelled diverse sentences. + +--- + +## Multiplier + +`Multiplier.Expand(text string) []string` generates deterministic grammatical variants for training data augmentation with zero API calls. + +For each verb in the tokenised text, it produces past, gerund, and base tense variants. For each noun, it toggles plural/singular. Combinations of verb transform and noun transform are also emitted. All variants are deduplicated. 
Case preservation (`preserveCase()`) maintains the capitalisation pattern of the original token — all-caps, title-case, and lower-case are all handled. + +--- + +## Multi-Language Support + +The grammar engine is language-parametric. Every function that produces or classifies text uses the current language to look up `GrammarData`. Grammar tables must be loaded via `SetGrammarData(lang, data)` before use. + +**French grammar tables** (`locales/fr.json`) include: +- 50 verb conjugations with `past` (passé composé participial form) and `gerund` (présent participe) +- 24 gendered nouns with `"m"` or `"f"` gender fields +- Gendered articles: `by_gender: {"m": "le", "f": "la"}`; indefinite: `"un"` (both vowel and default — French does not distinguish) +- Punctuation: `label: " :"` (space before colon, per French typographic convention), `progress: "..."` + +**Tier 2 and tier 3 fallbacks are English-only.** The `irregularVerbs` and `irregularNouns` Go maps and the regular morphology rules apply English patterns. Non-English languages must therefore provide comprehensive `gram.verb` and `gram.noun` tables in JSON. + +**Disambiguation signal lists** in `gram.signal` are per-language. The tokeniser's `buildSignalIndex()` loads each list independently. If a list is absent or empty, the tokeniser falls back to hardcoded English defaults for that signal only. + +**Plural rules** for CLDR plural categories are registered per language code in the `pluralRules` map. Supported: en, de, fr, es, ru, pl, ar, zh, ja, ko (with regional variants). + +**RTL detection** is provided for Arabic, Hebrew, Persian, Urdu, and related codes via `rtlLanguages`. + +See `docs/grammar-table-spec.md` for the full JSON schema. + +--- + +## Irreversibility and Round-Trip Property + +The reversal engine maintains the **round-trip property**: for any base form `b`, if `PastTense(b)` produces inflected form `f`, then `MatchVerb(f)` must recover `b`. 
This property is enforced by round-trip verification in tier 3 of all `Match*` functions: candidate bases are only accepted if the forward function reproduces the original inflected form. Tests in `reversal/roundtrip_test.go` validate this property for all irregular verbs and a sample of regular patterns. + +The imprint is inherently lossy — it discards lexical content and retains only grammatical structure. The round-trip property applies to the tokeniser, not to the imprint. diff --git a/docs/development.md b/docs/development.md new file mode 100644 index 0000000..601cc4c --- /dev/null +++ b/docs/development.md @@ -0,0 +1,192 @@ +# Development Guide + +## Prerequisites + +- Go 1.25 or later (the module uses `go 1.25.5`) +- `golang.org/x/text` (only external dependency for the core engine) +- `forge.lthn.ai/core/go-inference` (replaced via local path `../go-inference` in `go.mod` — required for the `classify.go` and `calibrate.go` files and integration tests) + +For integration tests only: +- Models on `/Volumes/Data/lem/` — specifically `LEM-Gemma3-1B-layered-v2` and `LEM-Gemma3-27B` (or compatible models served via the `go-inference` interface) + +The `go-inference` package provides the `TextModel` interface used by `ClassifyCorpus()` and `CalibrateDomains()`. Unit tests use a mock implementation and do not require real models. + +--- + +## Build and Test + +```bash +# Run all tests +go test ./... + +# Run tests with verbose output +go test -v ./... + +# Run tests for a specific package +go test -v ./reversal/ + +# Run a single test by name +go test -run TestName ./... + +# Run benchmarks +go test -bench=. ./... + +# Run benchmarks for a specific package +go test -bench=. -benchmem ./reversal/ + +# Run with race detector +go test -race ./... +``` + +All tests must pass before committing. The race detector must report clean. 
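A new benchmark added under these conventions might be shaped like this. Illustrative only: `pastTense` is a toy stand-in defined in the sketch, not the library's `PastTense`; the point is the repo convention of calling `b.ReportAllocs()` so results stay comparable with the baseline tables.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// pastTense is a toy stand-in for the real forward-composition function.
func pastTense(base string) string {
	if strings.HasSuffix(base, "e") {
		return base + "d"
	}
	return base + "ed"
}

// BenchmarkPastTense shows the expected benchmark shape: report
// allocations and keep the loop body to the single call under test.
func BenchmarkPastTense(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = pastTense("merge")
	}
}

func main() {
	// testing.Benchmark lets the sketch run outside `go test`.
	res := testing.Benchmark(BenchmarkPastTense)
	fmt.Println(res.N > 0)
}
```

In the repository itself, `go test -bench=PastTense -benchmem ./...` reports ns/op and allocs/op for comparison against the baselines.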
+ +--- + +## Integration Tests + +Integration tests require real model instances on `/Volumes/Data/lem/` and are kept in the `integration/` directory, separate from unit tests. They are not run by `go test ./...` from the module root (the integration package is excluded from the default build tag set). + +```bash +# 1B classification pipeline (50 prompts, approximately 1 second on M3 Ultra) +cd integration && go test -v -run TestClassifyCorpus_Integration + +# 1B vs 27B calibration (500 sentences, approximately 2-5 minutes with 27B) +cd integration && go test -v -run TestCalibrateDomains_1Bvs27B +``` + +If models are unavailable, the integration tests skip automatically via `testing.Short()` or an explicit model-presence check. Do not convert integration tests to unit tests — they have real runtime cost and external dependencies. + +--- + +## Test Patterns + +### Unit tests + +Unit tests for the reversal package follow the `_Good`, `_Bad`, `_Ugly` naming pattern inherited from the broader Core Go ecosystem: + +- `_Good`: happy path +- `_Bad`: expected error conditions +- `_Ugly`: panic or edge cases + +Tests for the root package use standard Go test function naming. + +### Mock models + +`ClassifyCorpus()` and `CalibrateDomains()` accept the `inference.TextModel` interface. Unit tests construct a mock that returns controlled token sequences without loading any model. The mock implements `Classify(ctx, prompts, opts...) ([]Result, error)`. + +### Round-trip tests + +`reversal/roundtrip_test.go` validates the round-trip property: every verb in `irregularVerbs` and every noun in `irregularNouns` must survive a reverse lookup and recover the original base form. Add any new irregular entries to the maps in `types.go` and the round-trip tests will automatically cover them. 
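The property these tests enforce can be illustrated with a toy table. The maps and helpers here (`irregularVerbs`, `pastTense`, `matchVerb`) are hypothetical stand-ins for the real entries in `types.go` and the tier 1 reverse lookup, reduced to the bare round-trip check.

```go
package main

import "fmt"

// Toy stand-in for the real irregularVerbs table: base form → past tense.
var irregularVerbs = map[string]string{
	"run":    "ran",
	"undo":   "undid",
	"commit": "committed",
}

// pastTense is the forward direction.
func pastTense(base string) string { return irregularVerbs[base] }

// matchVerb inverts the table, as tier 1 of the reversal chain does.
func matchVerb(form string) (base string, ok bool) {
	for b, past := range irregularVerbs {
		if past == form {
			return b, true
		}
	}
	return "", false
}

func main() {
	// Round-trip property: forward then reverse must recover every base.
	for base := range irregularVerbs {
		got, ok := matchVerb(pastTense(base))
		if !ok || got != base {
			fmt.Printf("round-trip FAILED for %q\n", base)
			return
		}
	}
	fmt.Println("round-trip ok")
}
```

Because the real tests iterate the maps themselves, any entry added to `types.go` is covered without touching the test file, exactly as described above.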
+ +### Disambiguation tests + +Nine named scenario tests cover the key disambiguation signal interactions: +- Noun after determiner (noun_determiner fires) +- Imperative verb at sentence start (sentence_position fires) +- Verb saturation within clause +- Clause boundary isolation +- Contraction auxiliary (`don't`, `can't`, etc.) + +Twelve dual-class round-trip tests cover all six dual-class words (`commit`, `run`, `test`, `check`, `file`, `build`) in both verb and noun roles. + +### Benchmark baselines + +Benchmark baselines were measured on M3 Ultra, arm64. See `FINDINGS.md` (archived in `docs/history.md`) for the full table. When adding new benchmarks, include `b.ReportAllocs()` and compare against the baseline table. + +--- + +## Coding Standards + +### Language + +UK English throughout. Correct spellings: `colour`, `organisation`, `centre`, `analyse`, `recognise`, `optimise`, `initialise`, `synchronise`, `cancelling`, `modelled`, `labelled`, `travelling`. These spellings appear in the `irregularVerbs` map and must remain consistent. + +### Go style + +- `declare(strict_types=1)` equivalent: use explicit types on all declarations where the type is not obvious from context +- All parameters and return types must be named and typed +- Prefer `fmt.Errorf("context: %w", err)` for error wrapping +- Use `errors.Is()` for error comparison, not string matching +- No global mutable state beyond the `grammarCache` and `templateCache` (which are already protected by synchronisation primitives) + +### Grammar table rules + +**Never flatten `gram.*` keys in locale JSON.** The loader (`flattenWithGrammar()`) depends on the nested `gram.verb.*`, `gram.noun.*` etc. path structure to route objects into typed Go structs. Flattening to `"gram.verb.delete.past": "deleted"` causes silent data loss — the key is treated as a plain translation message, not a verb form. + +**Dual-class words** must appear in both `gram.verb` and `gram.noun` in the JSON. 
The tokeniser builds the `dualClass` index by intersecting `baseVerbs` and `baseNouns` at construction time. + +**Only `gram.*` grammar data belongs in `locales/en.json` and `locales/fr.json`.** Consumer app translation keys (`prompt.*`, `time.*`, etc.) are managed by consumers, not this library. + +### File organisation + +| File | Contents | +|------|----------| +| `types.go` | All types, interfaces, package variables, irregular maps | +| `grammar.go` | Forward composition functions | +| `loader.go` | FSLoader, JSON parsing, flattenWithGrammar | +| `classify.go` | ClassifyCorpus, ClassifyStats, ClassifyOption | +| `calibrate.go` | CalibrateDomains, CalibrationStats, CalibrationResult | +| `reversal/tokeniser.go` | Tokeniser, Tokenise, two-pass disambiguation | +| `reversal/imprint.go` | GrammarImprint, NewImprint, Similar | +| `reversal/reference.go` | ReferenceSet, BuildReferences, Compare, Classify, distance metrics | +| `reversal/anomaly.go` | DetectAnomalies, AnomalyResult, AnomalyStats | +| `reversal/multiplier.go` | Multiplier, Expand | + +Do not put grammar functions in `types.go` or type definitions in `grammar.go`. Keep the split clean. + +--- + +## Conventional Commits + +Format: `type(scope): description` + +Common types: `feat`, `fix`, `test`, `bench`, `refactor`, `docs`, `chore` + +Common scopes: `tokeniser`, `imprint`, `reference`, `anomaly`, `multiplier`, `grammar`, `loader`, `classify`, `calibrate`, `fr` (for French grammar table changes) + +Examples: +``` +feat(tokeniser): add two-pass disambiguation for dual-class words +fix(imprint): floor confidence at 0.55/0.45 when only prior fires +test(reference): add Mahalanobis fallback to Euclidean test +bench(grammar): add PastTense and Gerund baselines +``` + +--- + +## Co-Author + +All commits must include the co-author trailer: + +``` +Co-Authored-By: Virgil +``` + +--- + +## Licence + +EUPL-1.2. Do not add dependencies with incompatible licences. 
The only external runtime dependency is `golang.org/x/text` (BSD-3-Clause, compatible). `go-inference` is an internal Core module.
+
+---
+
+## Adding a New Language
+
+1. Create `locales/<lang>.json` with a complete `gram` block following `docs/grammar-table-spec.md`.
+2. Populate `gram.verb` comprehensively — tiers 2 and 3 of the fallback chain are English-only.
+3. Populate `gram.noun` with gender fields if the language has grammatical gender.
+4. Set `gram.article.by_gender` for gendered article systems.
+5. Set `gram.punct.label` correctly — French uses `" :"` (space before colon), English uses `":"`.
+6. Populate `gram.signal` lists so the disambiguation tokeniser has language-appropriate determiners and auxiliaries. Without these, the tokeniser uses hardcoded English defaults.
+7. Add a plural rule function to the `pluralRules` map in `types.go` if the language has non-standard plural categories (beyond one/other).
+8. Run `go test ./...` and confirm all existing tests still pass. Add grammar data tests that verify the loaded counts and known values.
+9. If the language needs reversal support, verify that `NewTokeniserForLang("<lang>")` builds indexes correctly and that `MatchVerb` / `MatchNoun` return correct results for a sample of forms.
+
+---
+
+## Performance Notes
+
+- `Imprint.Similar` is zero-alloc. Keep it that way — it is called in tight loops during reference comparison.
+- `WithSignals()` allocates `SignalBreakdown` on every ambiguous token. It is for diagnostics only; never enable it in the hot path.
+- `Multiplier.Expand` allocates heavily (63 allocs for a four-word sentence). If it becomes a bottleneck, pool the token slices.
+- The `grammarCache` uses `sync.RWMutex` with read-biased locking. Languages are loaded once at startup and then read-only; this is the intended pattern.
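The load-once, read-mostly locking pattern can be sketched as follows. This is a simplified stand-in, not the real `grammarCache` (which, per the architecture docs, holds per-language grammar data set via `SetGrammarData`); the string values here substitute for the real grammar structs.

```go
package main

import (
	"fmt"
	"sync"
)

// cache sketches the read-biased pattern: writers run once at startup,
// readers take only the cheap shared lock thereafter.
type cache struct {
	mu   sync.RWMutex
	data map[string]string // lang → stand-in for grammar data
}

func (c *cache) get(lang string) (string, bool) {
	c.mu.RLock() // shared lock: many readers proceed concurrently
	defer c.mu.RUnlock()
	v, ok := c.data[lang]
	return v, ok
}

func (c *cache) set(lang, v string) {
	c.mu.Lock() // exclusive lock: startup-time writes only
	defer c.mu.Unlock()
	if c.data == nil {
		c.data = make(map[string]string)
	}
	c.data[lang] = v
}

func main() {
	var c cache
	c.set("fr", "loaded")
	v, ok := c.get("fr")
	fmt.Println(v, ok) // loaded true
}
```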
diff --git a/docs/history.md b/docs/history.md new file mode 100644 index 0000000..3cd0f53 --- /dev/null +++ b/docs/history.md @@ -0,0 +1,238 @@ +# Project History + +## Initial Assessment (2026-02-19) + +**State at assessment**: 5,800 lines across 32 files (14 test files). All tests passing. One external dependency (`golang.org/x/text`). Grammar engine solid: forward composition, reversal, imprint, and multiplier all functional. + +**Key gaps identified**: + +| Gap | Impact | +|-----|--------| +| No CLAUDE.md — agents would flatten locale files | High | +| Dual-class word ambiguity (e.g. "file" as verb vs noun) | Medium | +| No benchmark baselines for hot-path usage | Medium | +| No reference distributions for imprint calibration | High | +| English-only grammar tables | Medium | + +--- + +## Phase 1: Engine Hardening + +**Commit d5b3eac** — Added CLAUDE.md with the critical rule against flattening `gram.*` locale JSON. Established the grammar engine's identity as a primitive provider, not a translation file manager. + +**Commit 3848297** — Two-pass probabilistic disambiguation for dual-class words. + +Words in both `gram.verb` and `gram.noun`: `{commit, run, test, check, file, build}`. Previously resolved verb-first without context. Now uses a two-pass algorithm with seven weighted signals. + +Pass 1 classifies unambiguous tokens and marks base forms of dual-class words as `tokenAmbiguous`. Inflected forms self-resolve (e.g. `"committed"` → verb past, `"commits"` → noun plural). 
+ +Pass 2 evaluates signals: + +| Signal | Weight | +|--------|--------| +| `noun_determiner` | 0.35 | +| `verb_auxiliary` | 0.25 | +| `following_class` | 0.15 | +| `sentence_position` | 0.10 | +| `verb_saturation` | 0.10 | +| `inflection_echo` | 0.03 | +| `default_prior` | 0.02 | + +Design decisions recorded during implementation: +- Confidence floor of 0.55/0.45 when only the default prior fires (total < 0.10), preventing misleading 1.0 confidence from a single weak signal (fix B3) +- Contractions (`don't`, `can't`, `won't`, etc.) added to `verb_auxiliary` signal list (fix D1) +- Clause boundary isolation for `verb_saturation` — scans within punctuation and coordinating conjunctions only (fix D2) +- `WithWeights()` option for configurable signal weights without code changes (fix F3) +- `DisambiguationStats` for aggregate Phase 2 calibration (fix F1) +- `WithSignals()` opt-in for per-token signal diagnostics (kept out of hot path) +- `buildSignalIndex()` guards each signal list independently, allowing partial locale data to fall back per-field (fix R3) +- Removed `"passed"`, `"failed"`, `"skipped"` from `gram.noun` and `gram.word` — these are past participles, not nouns (fix R1) + +Test coverage for this commit: 9 disambiguation scenario tests, 12 dual-class round-trip tests, imprint convergence test, `DisambiguationStats` tests, `WithWeights` override test. Race detector clean. + +**Same session** — Extended irregular verb coverage. 
+
+Added 45 irregular verbs:
+- 17 compound irregular (prefix + base): `undo`, `redo`, `rerun`, `rewrite`, `rebuild`, `resend`, `override`, `rethink`, `remake`, `undergo`, `overcome`, `withdraw`, `uphold`, `withhold`, `outgrow`, `outrun`, `overshoot`
+- 22 simple irregular (dev/ops): `become`, `come`, `give`, `fall`, `understand`, `arise`, `bind`, `spin`, `quit`, `cast`, `broadcast`, `burst`, `cost`, `shed`, `rid`, `shrink`, `shoot`, `forbid`, `offset`, `upset`, `input`, `output`
+- 6 CVC doubling overrides: `debug`, `embed`, `unzip`, `remap`, `unpin`, `unwrap` — words with stressed final syllable that `shouldDoubleConsonant()` misses because they exceed four characters
+
+Total irregular verb count: approximately 140 (from approximately 96).
+
+**Same session** — Added benchmarks.
+
+8 forward composition benchmarks, 7 reversal benchmarks. Baselines on M3 Ultra (arm64):
+
+Forward composition:
+
+| Benchmark | ns/op | allocs/op |
+|-----------|-------|-----------|
+| PastTense (irregular) | 25.67 | 0 |
+| PastTense (regular) | 48.52 | 1 |
+| PastTense (compound) | 26.15 | 0 |
+| Gerund | 25.87 | 0 |
+| Pluralize | 67.97 | 1 |
+| Article | 177.4 | 0 |
+| Progress | 107.1 | 2 |
+| ActionResult | 115.3 | 3 |
+
+Reversal engine:
+
+| Benchmark | ns/op | allocs/op |
+|-----------|-------|-----------|
+| Tokenise (3 words) | 639 | 8 |
+| Tokenise (12 words) | 2859 | 14 |
+| Tokenise (dual-class) | 1657 | 9 |
+| Tokenise (WithSignals) | 2255 | 28 |
+| NewImprint | 648 | 10 |
+| Imprint.Similar | 516 | 0 |
+| Multiplier.Expand | 3609 | 63 |
+
+Key observations:
+- `Similar` is zero-alloc at 516 ns/op — hot-path safe for high-volume imprint comparison
+- Tokenise scales linearly at approximately 200–240 ns/word
+- `WithSignals` adds 36% latency and 3x allocs — keep opt-in
+
+---
+
+## Phase 2a: 1B Pre-Classification
+
+**Classification benchmark results** (220 domain-tagged sentences, 55 per domain, leave-one-out imprint similarity):
+
+| Domain | Accuracy | Tense signature |
+|--------|----------|-----------------|
+| Technical | 78.2% | base 46%, gerund 30%, past 24% |
+| Creative | 81.8% | past 80%, gerund 16%, base 4% |
+| Ethical | 45.5% | base 55%, past 25%, gerund 20% |
+| Casual | 10.9% | past 70%, base 17%, gerund 14% |
+
+Overall: 54.1% (versus 25% random chance).
+
+Confusion axes:
+- Ethical → Technical: both use base-form verbs heavily (prescriptive vs imperative register share the same grammar profile)
+- Casual → Creative: both use past tense heavily (anecdotal vs narrative register share the same grammar profile)
+
+Conclusion: grammar-based classification is a strong first pass for technical (78%) and creative (82%). The 1B model is specifically needed for the ethical/technical and casual/creative axes.
+
+**LEK-Gemma3-1B-v2 benchmark** (M3 Ultra, temp=0.05):
+- Domain classification: 75% across three evaluation rounds, consistent
+- Article correctness T/F: 100% (three cases)
+- Irregular base forms A/B: 100% (two cases)
+- Dead zones: pattern fill (0%), tense detection (50%), generative output (unreliable)
+
+At 0.17s per classification, a single M3 can pre-sort approximately 5,000 sentences per second. The 88K Phase 0 seed corpus would take approximately 15–18 seconds.
+
+**`ClassifyCorpus()` added to `classify.go`**
+
+Streaming JSONL input → batch classification via `inference.TextModel` → JSONL output with `domain_1b` field. Configurable batch size, prompt field, prompt template. Mock-testable via `inference.TextModel` interface.
+
+Integration test results: 50 prompts classified in 625ms (80 prompts/second), all 50 technical prompts correctly labelled as `"technical"`.
+
+**`CalibrateDomains()` added to `calibrate.go`** — commit 3b7ef9d
+
+Accepts two `TextModel` instances (model A = 1B, model B = 27B), classifies the full corpus sequentially with each model (A then B, to manage memory for large models), and computes agreement metrics, confusion pairs, and accuracy against ground truth.
+ +Integration corpus: 500 samples (220 ground-truth + 280 unlabelled). Soft assertion: agreement rate greater than 50%. + +**Virgil review fixes applied**: go.mod cleanup, prefix collision fix in `mapTokenToDomain()`, short-mode skip in integration tests, accuracy assertion on 5 items minimum. + +**Article/irregular validator** — single-token classification via `m.Generate()` at temp=0.05 for use as a lightweight grammar validator in the forward composition path. + +--- + +## Phase 2b: Reference Distributions + +**Commit c3e9153** — Reference distribution builder, imprint comparator, anomaly detection. + +**`BuildReferences()` in `reversal/reference.go`** + +Tokenises classified samples, builds imprints, groups by domain, computes centroid and per-key variance for each domain. Centroid is computed by accumulating all map fields then normalising (L1 norm). Variance is sample variance, prefixed by component. + +**`Compare()` and `Classify()`** + +Three distance metrics between a query imprint and each reference centroid: +- Cosine similarity via `Similar()` (weighted, same component weights as imprint comparison) +- Symmetric KL divergence (Jensen-Shannon style, epsilon-smoothed at 1e-10) +- Mahalanobis distance (variance-normalised Euclidean, falls back to unit variance when variance map is absent) + +`Classify()` ranks by cosine similarity and returns the best domain plus the confidence margin (gap between 1st and 2nd similarity scores). + +**`DetectAnomalies()` in `reversal/anomaly.go`** + +Compares model-assigned domain labels against imprint-based classification. Flags mismatches. Returns per-sample `AnomalyResult` and aggregate `AnomalyStats`. + +Validated: a creative sentence tagged as technical by the model is correctly flagged as an anomaly with confidence 0.37. 
+ +Key findings from this phase: +- Grammar alone separates technical from creative (cosine similarity gap of 0.21) +- Ethical/technical and casual/creative overlap persists — confirms the 1B model is required for those axes +- Anomaly detection is a training signal for human review or 27B spot-checking +- `ReferenceSet` API is the bridge to the Poindexter trust verification pipeline + +Test coverage: technical vs creative centroid distance test, creative-mislabelled-as-technical anomaly test, single-domain degenerate case, KL identity test (symmetric KL of identical distributions ≈ 0.0), Mahalanobis unit-variance fallback test. + +--- + +## Phase 3: Multi-Language + +**Grammar table format specification** — `docs/grammar-table-spec.md` + +Full JSON schema documenting all `gram.*` sections: verb, noun, article, word, punct, signal, number. Includes detection rules (how the loader identifies verb vs noun vs plural objects), fallback chain documentation, dual-class word guidance, and step-by-step instructions for adding a new language. + +**French grammar tables** — `locales/fr.json` + +- 50 verb conjugations (passé composé participle form for `past`, présent participe for `gerund`) +- 24 gendered nouns with `"m"` or `"f"` gender +- Gendered articles: `by_gender: {"m": "le", "f": "la"}`, indefinite `"un"` (vowel and default identical in French) +- Punctuation: `label: " :"` (French typographic convention: space before colon) +- Full `gram.signal` lists with French determiners and auxiliaries + +--- + +## Known Limitations + +**Grammar-based classification ceiling** + +At 54.1% overall accuracy, the imprint alone cannot distinguish ethical from technical or casual from creative. These axes require semantic understanding that grammar features cannot provide. The 1B model addresses this, but with its own ceiling at 75% domain accuracy. 
+ +**Tier 2 and tier 3 are English-only** + +The `irregularVerbs` and `irregularNouns` Go maps and the regular morphology rules encode English patterns. For French, German, Spanish, or other languages, all irregular forms must be in the JSON grammar tables. A French word not in `locales/fr.json` will fall through to the English irregular maps (unlikely to match) and then to English morphology rules (will produce wrong output). + +**French reversal** + +Elision (`l'`) and plural articles (`les`, `des`) are not handled by the current `Article()` function or the reversal tokeniser. The `by_gender` article map supports gendered articles for composition, but the reversal tokeniser's `MatchArticle()` only checks `IndefiniteDefault`, `IndefiniteVowel`, and `Definite`. French reversal is therefore incomplete. + +**Dual-class expansion candidates not yet measured** + +Twenty additional words are candidates for the dual-class set: `patch`, `release`, `update`, `change`, `merge`, `push`, `pull`, `tag`, `log`, `watch`, `link`, `host`, `import`, `export`, `process`, `function`, `handle`, `trigger`, `stream`, `queue`. The decision to add any of them should be based on measured imprint drift in the 88K seed corpus rather than intuition. + +**Multiplier allocation cost** + +`Multiplier.Expand` allocates 63 objects for a four-word sentence. This is acceptable at current usage volumes but would become a bottleneck if called at high frequency. Token slice pooling is the obvious mitigation. + +**88K seed corpus not yet processed** + +The Phase 0 LEM seed corpus (88K sentences) has not been run through the pre-sort pipeline or used to build reference distributions. The current reference distributions and classification benchmarks are based on 220–500 manually curated sentences. + +--- + +## Future Considerations + +These are not tasked. They represent the natural next work given what has been built. 
+ +**Expanded dual-class words** + +Measure imprint drift on the 88K seeds for the 20 candidate words listed above. Add only those that show statistically meaningful drift between verb and noun roles — adding words that do not cause imprint changes has no benefit and increases disambiguation overhead. + +**French reversal** + +Extend `Article()` to handle elision (`l'` before vowel-initial nouns) and plural forms (`les`, `des`). Update `MatchArticle()` in the reversal tokeniser to recognise the full French article set including gendered and plural variants. + +**88K seed corpus processing** + +Run `ClassifyCorpus()` against the full Phase 0 seed corpus to produce domain-tagged JSONL. Use the output to call `BuildReferences()` and produce reference distributions grounded in real data rather than the 220-sentence hand-curated set. This would make anomaly detection and imprint-based classification significantly more reliable. + +**Corpus-derived word priors** + +`SignalData.Priors` (`map[string]map[string]float64`) is reserved for per-word priors derived from corpus frequencies. Currently unused. A corpus-derived prior would allow, for example, `"run"` to carry a higher verb prior in technical contexts based on observed frequencies rather than the fixed 0.02 default prior.