# TODO.md — Core Go Dispatch Queue

Tasks dispatched from core/go orchestration to satellite repos.
Format: `- [ ] REPO: task description` / `- [x]` when done.

---

## go-i18n (forge.lthn.ai/core/go-i18n)

### Phase 1: Harden the Engine

- [ ] **go-i18n: Add CLAUDE.md** — Document the grammar engine contract: what it is (grammar primitives + reversal), what it isn't (translation file manager). Include build/test commands, the gram.* sacred rule, and the agent-flattening prohibition.
- [ ] **go-i18n: Ambiguity resolution for dual-class words** — Words like "run", "file", "test", "check", "build" are both verb and noun. The tokeniser currently picks the first match. It needs context-aware disambiguation based on surrounding tokens: a preceding article implies a noun, a preceding subject implies a verb.
- [ ] **go-i18n: Extend irregular verb coverage** — Audit against common dev/ops vocabulary. Missing forms fall back silently to the regular rules, which can produce wrong output (e.g. "builded" instead of "built").
- [ ] **go-i18n: Add benchmarks** — `grammar_test.go` and `reversal/tokeniser_test.go` need `Benchmark*` functions. The engine will run in hot paths (TIM, Poindexter) — need baseline numbers.
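
A minimal sketch of the context rule from the dual-class task above (article before → noun, subject before → verb). Everything here is hypothetical and for illustration only: the `WordClass` type, the `Disambiguate` function, and the word lists are not taken from go-i18n, whose tokeniser types and grammar tables may look quite different.

```go
package main

import (
	"fmt"
	"strings"
)

// WordClass is a hypothetical label type for this sketch.
type WordClass int

const (
	Unknown WordClass = iota
	Noun
	Verb
)

func (c WordClass) String() string {
	switch c {
	case Noun:
		return "noun"
	case Verb:
		return "verb"
	}
	return "unknown"
}

// Illustrative word lists only; the real engine would read these from its grammar tables.
var (
	dualClass = map[string]bool{"run": true, "file": true, "test": true, "check": true, "build": true}
	articles  = map[string]bool{"a": true, "an": true, "the": true}
	subjects  = map[string]bool{"i": true, "you": true, "we": true, "they": true}
)

// Disambiguate classifies tokens[i] by looking at the token before it.
// It returns Unknown when the word is not dual-class or the context is
// inconclusive, so a caller can fall back to first-match behaviour.
func Disambiguate(tokens []string, i int) WordClass {
	if i <= 0 || i >= len(tokens) {
		return Unknown
	}
	if !dualClass[strings.ToLower(tokens[i])] {
		return Unknown
	}
	prev := strings.ToLower(tokens[i-1])
	switch {
	case articles[prev]:
		return Noun
	case subjects[prev]:
		return Verb
	}
	return Unknown
}

func main() {
	fmt.Println(Disambiguate(strings.Fields("the build failed"), 1))   // noun
	fmt.Println(Disambiguate(strings.Fields("we build the image"), 1)) // verb
}
```

Returning `Unknown` instead of guessing keeps the rule additive: existing behaviour stays the same wherever the context gives no signal.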

### Phase 2: Reference Distribution + 1B Classification Pipeline

#### 2a: 1B Pre-Classification (based on LEK-1B benchmarks)

- [ ] **go-i18n: Classification benchmark suite** — `classify_bench_test.go` with 200+ domain-tagged sentences. Categories: {technical, creative, ethical, casual}. Ground truth for calibrating 1B pre-tags.
- [ ] **go-i18n: 1B pre-sort pipeline tool** — CLI/func that reads JSONL corpus, classifies via LEK-Gemma3-1B, writes back with `domain_1b` field. Target: ~5K sentences/sec on M3.
- [ ] **go-i18n: 1B vs 27B calibration check** — Sample 500 sentences, classify with both, measure agreement. Baseline from benchmarks is 75%; technical↔creative is a known weak spot.
- [x] **go-i18n: Article/irregular validator** — `validate.go` + `validate_test.go` (14 tests). `ValidateArticle()`, `ValidateIrregular()`, batch variants. Commit `3c55d91`.
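
The agreement measurement in the calibration check could be sketched roughly as follows. `Agreement` is a hypothetical helper, not an existing go-i18n function, and it assumes both models' labels are already available as parallel string slices. Besides the overall rate it counts each disagreeing label pair, which is one way to surface a weak spot such as technical↔creative.

```go
package main

import (
	"errors"
	"fmt"
)

// Agreement returns the fraction of positions where the two label slices
// match, plus a count of every (small, large) pair that disagrees.
func Agreement(small, large []string) (float64, map[[2]string]int, error) {
	if len(small) != len(large) {
		return 0, nil, fmt.Errorf("label count mismatch: %d vs %d", len(small), len(large))
	}
	if len(small) == 0 {
		return 0, nil, errors.New("no labels to compare")
	}
	confusions := make(map[[2]string]int)
	matches := 0
	for i := range small {
		if small[i] == large[i] {
			matches++
			continue
		}
		confusions[[2]string{small[i], large[i]}]++
	}
	return float64(matches) / float64(len(small)), confusions, nil
}

func main() {
	// Toy labels for illustration; real input would be the sampled sentences.
	small := []string{"technical", "creative", "ethical", "casual"}
	large := []string{"technical", "technical", "ethical", "casual"}

	rate, confusions, err := Agreement(small, large)
	if err != nil {
		panic(err)
	}
	fmt.Printf("agreement: %.2f\n", rate) // agreement: 0.75
	for pair, n := range confusions {
		fmt.Printf("%s -> %s: %d\n", pair[0], pair[1], n)
	}
}
```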

#### 2b: Reference Distributions

- [ ] **go-i18n: Reference distribution builder** — Process 88K scored seeds through tokeniser + imprint. Pre-sort by `domain_1b` tag. Output: per-category reference distributions as JSON.
- [ ] **go-i18n: Imprint comparator** — Distance metrics (cosine, KL divergence, Mahalanobis) against reference distributions. Classification signal with confidence. Poindexter integration point.
- [ ] **go-i18n: Cross-domain anomaly detection** — Flag texts where the 1B domain tag disagrees with the imprint classification. A disagreement is either a training signal or genuine cross-domain text; both are valuable.
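
Two of the distance metrics named for the imprint comparator, cosine similarity and KL divergence, can be sketched as below. This assumes an imprint can be flattened into a `[]float64` of equal length to the reference distribution, which the tasks above do not confirm. The function names and the `eps` smoothing parameter are hypothetical. Mahalanobis distance is left out because it needs a covariance estimate per reference distribution.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// CosineSimilarity returns the cosine of the angle between two equal-length vectors.
func CosineSimilarity(a, b []float64) (float64, error) {
	if len(a) != len(b) || len(a) == 0 {
		return 0, errors.New("vectors must be non-empty and of equal length")
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("cosine similarity is undefined for a zero vector")
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB)), nil
}

// KLDivergence returns D_KL(p || q) in nats. eps is added to every bin before
// normalising so that empty bins in q do not yield an infinite result; with
// eps == 0 the result is +Inf wherever p has mass and q has none.
func KLDivergence(p, q []float64, eps float64) (float64, error) {
	if len(p) != len(q) || len(p) == 0 {
		return 0, errors.New("distributions must be non-empty and of equal length")
	}
	var sumP, sumQ float64
	for i := range p {
		if p[i] < 0 || q[i] < 0 {
			return 0, errors.New("distributions must be non-negative")
		}
		sumP += p[i] + eps
		sumQ += q[i] + eps
	}
	if sumP == 0 || sumQ == 0 {
		return 0, errors.New("distributions must have non-zero mass")
	}
	var d float64
	for i := range p {
		pi := (p[i] + eps) / sumP
		qi := (q[i] + eps) / sumQ
		if pi > 0 { // bins with no mass in p contribute nothing
			d += pi * math.Log(pi/qi)
		}
	}
	return d, nil
}

func main() {
	// Toy vectors for illustration, not real imprint data.
	imprint := []float64{0.5, 0.3, 0.2}
	reference := []float64{0.4, 0.4, 0.2}

	cos, err := CosineSimilarity(imprint, reference)
	if err != nil {
		panic(err)
	}
	kl, err := KLDivergence(imprint, reference, 1e-9)
	if err != nil {
		panic(err)
	}
	fmt.Printf("cosine=%.4f kl=%.4f\n", cos, kl)
}
```

KL divergence is asymmetric, so the argument order (imprint against reference, or the reverse) should be fixed once and used consistently across categories.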

### Phase 3: Multi-Language

- [ ] **go-i18n: Grammar table format spec** — Document the exact JSON schema for `gram.*` keys so new languages can be added. Currently only inferred from `en.json`.
- [ ] **go-i18n: French grammar tables** — First non-English language. French has gendered nouns, complex verb conjugation, elision rules. Good stress test for the grammar engine's language-agnostic design.

---

## core/go (this repo)

- [ ] **core/go: Remove pkg/i18n dependency** — Once core/cli imports go-i18n directly, remove `pkg/i18n/` from this repo. The locale files are bad data and shouldn't be migrated.
- [ ] **core/go: Update go.work** — Add go-i18n to the workspace for local development (`go work use ../go-i18n`).

---

## Workflow

1. Virgil (this session) writes tasks above after research
2. Second GoLand session opens the target repo and works from this TODO
3. When a task is done, mark `[x]` and note the commit/PR
4. If a task reveals new work, add it here
5. Scale to other repos once pattern is proven