diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..78def92
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,67 @@
+# CLAUDE.md
+
+## What This Is
+
+Grammar-aware internationalisation engine for Go. Module: `forge.lthn.ai/core/go-i18n`
+
+This is a **grammar engine** — it provides primitives for composing and reversing grammatically correct text. It is NOT a translation file manager. Consumers bring their own translations.
+
+## Commands
+
+```bash
+go test ./... # Run all tests
+go test -v ./reversal/ # Reversal engine tests
+go test -bench=. ./... # Benchmarks (when added)
+```
+
+## Critical Rules
+
+### DO NOT flatten locale JSON files
+
+The grammar engine depends on nested `gram.*` structure:
+
+```json
+{
+  "gram": {
+    "verb": {
+      "delete": { "past": "deleted", "gerund": "deleting" }
+    }
+  }
+}
+```
+
+If you flatten this to `"gram.verb.delete.past": "deleted"`, the grammar engine breaks silently. **This is the #1 cause of agent-introduced bugs.**
+
+### This library does not manage consumer translations
+
+go-i18n provides grammar primitives. Apps using it (core/cli, etc.) manage their own translation files. Do not add app-specific translation keys to `locales/en.json` — only `gram.*` grammar data belongs there.
+
+## Architecture
+
+| Package | Purpose |
+|---------|---------|
+| Root | Forward composition: T(), grammar primitives, handlers, service |
+| `reversal/` | Reverse grammar: tokeniser, imprint, multiplier |
+| `locales/` | Grammar tables (JSON) — only `gram.*` data |
+| `docs/plans/` | Design documents |
+
+## Key API
+
+- `T(key, args...)` — Translate with namespace handlers
+- `PastTense(verb)`, `Gerund(verb)`, `Pluralize(noun, n)`, `Article(word)` — Grammar primitives
+- `reversal.NewTokeniser().Tokenise(text)` — Reverse grammar lookup
+- `reversal.NewImprint(tokens)` — Feature vector projection
+- `reversal.NewMultiplier().Expand(text)` — Training data augmentation
+
+## Coding Standards
+
+- UK English (colour, organisation, centre)
+- `go test ./...` must pass before commit
+- Conventional commits: `type(scope): description`
+- Co-Author: `Co-Authored-By: Virgil `
+
+## Task Queue
+
+See `TODO.md` for dispatched tasks from core/go orchestration.
+See `FINDINGS.md` for research notes and architectural decisions.
+See the [wiki](https://forge.lthn.ai/core/go-i18n/wiki) for full architecture docs.
diff --git a/FINDINGS.md b/FINDINGS.md
new file mode 100644
index 0000000..fb313ad
--- /dev/null
+++ b/FINDINGS.md
@@ -0,0 +1,40 @@
+# FINDINGS.md — go-i18n Research & Discovery
+
+Record findings, gaps, and architectural decisions here as work progresses.
+
+---
+
+## 2026-02-19: Initial Assessment (Virgil)
+
+### Current State
+
+- 5,800 lines across 32 files (14 test files)
+- All tests pass
+- Only dependency: `golang.org/x/text`
+- Grammar engine is solid: forward composition + reversal + imprint + multiplier
+
+### Architecture
+
+go-i18n is a **grammar engine**, not a translation file manager. Consumers bring their own translations. The library provides:
+
+1. **Forward composition** — Grammar primitives that compose grammatically correct text
+2. **Reverse grammar** — Tokeniser reads grammar tables backwards to extract structure
+3. **GrammarImprint** — Lossy feature vector projection (content to grammar fingerprint)
+4. **Multiplier** — Deterministic training data augmentation (no LLM calls)
+
+### Key Gaps Identified
+
+| Gap | Impact | Notes |
+|-----|--------|-------|
+| No CLAUDE.md | High | Agents don't know the rules, will flatten locale files |
+| Dual-class word ambiguity | Medium | "file" as verb vs noun, "run" as verb vs noun |
+| No benchmarks | Medium | No perf baselines for hot-path usage (TIM, Poindexter) |
+| No reference distributions | High | Can't calibrate imprints without scored seed data |
+| Only English grammar tables | Medium | Reversal only works with loaded GrammarData |
+
+### Sacred Rules
+
+- `gram.*` keys in locale JSON MUST remain nested — flattening breaks the grammar engine
+- Irregular forms in grammar tables override regular morphological rules
+- Round-trip property must hold: forward(base) then reverse must recover base
+- go-i18n does NOT ship or manage consumer translation files
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..d38ea36
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,31 @@
+# TODO.md — go-i18n Task Queue
+
+Dispatched from core/go orchestration. Pick up tasks in order.
+
+---
+
+## Phase 1: Harden the Engine
+
+- [ ] **Add CLAUDE.md** — Document the grammar engine contract: what it is (grammar primitives + reversal), what it isn't (translation file manager). Include build/test commands, the gram.* sacred rule, and the agent-flattening prohibition.
+- [ ] **Ambiguity resolution for dual-class words** — Words like "run", "file", "test", "check", "build" are both verb and noun. Tokeniser currently picks first match. Need context-aware disambiguation (look at surrounding tokens: article before = noun, after subject = verb).
+- [ ] **Extend irregular verb coverage** — Audit against common dev/ops vocabulary. Missing forms cause silent fallback to regular rules which may produce wrong output (e.g. "builded" instead of "built").
+- [ ] **Add benchmarks** — `grammar_test.go` and `reversal/tokeniser_test.go` need `Benchmark*` functions. The engine will run in hot paths (TIM, Poindexter) — need baseline numbers.
+
+## Phase 2: Reference Distribution
+
+- [ ] **Reference distribution builder** — Process the 88K scored seeds from LEM Phase 0 through the tokeniser + imprint pipeline. Output: per-category (ethical, technical, harmful) reference distributions stored as JSON. This calibrates what "normal" grammar looks like.
+- [ ] **Imprint comparator** — Given a new text and reference distributions, compute distance metrics (cosine, KL divergence, Mahalanobis). Return a classification signal with confidence score. This is the Poindexter integration point.
+
+## Phase 3: Multi-Language
+
+- [ ] **Grammar table format spec** — Document the exact JSON schema for `gram.*` keys so new languages can be added. Currently only inferred from `en.json`.
+- [ ] **French grammar tables** — First non-English language. French has gendered nouns, complex verb conjugation, elision rules. Good stress test for the grammar engine's language-agnostic design.
+
+---
+
+## Workflow
+
+1. Virgil in core/go writes tasks here after research
+2. This repo's session picks up tasks in phase order
+3. Mark `[x]` when done, note commit hash
+4. New discoveries → add tasks, flag in FINDINGS.md
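
The flattening rule stated in CLAUDE.md and repeated under "Sacred Rules" in FINDINGS.md can be illustrated with a small Go sketch. It assumes a loader that decodes the nested `gram.verb` shape into structs using `encoding/json`. The names `verbForms`, `grammarFile`, and `pastFromJSON` are hypothetical, invented for this example, and are not go-i18n's actual loader. Under that assumption, a flattened file still decodes without error but carries no verb data, which is one plausible way the breakage stays silent.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// verbForms and grammarFile are hypothetical types for illustration only;
// they are NOT go-i18n's real loader types.
type verbForms struct {
	Past   string `json:"past"`
	Gerund string `json:"gerund"`
}

type grammarFile struct {
	Gram struct {
		Verb map[string]verbForms `json:"verb"`
	} `json:"gram"`
}

// pastFromJSON decodes locale JSON and returns the past form of verb,
// or "" when the nested gram.verb.<verb> entry is absent.
func pastFromJSON(data []byte, verb string) (string, error) {
	var f grammarFile
	if err := json.Unmarshal(data, &f); err != nil {
		return "", err
	}
	// Reading from a nil map is safe in Go and yields the zero value.
	return f.Gram.Verb[verb].Past, nil
}

// nested mirrors the structure shown in CLAUDE.md.
const nested = `{"gram":{"verb":{"delete":{"past":"deleted","gerund":"deleting"}}}}`

// flattened is the form the rule prohibits: dotted keys at the top level.
const flattened = `{"gram.verb.delete.past":"deleted","gram.verb.delete.gerund":"deleting"}`

func main() {
	for _, doc := range []string{nested, flattened} {
		past, err := pastFromJSON([]byte(doc), "delete")
		fmt.Printf("past=%q err=%v\n", past, err)
	}
}
```

With a decoder like this, the nested document yields `"deleted"` while the flattened one yields an empty string and a nil error, because unknown top-level keys are ignored by `json.Unmarshal`. A post-load check that the verb table is non-empty would be one way to turn that silent failure into a visible one.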