Document full JSON schema for gram.* keys in docs/grammar-table-spec.md. Add French grammar tables (50 verbs, 24 gendered nouns, signals). Extend loader to parse article by_gender map. Completes Phase 3. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8.7 KiB
Grammar Table Format Specification
JSON schema for gram.* keys in go-i18n locale files. This document defines the contract new languages must implement to work with the grammar engine.
File Location
Locale files live in locales/<lang>.json and are embedded at compile time. The language code follows BCP 47 (e.g. en, fr, de, es, zh).
Top-Level Structure
{
"gram": {
"verb": { ... },
"noun": { ... },
"article": { ... },
"word": { ... },
"punct": { ... },
"signal": { ... },
"number": { ... }
},
"prompt": { ... },
"time": { ... },
"lang": { ... }
}
Only the gram block is documented here. The prompt, time, and lang blocks are standard translation keys that consumers manage.
Critical: The gram.* structure MUST remain nested. The loader extracts grammar data before flattening. If you flatten gram.verb.delete.past to a dot-separated key, the grammar engine breaks silently.
gram.verb — Verb Conjugations
Maps verb base forms to their inflected forms. Used by PastTense(), Gerund(), and the reversal tokeniser.
"verb": {
"<base>": {
"base": "<string>",
"past": "<string>",
"gerund": "<string>"
}
}
| Field | Required | Description |
|---|---|---|
base |
Optional | The infinitive form. Defaults to the key name. |
past |
Required | Simple past tense form. |
gerund |
Required | Present participle (-ing form). |
Example (English):
"delete": { "base": "delete", "past": "deleted", "gerund": "deleting" },
"build": { "base": "build", "past": "built", "gerund": "building" }
Example (French) — would extend to include mood/person:
"supprimer": { "base": "supprimer", "past": "supprimé", "gerund": "supprimant" }
Detection: The loader identifies verb objects by the presence of "past" or "gerund" keys (isVerbFormObject()).
Fallback chain: JSON gram.verb → Go irregularVerbs map → regular morphology rules.
gram.noun — Noun Plural Forms
Maps noun base forms to their singular/plural forms. Used by Pluralize(), PluralForm(), and the reversal tokeniser.
"noun": {
"<base>": {
"one": "<string>",
"other": "<string>",
"gender": "<string>"
}
}
| Field | Required | Description |
|---|---|---|
one |
Required | Singular form. |
other |
Required | Plural form. |
gender |
Optional | Grammatical gender ("m", "f", "n"). Unused in English. |
Example (English):
"file": { "one": "file", "other": "files" },
"person": { "one": "person", "other": "people" }
Example (French) — gender required for article agreement:
"fichier": { "one": "fichier", "other": "fichiers", "gender": "m" },
"branche": { "one": "branche", "other": "branches", "gender": "f" }
Detection: The loader identifies noun objects by the presence of "one" and "other" keys.
Fallback chain: JSON gram.noun → Go irregularNouns map → regular pluralisation rules.
gram.article — Article Rules
Defines article forms. Used by Article() and the reversal tokeniser.
"article": {
"indefinite": {
"default": "<string>",
"vowel": "<string>"
},
"definite": "<string>",
"by_gender": {
"<gender>": "<string>"
}
}
| Field | Required | Description |
|---|---|---|
indefinite.default |
Required | Default indefinite article ("a" in English). |
indefinite.vowel |
Required | Vowel-sound indefinite article ("an" in English). |
definite |
Required | Definite article ("the" in English). |
by_gender |
Optional | Gender-specific articles (French: {"m": "le", "f": "la"}). |
Example (English):
"article": {
"indefinite": { "default": "a", "vowel": "an" },
"definite": "the"
}
Example (French):
"article": {
"indefinite": { "default": "un", "vowel": "un" },
"definite": "le",
"by_gender": { "m": "le", "f": "la" }
}
gram.word — Domain Vocabulary
Maps lookup keys to their display forms. Used for acronyms, proper nouns, and multi-word terms that need exact casing or spacing. The reversal tokeniser classifies these as TokenWord.
"word": {
"<key>": "<display_form>"
}
Keys are lowercase with underscores for multi-word terms. Values are the exact display string.
Example:
"url": "URL",
"api": "API",
"go_mod": "go.mod",
"dry_run": "dry run",
"up_to_date": "up to date"
Note: Do not add app-specific terms here. Only grammar-level vocabulary (acronyms, proper nouns) belongs in gram.word. Consumer apps manage their own translation keys outside gram.*.
gram.punct — Punctuation Rules
Language-specific punctuation patterns used by Label() and Progress().
"punct": {
"label": "<string>",
"progress": "<string>"
}
| Field | Required | Description |
|---|---|---|
label |
Required | Suffix for label formatting. English: ":", French: " :". |
progress |
Required | Suffix for progress indicators. Typically "...". |
Example (English):
"punct": { "label": ":", "progress": "..." }
Example (French) — space before colon:
"punct": { "label": " :", "progress": "..." }
gram.signal — Disambiguation Signals
Word lists used by the reversal tokeniser's two-pass disambiguation algorithm. These resolve dual-class words (e.g. "commit" as verb vs noun) based on surrounding context.
"signal": {
"noun_determiner": ["<string>", ...],
"verb_auxiliary": ["<string>", ...],
"verb_infinitive": ["<string>", ...]
}
| Field | Required | Description |
|---|---|---|
noun_determiner |
Recommended | Words that signal a following noun (articles, possessives, quantifiers). |
verb_auxiliary |
Recommended | Words that signal a following verb (modals, auxiliaries, contractions). |
verb_infinitive |
Recommended | Infinitive markers ("to" in English). |
Example (English):
"signal": {
"noun_determiner": ["the", "a", "an", "this", "that", "my", "your", ...],
"verb_auxiliary": ["is", "are", "will", "can", "must", "don't", "can't", ...],
"verb_infinitive": ["to"]
}
Fallback: If a signal list is missing or empty, the tokeniser uses hardcoded English defaults. Each signal list falls back independently — partial locale data is handled gracefully.
gram.number — Number Formatting
Locale-specific number formatting rules.
"number": {
"thousands": "<string>",
"decimal": "<string>",
"percent": "<string>"
}
| Field | Required | Description |
|---|---|---|
thousands |
Required | Thousands separator. English: ",", German: ".". |
decimal |
Required | Decimal separator. English: ".", German: ",". |
percent |
Required | Printf format string. Use %s%% for "42%". |
Example (English):
"number": { "thousands": ",", "decimal": ".", "percent": "%s%%" }
Example (German):
"number": { "thousands": ".", "decimal": ",", "percent": "%s %%" }
Adding a New Language
- Create
locales/<lang>.jsonwith the fullgramblock. - All
gram.verbentries needpastandgerundfields at minimum. - All
gram.nounentries needoneandotherfields. Addgenderif the language has grammatical gender. - Set
gram.article— useby_genderfor gendered article systems. - Set
gram.punct— pay attention to spacing conventions (French colon spacing, etc.). - Populate
gram.signallists for the tokeniser. Without these, disambiguation falls back to English defaults. - Add
prompt,time, andlangblocks as needed.
The grammar engine uses a 3-tier lookup for verbs and nouns:
- JSON grammar tables (
gram.verb,gram.noun) — checked first - Go irregular maps (
irregularVerbs,irregularNouns) — English-only fallback - Regular morphology rules — algorithmic, currently English-only
For non-English languages, the JSON tables must be comprehensive since tiers 2 and 3 only cover English.
Dual-Class Words
Words that function as both verbs and nouns (e.g. "commit", "build", "test") must appear in BOTH gram.verb and gram.noun. The tokeniser's disambiguation algorithm resolves which role a word plays based on context signals.
See FINDINGS.md "Dual-Class Word Disambiguation" for the algorithm details.
Validation
There is no explicit schema validation. The loader uses safe type assertions and silently skips malformed entries. To verify a new locale file:
go test -v -run "TestGrammarData" ./...
The grammar tests check that loaded data has expected field counts and known values.