go-i18n/docs/grammar-table-spec.md
Snider 24e60104d1 feat: add grammar table spec and French locale
Document full JSON schema for gram.* keys in docs/grammar-table-spec.md.
Add French grammar tables (50 verbs, 24 gendered nouns, signals).
Extend loader to parse article by_gender map. Completes Phase 3.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 17:57:22 +00:00

8.7 KiB

Grammar Table Format Specification

JSON schema for gram.* keys in go-i18n locale files. This document defines the contract new languages must implement to work with the grammar engine.

File Location

Locale files live in locales/<lang>.json and are embedded at compile time. The language code follows BCP 47 (e.g. en, fr, de, es, zh).

Top-Level Structure

{
  "gram": {
    "verb":    { ... },
    "noun":    { ... },
    "article": { ... },
    "word":    { ... },
    "punct":   { ... },
    "signal":  { ... },
    "number":  { ... }
  },
  "prompt": { ... },
  "time":   { ... },
  "lang":   { ... }
}

Only the gram block is documented here. The prompt, time, and lang blocks are standard translation keys that consumers manage.

Critical: The gram.* structure MUST remain nested. The loader extracts grammar data before flattening. If you flatten gram.verb.delete.past to a dot-separated key, the grammar engine breaks silently.


gram.verb — Verb Conjugations

Maps verb base forms to their inflected forms. Used by PastTense(), Gerund(), and the reversal tokeniser.

"verb": {
  "<base>": {
    "base":   "<string>",
    "past":   "<string>",
    "gerund": "<string>"
  }
}
Field Required Description
base Optional The infinitive form. Defaults to the key name.
past Required Simple past tense form.
gerund Required Present participle (-ing form).

Example (English):

"delete": { "base": "delete", "past": "deleted", "gerund": "deleting" },
"build":  { "base": "build",  "past": "built",   "gerund": "building" }

Example (French) — would extend to include mood/person:

"supprimer": { "base": "supprimer", "past": "supprimé", "gerund": "supprimant" }

Detection: The loader identifies verb objects by the presence of "past" or "gerund" keys (isVerbFormObject()).

Fallback chain: JSON gram.verb → Go irregularVerbs map → regular morphology rules.


gram.noun — Noun Plural Forms

Maps noun base forms to their singular/plural forms. Used by Pluralize(), PluralForm(), and the reversal tokeniser.

"noun": {
  "<base>": {
    "one":    "<string>",
    "other":  "<string>",
    "gender": "<string>"
  }
}
Field Required Description
one Required Singular form.
other Required Plural form.
gender Optional Grammatical gender ("m", "f", "n"). Unused in English.

Example (English):

"file":   { "one": "file",   "other": "files" },
"person": { "one": "person", "other": "people" }

Example (French) — gender required for article agreement:

"fichier": { "one": "fichier", "other": "fichiers", "gender": "m" },
"branche": { "one": "branche", "other": "branches", "gender": "f" }

Detection: The loader identifies noun objects by the presence of "one" and "other" keys.

Fallback chain: JSON gram.noun → Go irregularNouns map → regular pluralisation rules.


gram.article — Article Rules

Defines article forms. Used by Article() and the reversal tokeniser.

"article": {
  "indefinite": {
    "default": "<string>",
    "vowel":   "<string>"
  },
  "definite": "<string>",
  "by_gender": {
    "<gender>": "<string>"
  }
}
Field Required Description
indefinite.default Required Default indefinite article ("a" in English).
indefinite.vowel Required Vowel-sound indefinite article ("an" in English).
definite Required Definite article ("the" in English).
by_gender Optional Gender-specific articles (French: {"m": "le", "f": "la"}).

Example (English):

"article": {
  "indefinite": { "default": "a", "vowel": "an" },
  "definite": "the"
}

Example (French):

"article": {
  "indefinite": { "default": "un", "vowel": "un" },
  "definite": "le",
  "by_gender": { "m": "le", "f": "la" }
}

gram.word — Domain Vocabulary

Maps lookup keys to their display forms. Used for acronyms, proper nouns, and multi-word terms that need exact casing or spacing. The reversal tokeniser classifies these as TokenWord.

"word": {
  "<key>": "<display_form>"
}

Keys are lowercase with underscores for multi-word terms. Values are the exact display string.

Example:

"url": "URL",
"api": "API",
"go_mod": "go.mod",
"dry_run": "dry run",
"up_to_date": "up to date"

Note: Do not add app-specific terms here. Only grammar-level vocabulary (acronyms, proper nouns) belongs in gram.word. Consumer apps manage their own translation keys outside gram.*.


gram.punct — Punctuation Rules

Language-specific punctuation patterns used by Label() and Progress().

"punct": {
  "label":    "<string>",
  "progress": "<string>"
}
Field Required Description
label Required Suffix for label formatting. English: ":", French: " :".
progress Required Suffix for progress indicators. Typically "...".

Example (English):

"punct": { "label": ":", "progress": "..." }

Example (French) — space before colon:

"punct": { "label": " :", "progress": "..." }

gram.signal — Disambiguation Signals

Word lists used by the reversal tokeniser's two-pass disambiguation algorithm. These resolve dual-class words (e.g. "commit" as verb vs noun) based on surrounding context.

"signal": {
  "noun_determiner": ["<string>", ...],
  "verb_auxiliary":   ["<string>", ...],
  "verb_infinitive":  ["<string>", ...]
}
Field Required Description
noun_determiner Recommended Words that signal a following noun (articles, possessives, quantifiers).
verb_auxiliary Recommended Words that signal a following verb (modals, auxiliaries, contractions).
verb_infinitive Recommended Infinitive markers ("to" in English).

Example (English):

"signal": {
  "noun_determiner": ["the", "a", "an", "this", "that", "my", "your", ...],
  "verb_auxiliary": ["is", "are", "will", "can", "must", "don't", "can't", ...],
  "verb_infinitive": ["to"]
}

Fallback: If a signal list is missing or empty, the tokeniser uses hardcoded English defaults. Each signal list falls back independently — partial locale data is handled gracefully.


gram.number — Number Formatting

Locale-specific number formatting rules.

"number": {
  "thousands": "<string>",
  "decimal":   "<string>",
  "percent":   "<string>"
}
Field Required Description
thousands Required Thousands separator. English: ",", German: ".".
decimal Required Decimal separator. English: ".", German: ",".
percent Required Printf format string. Use %s%% for "42%".

Example (English):

"number": { "thousands": ",", "decimal": ".", "percent": "%s%%" }

Example (German):

"number": { "thousands": ".", "decimal": ",", "percent": "%s %%" }

Adding a New Language

  1. Create locales/<lang>.json with the full gram block.
  2. All gram.verb entries need past and gerund fields at minimum.
  3. All gram.noun entries need one and other fields. Add gender if the language has grammatical gender.
  4. Set gram.article — use by_gender for gendered article systems.
  5. Set gram.punct — pay attention to spacing conventions (French colon spacing, etc.).
  6. Populate gram.signal lists for the tokeniser. Without these, disambiguation falls back to English defaults.
  7. Add prompt, time, and lang blocks as needed.

The grammar engine uses a 3-tier lookup for verbs and nouns:

  1. JSON grammar tables (gram.verb, gram.noun) — checked first
  2. Go irregular maps (irregularVerbs, irregularNouns) — English-only fallback
  3. Regular morphology rules — algorithmic, currently English-only

For non-English languages, the JSON tables must be comprehensive since tiers 2 and 3 only cover English.


Dual-Class Words

Words that function as both verbs and nouns (e.g. "commit", "build", "test") must appear in BOTH gram.verb and gram.noun. The tokeniser's disambiguation algorithm resolves which role a word plays based on context signals.

See FINDINGS.md "Dual-Class Word Disambiguation" for the algorithm details.


Validation

There is no explicit schema validation. The loader uses safe type assertions and silently skips malformed entries. To verify a new locale file:

go test -v -run "TestGrammarData" ./...

The grammar tests check that loaded data has expected field counts and known values.