go-i18n/docs/reversal.md
Snider 1f4883e92a
docs: add human-friendly documentation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 13:02:39 +00:00

| title | description |
| --- | --- |
| Reversal Engine | Decomposes inflected text back to base forms with grammatical metadata. |

Reversal Engine

The reversal engine (reversal/ package) converts inflected text back to base forms with grammatical metadata. It powers GrammarImprint, the Multiplier, and the classification pipeline.

The forward engine maps base forms to inflected forms:

PastTense("delete") -> "deleted"
Gerund("run")       -> "running"

The reversal engine reads those same tables backwards:

MatchVerb("deleted")  -> {Base: "delete", Tense: "past"}
MatchVerb("running")  -> {Base: "run",    Tense: "gerund"}

3-Tier Lookup Strategy

All reverse lookups follow the same three-tier pattern, stopping at the first match:

| Tier | Source | Example |
| --- | --- | --- |
| 1 | JSON grammar data (gram.verb.*, gram.noun.*) | "committed" -> past of "commit" |
| 2 | Irregular verb/noun Go maps (IrregularVerbs()) | "went" -> past of "go" |
| 3 | Regular morphology rules + round-trip verification | "processed" -> past of "process" |

JSON takes precedence -- if a verb appears in both en.json and the irregular Go map, the JSON form wins. This lets locale files override built-in rules.
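
The tier order can be sketched as a chain of lookups that returns on the first hit. This is a minimal illustration, not the package's implementation: `jsonPast`, `irregularPast`, `reverseRegularPast` and `lookupPast` are hypothetical names, the tables are toy data, and tier 3 is reduced to a naive "-ed" strip with no round-trip check.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stand-ins for the first two tiers. "went" is mapped to a
// different base in the JSON table purely to show that tier 1 wins.
var (
	jsonPast      = map[string]string{"committed": "commit", "went": "wend"}
	irregularPast = map[string]string{"went": "go", "ran": "run"}
)

// reverseRegularPast is a deliberately naive tier 3: strip a trailing "ed".
func reverseRegularPast(word string) (string, bool) {
	if len(word) > 3 && strings.HasSuffix(word, "ed") {
		return strings.TrimSuffix(word, "ed"), true
	}
	return "", false
}

// lookupPast tries each tier in order and stops at the first match.
func lookupPast(word string) (base string, tier int, ok bool) {
	if b, found := jsonPast[word]; found {
		return b, 1, true
	}
	if b, found := irregularPast[word]; found {
		return b, 2, true
	}
	if b, found := reverseRegularPast(word); found {
		return b, 3, true
	}
	return "", 0, false
}

func main() {
	for _, w := range []string{"committed", "went", "ran", "processed", "xyz"} {
		base, tier, ok := lookupPast(w)
		fmt.Printf("%-10s base=%q tier=%d ok=%v\n", w, base, tier, ok)
	}
}
```

Because the JSON table is consulted first, a locale file can shadow any built-in entry without touching the Go maps.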

Creating a Tokeniser

// English tokeniser (default)
tok := reversal.NewTokeniser()

// Language-specific tokeniser
tokEN := reversal.NewTokeniserForLang("en")

// With options (customWeights is supplied by the caller)
tokOpts := reversal.NewTokeniser(
    reversal.WithSignals(),                // Enable detailed SignalBreakdown on ambiguous tokens
    reversal.WithWeights(customWeights),   // Override disambiguation signal weights
)

The tokeniser builds six lookup maps at construction time: three inverse maps (inflected form to base), two sets of known base forms, and the domain vocabulary.

| Map | Example | Source |
| --- | --- | --- |
| pastToBase | "deleted" -> "delete" | Inverse of gram.verb.*.past |
| gerundToBase | "deleting" -> "delete" | Inverse of gram.verb.*.gerund |
| baseVerbs | "delete" -> true | All known verb bases |
| pluralToBase | "files" -> "file" | Inverse of gram.noun.*.other |
| baseNouns | "file" -> true | All known noun bases |
| words | "url" -> "URL" | Domain vocabulary |
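
Inverting a forward table is a single pass over its entries. The sketch below assumes a hypothetical `verbForms` struct standing in for one gram.verb.* entry; the `buildInverse` function and its signature are illustrative and not taken from the package.

```go
package main

import "fmt"

// verbForms is a hypothetical shape for one gram.verb.* entry.
type verbForms struct {
	Past   string
	Gerund string
}

// buildInverse walks a base -> forms table once and produces the
// inflected -> base maps plus the set of known bases.
func buildInverse(verbs map[string]verbForms) (pastToBase, gerundToBase map[string]string, baseVerbs map[string]bool) {
	pastToBase = make(map[string]string, len(verbs))
	gerundToBase = make(map[string]string, len(verbs))
	baseVerbs = make(map[string]bool, len(verbs))
	for base, f := range verbs {
		baseVerbs[base] = true
		if f.Past != "" {
			pastToBase[f.Past] = base
		}
		if f.Gerund != "" {
			gerundToBase[f.Gerund] = base
		}
	}
	return pastToBase, gerundToBase, baseVerbs
}

func main() {
	verbs := map[string]verbForms{
		"delete": {Past: "deleted", Gerund: "deleting"},
		"run":    {Past: "ran", Gerund: "running"},
	}
	past, gerund, bases := buildInverse(verbs)
	fmt.Println(past["deleted"], gerund["running"], bases["delete"])
}
```

Building the maps once at construction keeps every later lookup a constant-time map access.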

Token Types

TokenUnknown      // Unrecognised word
TokenVerb         // Matched verb (VerbInfo populated)
TokenNoun         // Matched noun (NounInfo populated)
TokenArticle      // "a", "an", "the"
TokenWord         // Domain word from gram.word map
TokenPunctuation  // "...", "?", "!", ":", ";", ","

Tokenise()

Splits text on whitespace and classifies each word using a two-pass algorithm.

tok := reversal.NewTokeniser()
tokens := tok.Tokenise("Deleted the configuration files successfully")
// tokens[0]: Type=TokenVerb,    VerbInfo={Base:"delete", Tense:"past"}
// tokens[1]: Type=TokenArticle, ArtType="definite"
// tokens[2]: Type=TokenNoun,    NounInfo={Base:"configuration", Plural:false}
// tokens[3]: Type=TokenNoun,    NounInfo={Base:"file", Plural:true}
// tokens[4]: Type=TokenUnknown

Pass 1: Classify and Mark

Each word is checked in this priority order:

  1. Article -- "a", "an", "the"
  2. Both verb and noun -- if the word is in both tables and is a known dual-class word, check for self-resolving inflections (inflected verb forms resolve as verbs, plural noun forms resolve as nouns). Base forms are marked as ambiguous for Pass 2.
  3. Verb only -- base forms, past tense, gerunds
  4. Noun only -- base forms, plurals
  5. Word -- domain vocabulary from gram.word
  6. Unknown -- fallback
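
The priority order above can be sketched as a single function. This is one possible reading of the rules, using toy tables and hypothetical names (`kind`, `classify`, `dualClass` and the maps are not the package's identifiers); a word counts as dual-class here when its base form is listed in `dualClass`.

```go
package main

import (
	"fmt"
	"strings"
)

type kind string

const (
	kindArticle   kind = "article"
	kindVerb      kind = "verb"
	kindNoun      kind = "noun"
	kindAmbiguous kind = "ambiguous" // left for pass 2
	kindWord      kind = "word"
	kindUnknown   kind = "unknown"
)

// Toy tables: form -> base for verbs and nouns, plus small lookup sets.
var (
	articles    = map[string]bool{"a": true, "an": true, "the": true}
	verbForms   = map[string]string{"commit": "commit", "committed": "commit", "delete": "delete", "deleted": "delete"}
	nounForms   = map[string]string{"commit": "commit", "commits": "commit", "file": "file", "files": "file"}
	dualClass   = map[string]bool{"commit": true}
	domainWords = map[string]bool{"url": true}
)

// classify applies the pass 1 checks in priority order.
func classify(raw string) kind {
	w := strings.ToLower(raw)
	if articles[w] {
		return kindArticle
	}
	vBase, isVerb := verbForms[w]
	nBase, isNoun := nounForms[w]
	if (isVerb && dualClass[vBase]) || (isNoun && dualClass[nBase]) {
		switch {
		case isVerb && w != vBase:
			return kindVerb // inflected verb form resolves itself
		case isNoun && w != nBase:
			return kindNoun // plural noun form resolves itself
		default:
			return kindAmbiguous // bare base form, decided in pass 2
		}
	}
	switch {
	case isVerb:
		return kindVerb
	case isNoun:
		return kindNoun
	case domainWords[w]:
		return kindWord
	}
	return kindUnknown
}

func main() {
	for _, w := range strings.Fields("Deleted the commit files successfully") {
		fmt.Printf("%-12s %s\n", w, classify(w))
	}
}
```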

Pass 2: Resolve Ambiguous Tokens

Dual-class base forms (words like "commit", "test", "run" that are both verbs and nouns) are resolved using seven weighted disambiguation signals:

| Signal | Weight | Description |
| --- | --- | --- |
| noun_determiner | 0.35 | Preceding token is "the", "a", "my", "this", etc. |
| verb_auxiliary | 0.25 | Preceding token is "is", "was", "will", "can", etc. |
| following_class | 0.15 | Next token is article/noun (verb signal) or verb (noun signal) |
| sentence_position | 0.10 | Sentence-initial position suggests imperative (verb signal) |
| verb_saturation | 0.10 | Confident verb already exists in clause (noun signal) |
| inflection_echo | 0.03 | Another token shares the same base in inflected form |
| default_prior | 0.02 | Always fires as verb signal (tiebreaker) |

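The winning classification's confidence is its score divided by the total score (verb score plus noun score). When the total is below 0.10, which means only default_prior fired, that ratio carries almost no information, so a fixed low-information confidence of 0.55/0.45 is used instead.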
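
The confidence calculation can be sketched as follows. The `signal` type and `resolve` function are hypothetical; the sketch assumes each fired signal contributes its full weight to one side and reads the 0.55/0.45 floor as winner/runner-up, both of which are assumptions about the real engine.

```go
package main

import "fmt"

// signal is a hypothetical record of one fired disambiguation signal.
type signal struct {
	Name   string
	Weight float64
	Verb   bool // true votes verb, false votes noun
}

// resolve sums the fired weights per class and returns the winner with
// its confidence, applying the low-information floor when total < 0.10.
func resolve(fired []signal) (winner string, conf float64) {
	var verb, noun float64
	for _, s := range fired {
		if s.Verb {
			verb += s.Weight
		} else {
			noun += s.Weight
		}
	}
	winner, top := "verb", verb
	if noun > verb {
		winner, top = "noun", noun
	}
	total := verb + noun
	if total < 0.10 {
		return winner, 0.55 // too little evidence to trust the ratio
	}
	return winner, top / total
}

func main() {
	// "the commit failed": a determiner precedes and a verb follows.
	w, c := resolve([]signal{
		{"noun_determiner", 0.35, false},
		{"following_class", 0.15, false},
		{"default_prior", 0.02, true},
	})
	fmt.Printf("%s %.2f\n", w, c)

	// Only the default prior fired.
	w, c = resolve([]signal{{"default_prior", 0.02, true}})
	fmt.Printf("%s %.2f\n", w, c)
}
```

Without the floor, the second case would report a confidence of 1.0 on the strength of a 0.02 tiebreaker alone.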

Token Struct

type Token struct {
    Raw        string          // Original text as it appeared
    Lower      string          // Lowercased form
    Type       TokenType       // Classification
    Confidence float64         // 0.0-1.0 classification confidence
    AltType    TokenType       // Runner-up classification (dual-class only)
    AltConf    float64         // Runner-up confidence
    VerbInfo   VerbMatch       // Populated when Type or AltType is TokenVerb
    NounInfo   NounMatch       // Populated when Type or AltType is TokenNoun
    WordCat    string          // Category key when Type is TokenWord
    ArtType    string          // "definite" or "indefinite" when Type is TokenArticle
    PunctType  string          // "progress", "question", etc. when Type is TokenPunctuation
    Signals    *SignalBreakdown // Non-nil only when WithSignals() option is set
}

Matching Methods

MatchVerb(word) -> (VerbMatch, bool)

type VerbMatch struct {
    Base  string // "delete"
    Tense string // "past", "gerund", or "base"
    Form  string // Original inflected form
}

  1. Tier 1: Check baseVerbs[word] (is it a known base verb?)
  2. Tier 2: Check pastToBase[word] and gerundToBase[word] (inverse maps)
  3. Tier 3: Apply reverse morphology rules, then round-trip verify

MatchNoun(word) -> (NounMatch, bool)

type NounMatch struct {
    Base   string // Singular form
    Plural bool   // Whether the matched form was plural
    Form   string // Original form
}

Follows the same 3-tier pattern, using the pluralToBase inverse map and reverseRegularPlural().

MatchWord(word) -> (string, bool)

Case-insensitive lookup in the domain vocabulary. Returns the category key.

MatchArticle(word) -> (string, bool)

Returns "indefinite" or "definite".

Reverse Morphology Rules

When tiers 1 and 2 produce no match, the engine generates candidate base forms by reversing regular English morphology rules. Multiple candidates are generated and then verified by round-tripping through the forward functions.

Past Tense Reversal

| Pattern | Rule | Example |
| --- | --- | --- |
| consonant + ied | -> consonant + y | copied -> copy |
| doubled consonant + ed | -> single consonant | stopped -> stop |
| stem + d (stem ends in e) | -> stem | created -> create |
| stem + ed | -> stem | walked -> walk |
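
A candidate generator for these four rules might look like the sketch below. `pastCandidates` and `isVowel` are hypothetical names, input is assumed to be lowercase ASCII, and the function over-generates on purpose: wrong candidates such as "walke" are expected to be filtered later by round-trip verification.

```go
package main

import (
	"fmt"
	"strings"
)

// isVowel reports whether b is one of a, e, i, o, u (ASCII lowercase).
func isVowel(b byte) bool { return strings.IndexByte("aeiou", b) >= 0 }

// pastCandidates returns possible base forms for a regular past tense,
// in the same order as the rule table.
func pastCandidates(word string) []string {
	var out []string
	n := len(word)
	// consonant + ied -> consonant + y (copied -> copy)
	if n > 3 && strings.HasSuffix(word, "ied") && !isVowel(word[n-4]) {
		out = append(out, word[:n-3]+"y")
	}
	if n > 2 && strings.HasSuffix(word, "ed") {
		// doubled consonant + ed -> single consonant (stopped -> stop)
		if n > 4 && word[n-3] == word[n-4] && !isVowel(word[n-3]) {
			out = append(out, word[:n-3])
		}
		out = append(out,
			word[:n-1], // stem ends in e: created -> create
			word[:n-2], // stem + ed: walked -> walk
		)
	}
	return out
}

func main() {
	for _, w := range []string{"copied", "stopped", "created", "walked"} {
		fmt.Println(w, pastCandidates(w))
	}
}
```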

Gerund Reversal

| Pattern | Rule | Example |
| --- | --- | --- |
| -ying | -> -ie | dying -> die |
| doubled consonant + ing | -> single consonant | stopping -> stop |
| direct -ing strip | -> stem | walking -> walk |
| add -e back | -> stem + e | creating -> create |
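
The gerund rules can be sketched the same way. `gerundCandidates` is a hypothetical helper assuming lowercase ASCII input; the minimum length of five characters is an arbitrary guard chosen for this sketch so that short words like "sing" are not stripped.

```go
package main

import (
	"fmt"
	"strings"
)

// isVowel reports whether b is one of a, e, i, o, u (ASCII lowercase).
func isVowel(b byte) bool { return strings.IndexByte("aeiou", b) >= 0 }

// gerundCandidates returns possible base forms for an -ing form,
// in the same order as the rule table.
func gerundCandidates(word string) []string {
	if len(word) < 5 || !strings.HasSuffix(word, "ing") {
		return nil
	}
	stem := strings.TrimSuffix(word, "ing")
	k := len(stem)
	var out []string
	// -ying -> -ie (dying -> die)
	if stem[k-1] == 'y' {
		out = append(out, stem[:k-1]+"ie")
	}
	// doubled consonant + ing -> single consonant (stopping -> stop)
	if k >= 2 && stem[k-1] == stem[k-2] && !isVowel(stem[k-1]) {
		out = append(out, stem[:k-1])
	}
	out = append(out,
		stem,     // direct strip: walking -> walk
		stem+"e", // add -e back: creating -> create
	)
	return out
}

func main() {
	for _, w := range []string{"dying", "stopping", "walking", "creating"} {
		fmt.Println(w, gerundCandidates(w))
	}
}
```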

Plural Reversal

| Pattern | Rule | Example |
| --- | --- | --- |
| consonant + -ies | -> consonant + y | entries -> entry |
| -ves | -> -f or -fe | wolves -> wolf, knives -> knife |
| sibilant + -es | -> sibilant | processes -> process |
| -s | -> stem | servers -> server |
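
A matching sketch for plurals follows. `pluralCandidates` is hypothetical, and the sibilant endings it checks (s, x, z, ch, sh) are an assumption about what the rule covers rather than something stated by the package.

```go
package main

import (
	"fmt"
	"strings"
)

// isVowel reports whether b is one of a, e, i, o, u (ASCII lowercase).
func isVowel(b byte) bool { return strings.IndexByte("aeiou", b) >= 0 }

// pluralCandidates returns possible singular forms for a regular plural,
// in the same order as the rule table.
func pluralCandidates(word string) []string {
	var out []string
	n := len(word)
	// consonant + ies -> consonant + y (entries -> entry)
	if n > 3 && strings.HasSuffix(word, "ies") && !isVowel(word[n-4]) {
		out = append(out, word[:n-3]+"y")
	}
	// -ves -> -f or -fe (wolves -> wolf, knives -> knife)
	if n > 3 && strings.HasSuffix(word, "ves") {
		out = append(out, word[:n-3]+"f", word[:n-3]+"fe")
	}
	// sibilant + es -> sibilant (processes -> process)
	if n > 2 && strings.HasSuffix(word, "es") {
		stem := word[:n-2]
		for _, sib := range []string{"s", "x", "z", "ch", "sh"} {
			if strings.HasSuffix(stem, sib) {
				out = append(out, stem)
				break
			}
		}
	}
	// -s -> stem (servers -> server)
	if n > 1 && strings.HasSuffix(word, "s") {
		out = append(out, word[:n-1])
	}
	return out
}

func main() {
	for _, w := range []string{"entries", "wolves", "knives", "processes", "servers"} {
		fmt.Println(w, pluralCandidates(w))
	}
}
```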

Round-Trip Verification

When tier 3 produces multiple candidate base forms, bestRoundTrip() selects the best one by applying the forward function to each candidate and checking if it reproduces the original inflected form. Only verified candidates are accepted.

When multiple candidates pass verification (ambiguity), selection priority is:

  1. Known base verb/noun -- candidate exists in the grammar index
  2. VCe pattern -- candidate ends in vowel-consonant-e (the "magic e" pattern found in real English verbs like "delete", "create", "use"). This avoids phantom verbs like "walke" or "processe" which have consonant-consonant-e endings.
  3. No trailing e -- default morphology path
  4. First match -- final tiebreaker
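
The verification and tie-breaking steps can be sketched as below. `pickBase` is a hypothetical stand-in whose signature is not taken from the package, and `toyPast` is a simplified forward function (it ignores consonant doubling and -y stems) used only to make the example runnable.

```go
package main

import (
	"fmt"
	"strings"
)

func isVowel(b byte) bool { return strings.IndexByte("aeiou", b) >= 0 }

// isVCe reports whether s ends in vowel + consonant + e ("delete", "use").
func isVCe(s string) bool {
	n := len(s)
	return n >= 3 && s[n-1] == 'e' && !isVowel(s[n-2]) && isVowel(s[n-3])
}

// toyPast is a simplified forward rule: add "d" after e, otherwise "ed".
func toyPast(base string) string {
	if strings.HasSuffix(base, "e") {
		return base + "d"
	}
	return base + "ed"
}

// pickBase keeps the candidates that round-trip to the inflected form,
// then applies the four selection priorities in order.
func pickBase(inflected string, candidates []string, known map[string]bool, forward func(string) string) (string, bool) {
	var verified []string
	for _, c := range candidates {
		if forward(c) == inflected {
			verified = append(verified, c)
		}
	}
	if len(verified) == 0 {
		return "", false
	}
	for _, c := range verified { // 1. known base
		if known[c] {
			return c, true
		}
	}
	for _, c := range verified { // 2. VCe pattern
		if isVCe(c) {
			return c, true
		}
	}
	for _, c := range verified { // 3. no trailing e
		if !strings.HasSuffix(c, "e") {
			return c, true
		}
	}
	return verified[0], true // 4. first match
}

func main() {
	fmt.Println(pickBase("deleted", []string{"delete", "delet"}, nil, toyPast))
	fmt.Println(pickBase("walked", []string{"walke", "walk"}, nil, toyPast))
}
```

Both candidates verify in each call, so the result is decided by the priorities: "delete" wins on the VCe rule, while "walke" fails it and "walk" wins as the form without a trailing e.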

Disambiguation Statistics

stats := reversal.DisambiguationStatsFromTokens(tokens)
// stats.TotalTokens     -- total token count
// stats.AmbiguousTokens -- count of dual-class tokens
// stats.ResolvedAsVerb  -- how many resolved as verb
// stats.ResolvedAsNoun  -- how many resolved as noun
// stats.AvgConfidence   -- average confidence across all classified tokens
// stats.LowConfidence   -- count where confidence < 0.7

Signal Breakdown

Enable WithSignals() to get detailed scoring on ambiguous tokens:

tok := reversal.NewTokeniser(reversal.WithSignals())
tokens := tok.Tokenise("the commit failed")

for _, t := range tokens {
    if t.Signals != nil {
        fmt.Printf("verb=%.2f noun=%.2f\n", t.Signals.VerbScore, t.Signals.NounScore)
        for _, c := range t.Signals.Components {
            fmt.Printf("  %s: weight=%.2f value=%.1f contrib=%.3f (%s)\n",
                c.Name, c.Weight, c.Value, c.Contrib, c.Reason)
        }
    }
}