| title | description |
|---|---|
| Reversal Engine | Decomposes inflected text back to base forms with grammatical metadata. |
Reversal Engine
The reversal engine (reversal/ package) converts inflected text back to base forms with grammatical metadata. It powers GrammarImprint, the Multiplier, and the classification pipeline.
The forward engine maps base forms to inflected forms:
```
PastTense("delete") -> "deleted"
Gerund("run")       -> "running"
```
The reversal engine reads those same tables backwards:
```
MatchVerb("deleted") -> {Base: "delete", Tense: "past"}
MatchVerb("running") -> {Base: "run", Tense: "gerund"}
```
3-Tier Lookup Strategy
All reverse lookups follow the same three-tier pattern, stopping at the first match:
| Tier | Source | Example |
|---|---|---|
| 1 | JSON grammar data (gram.verb.*, gram.noun.*) | "committed" -> past of "commit" |
| 2 | Irregular verb/noun Go maps (IrregularVerbs()) | "went" -> past of "go" |
| 3 | Regular morphology rules + round-trip verification | "processed" -> past of "process" |
JSON takes precedence -- if a verb appears in both en.json and the irregular Go map, the JSON form wins. This lets locale files override built-in rules.
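The stop-at-first-match precedence can be sketched as a chain of lookups. Everything below (the map names, their contents, and the reversePast helper) is an illustrative stand-in for the package's internals, not its actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative stand-ins for the real lookup sources.
var jsonPast = map[string]string{"committed": "commit"} // tier 1: JSON grammar data
var irregularPast = map[string]string{"went": "go"}     // tier 2: irregular Go map

// reversePast sketches the 3-tier lookup, stopping at the first tier that matches.
func reversePast(word string) (string, bool) {
	if base, ok := jsonPast[word]; ok { // tier 1: JSON wins over the Go map
		return base, true
	}
	if base, ok := irregularPast[word]; ok { // tier 2: built-in irregulars
		return base, true
	}
	if strings.HasSuffix(word, "ed") && len(word) > 3 { // tier 3: regular morphology
		return strings.TrimSuffix(word, "ed"), true
	}
	return "", false
}

func main() {
	for _, w := range []string{"committed", "went", "processed"} {
		if base, ok := reversePast(w); ok {
			fmt.Printf("%s -> past of %q\n", w, base)
		}
	}
}
```

Note that "committed" must be caught by tier 1 or 2: the naive tier-3 strip would produce the phantom base "committ".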
Creating a Tokeniser
```go
// English tokeniser (default)
tok := reversal.NewTokeniser()

// Language-specific tokeniser
tok := reversal.NewTokeniserForLang("en")

// With options
tok := reversal.NewTokeniser(
	reversal.WithSignals(),              // Enable detailed SignalBreakdown on ambiguous tokens
	reversal.WithWeights(customWeights), // Override disambiguation signal weights
)
```
The tokeniser builds six inverse lookup maps at construction time:
| Map | Direction | Source |
|---|---|---|
| pastToBase | "deleted" -> "delete" | Inverse of gram.verb.*.past |
| gerundToBase | "deleting" -> "delete" | Inverse of gram.verb.*.gerund |
| baseVerbs | "delete" -> true | All known verb bases |
| pluralToBase | "files" -> "file" | Inverse of gram.noun.*.other |
| baseNouns | "file" -> true | All known noun bases |
| words | "url" -> "URL" | Domain vocabulary |
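Construction-time inversion is straightforward to sketch. The forward table below is a tiny hypothetical stand-in for the JSON grammar data:

```go
package main

import "fmt"

// invert builds an inverse lookup map (inflected -> base) from a forward
// table (base -> inflected), as the tokeniser does once at construction.
func invert(forward map[string]string) map[string]string {
	inv := make(map[string]string, len(forward))
	for base, inflected := range forward {
		inv[inflected] = base
	}
	return inv
}

func main() {
	// Hypothetical forward table standing in for gram.verb.*.past.
	forwardPast := map[string]string{"delete": "deleted", "commit": "committed"}
	pastToBase := invert(forwardPast)
	fmt.Println(pastToBase["deleted"]) // delete
}
```

Paying the inversion cost once up front makes every subsequent MatchVerb/MatchNoun call a constant-time map hit for tiers 1 and 2.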
Token Types
```go
TokenUnknown     // Unrecognised word
TokenVerb        // Matched verb (VerbInfo populated)
TokenNoun        // Matched noun (NounInfo populated)
TokenArticle     // "a", "an", "the"
TokenWord        // Domain word from gram.word map
TokenPunctuation // "...", "?", "!", ":", ";", ","
```
Tokenise()
Splits text on whitespace and classifies each word using a two-pass algorithm.
```go
tok := reversal.NewTokeniser()
tokens := tok.Tokenise("Deleted the configuration files successfully")

// tokens[0]: Type=TokenVerb, VerbInfo={Base:"delete", Tense:"past"}
// tokens[1]: Type=TokenArticle, ArtType="definite"
// tokens[2]: Type=TokenNoun, NounInfo={Base:"configuration", Plural:false}
// tokens[3]: Type=TokenNoun, NounInfo={Base:"file", Plural:true}
// tokens[4]: Type=TokenUnknown
```
Pass 1: Classify and Mark
Each word is checked in this priority order:
- Article -- "a", "an", "the"
- Both verb and noun -- if the word is in both tables and is a known dual-class word, check for self-resolving inflections (inflected verb forms resolve as verbs, plural noun forms resolve as nouns). Base forms are marked as ambiguous for Pass 2.
- Verb only -- base forms, past tense, gerunds
- Noun only -- base forms, plurals
- Word -- domain vocabulary from gram.word
- Unknown -- fallback
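The Pass-1 priority order amounts to an ordered switch. The tables and helper below are simplified stand-ins for the real grammar data:

```go
package main

import "fmt"

// Simplified stand-ins for the grammar tables; real data comes from en.json.
var (
	articles = map[string]bool{"a": true, "an": true, "the": true}
	verbs    = map[string]bool{"delete": true, "commit": true}
	nouns    = map[string]bool{"file": true, "commit": true}
	words    = map[string]bool{"url": true}
)

type kind int

const (
	unknown kind = iota
	article
	ambiguous // dual-class base form, deferred to Pass 2
	verb
	noun
	word
)

// classify applies the Pass-1 priority order: article, dual-class,
// verb-only, noun-only, domain word, then unknown.
func classify(w string) kind {
	switch {
	case articles[w]:
		return article
	case verbs[w] && nouns[w]:
		return ambiguous
	case verbs[w]:
		return verb
	case nouns[w]:
		return noun
	case words[w]:
		return word
	default:
		return unknown
	}
}

func main() {
	fmt.Println(classify("the") == article, classify("commit") == ambiguous, classify("file") == noun)
}
```

The key point the switch captures is that ordering matters: "commit" must hit the dual-class case before either single-class case, so it reaches Pass 2 instead of being prematurely fixed as a verb.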
Pass 2: Resolve Ambiguous Tokens
Dual-class base forms (words like "commit", "test", "run" that are both verbs and nouns) are resolved using seven weighted disambiguation signals:
| Signal | Weight | Description |
|---|---|---|
| noun_determiner | 0.35 | Preceding token is "the", "a", "my", "this", etc. |
| verb_auxiliary | 0.25 | Preceding token is "is", "was", "will", "can", etc. |
| following_class | 0.15 | Next token is article/noun (verb signal) or verb (noun signal) |
| sentence_position | 0.10 | Sentence-initial position suggests imperative (verb signal) |
| verb_saturation | 0.10 | Confident verb already exists in clause (noun signal) |
| inflection_echo | 0.03 | Another token shares the same base in inflected form |
| default_prior | 0.02 | Always fires as verb signal (tiebreaker) |
The winning classification gets confidence = its score / total score. When total score is below 0.10 (only default prior fired), a low-information confidence floor of 0.55/0.45 is used.
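The scoring rule can be written out directly. This is a minimal sketch of the normalisation described above, not the package's actual function:

```go
package main

import "fmt"

// confidence applies the rule above: the winner's confidence is its share
// of the total signal score; when only the 0.02 default prior fired
// (total < 0.10), the fixed 0.55/0.45 low-information split applies instead.
func confidence(verbScore, nounScore float64) (winner string, conf float64) {
	total := verbScore + nounScore
	if total < 0.10 {
		return "verb", 0.55 // low-information floor
	}
	if verbScore >= nounScore {
		return "verb", verbScore / total
	}
	return "noun", nounScore / total
}

func main() {
	fmt.Println(confidence(0.02, 0.0))  // only the default prior fired
	fmt.Println(confidence(0.02, 0.35)) // noun_determiner outweighs the prior
}
```

With "the commit", for example, noun_determiner (0.35) against the default prior (0.02) yields a noun classification at roughly 0.95 confidence.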
Token Struct
```go
type Token struct {
	Raw        string           // Original text as it appeared
	Lower      string           // Lowercased form
	Type       TokenType        // Classification
	Confidence float64          // 0.0-1.0 classification confidence
	AltType    TokenType        // Runner-up classification (dual-class only)
	AltConf    float64          // Runner-up confidence
	VerbInfo   VerbMatch        // Populated when Type or AltType is TokenVerb
	NounInfo   NounMatch        // Populated when Type or AltType is TokenNoun
	WordCat    string           // Category key when Type is TokenWord
	ArtType    string           // "definite" or "indefinite" when Type is TokenArticle
	PunctType  string           // "progress", "question", etc. when Type is TokenPunctuation
	Signals    *SignalBreakdown // Non-nil only when WithSignals() option is set
}
```
Matching Methods
MatchVerb(word) -> (VerbMatch, bool)
```go
type VerbMatch struct {
	Base  string // "delete"
	Tense string // "past", "gerund", or "base"
	Form  string // Original inflected form
}
```

- Tier 1: check baseVerbs[word] (is it a known base verb?)
- Tier 2: check pastToBase[word] and gerundToBase[word] (inverse maps)
- Tier 3: apply reverse morphology rules, then round-trip verify
MatchNoun(word) -> (NounMatch, bool)
```go
type NounMatch struct {
	Base   string // Singular form
	Plural bool   // Whether the matched form was plural
	Form   string // Original form
}
```
Same 3-tier pattern with pluralToBase inverse map and reverseRegularPlural().
MatchWord(word) -> (string, bool)
Case-insensitive lookup in the domain vocabulary. Returns the category key.
MatchArticle(word) -> (string, bool)
Returns "indefinite" or "definite".
Reverse Morphology Rules
When tiers 1 and 2 produce no match, the engine generates candidate base forms by reversing regular English morphology rules. Multiple candidates are generated and then verified by round-tripping through the forward functions.
Past Tense Reversal
| Pattern | Rule | Example |
|---|---|---|
| consonant + ied | -> consonant + y | copied -> copy |
| doubled consonant + ed | -> single consonant | stopped -> stop |
| stem + d (stem ends in e) | -> stem | created -> create |
| stem + ed | -> stem | walked -> walk |
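These rules can be sketched as a candidate generator: several rules may apply to one word, so all plausible bases are emitted and the round-trip step (described below) picks the real one. The function name and rule encoding here are illustrative, not the engine's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// pastCandidates applies the past-tense reversal rules from the table,
// emitting every plausible base form. The real engine then round-trip
// verifies each candidate against the forward PastTense function.
func pastCandidates(w string) []string {
	var out []string
	if strings.HasSuffix(w, "ied") && len(w) > 3 {
		out = append(out, w[:len(w)-3]+"y") // copied -> copy
	}
	if strings.HasSuffix(w, "ed") && len(w) > 4 {
		stem := w[:len(w)-2]
		if len(stem) >= 2 && stem[len(stem)-1] == stem[len(stem)-2] {
			out = append(out, stem[:len(stem)-1]) // stopped -> stop
		}
		out = append(out, stem+"e") // created -> create (strip only the "d")
		out = append(out, stem)     // walked -> walk
	}
	return out
}

func main() {
	for _, w := range []string{"copied", "stopped", "created", "walked"} {
		fmt.Println(w, "->", pastCandidates(w))
	}
}
```

Note the phantom candidates ("stoppe", "walke"): the generator is deliberately over-eager, because round-trip verification discards anything that does not reproduce the original form.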
Gerund Reversal
| Pattern | Rule | Example |
|---|---|---|
| -ying | -> -ie | dying -> die |
| doubled consonant + ing | -> single consonant | stopping -> stop |
| direct -ing strip | -> stem | walking -> walk |
| add -e back | -> stem + e | creating -> create |
Plural Reversal
| Pattern | Rule | Example |
|---|---|---|
| consonant + -ies | -> consonant + y | entries -> entry |
| -ves | -> -f or -fe | wolves -> wolf, knives -> knife |
| sibilant + -es | -> sibilant | processes -> process |
| -s | -> stem | servers -> server |
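The plural table follows the same candidate-generation pattern. This sketch only generates candidates; the package's reverseRegularPlural() additionally round-trip verifies them, and the rule encoding here is an illustrative assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// singularCandidates applies the plural-reversal rules from the table,
// first matching rule wins, emitting every plausible singular form.
func singularCandidates(w string) []string {
	switch {
	case strings.HasSuffix(w, "ies") && len(w) > 3:
		return []string{w[:len(w)-3] + "y"} // entries -> entry
	case strings.HasSuffix(w, "ves"):
		stem := w[:len(w)-3]
		return []string{stem + "f", stem + "fe"} // wolves -> wolf, knives -> knife
	case strings.HasSuffix(w, "es"):
		return []string{w[:len(w)-2], w[:len(w)-1]} // processes -> process; keep "-e" stems too
	case strings.HasSuffix(w, "s"):
		return []string{w[:len(w)-1]} // servers -> server
	}
	return nil
}

func main() {
	for _, w := range []string{"entries", "wolves", "processes", "servers"} {
		fmt.Println(w, "->", singularCandidates(w))
	}
}
```

The -ves rule is the reason multiple candidates matter: "wolves" and "knives" reverse through the same suffix but land on -f and -fe respectively, and only round-tripping can tell them apart.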
Round-Trip Verification
When tier 3 produces multiple candidate base forms, bestRoundTrip() selects the best one by applying the forward function to each candidate and checking if it reproduces the original inflected form. Only verified candidates are accepted.
When multiple candidates pass verification (ambiguity), selection priority is:
- Known base verb/noun -- candidate exists in the grammar index
- VCe pattern -- candidate ends in vowel-consonant-e (the "magic e" pattern found in real English verbs like "delete", "create", "use"). This avoids phantom verbs like "walke" or "processe" which have consonant-consonant-e endings.
- No trailing e -- default morphology path
- First match -- final tiebreaker
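Verification and tie-breaking can be sketched end to end. forwardPast below is a toy stand-in for the engine's forward inflection function, and the known-base check (priority 1) is omitted for brevity:

```go
package main

import (
	"fmt"
	"strings"
)

// forwardPast is a toy stand-in for the engine's real forward function.
func forwardPast(base string) string {
	if strings.HasSuffix(base, "e") {
		return base + "d"
	}
	return base + "ed"
}

func isVowel(b byte) bool { return strings.ContainsRune("aeiou", rune(b)) }

// endsVCe reports whether base ends in vowel-consonant-e, the "magic e"
// pattern of real verbs like "delete" and "create".
func endsVCe(base string) bool {
	n := len(base)
	return n >= 3 && base[n-1] == 'e' && !isVowel(base[n-2]) && isVowel(base[n-3])
}

// bestRoundTrip keeps only candidates whose forward inflection reproduces
// the original form, then applies the VCe and no-trailing-e tie-breaks.
func bestRoundTrip(original string, candidates []string) (string, bool) {
	var verified []string
	for _, c := range candidates {
		if forwardPast(c) == original {
			verified = append(verified, c)
		}
	}
	if len(verified) == 0 {
		return "", false
	}
	for _, c := range verified { // prefer the VCe ("magic e") pattern
		if endsVCe(c) {
			return c, true
		}
	}
	for _, c := range verified { // then prefer no trailing e
		if !strings.HasSuffix(c, "e") {
			return c, true
		}
	}
	return verified[0], true // final tiebreaker: first match
}

func main() {
	fmt.Println(bestRoundTrip("created", []string{"creat", "create"}))
	fmt.Println(bestRoundTrip("walked", []string{"walke", "walk"}))
}
```

Both test cases are genuinely ambiguous after round-tripping ("creat"+ed and "create"+d both yield "created"); the VCe tie-break selects "create", and the no-trailing-e tie-break rejects the phantom "walke".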
Disambiguation Statistics
```go
stats := reversal.DisambiguationStatsFromTokens(tokens)

// stats.TotalTokens     -- total token count
// stats.AmbiguousTokens -- count of dual-class tokens
// stats.ResolvedAsVerb  -- how many resolved as verb
// stats.ResolvedAsNoun  -- how many resolved as noun
// stats.AvgConfidence   -- average confidence across all classified tokens
// stats.LowConfidence   -- count where confidence < 0.7
```
Signal Breakdown
Enable WithSignals() to get detailed scoring on ambiguous tokens:
```go
tok := reversal.NewTokeniser(reversal.WithSignals())
tokens := tok.Tokenise("the commit failed")

for _, t := range tokens {
	if t.Signals != nil {
		fmt.Printf("verb=%.2f noun=%.2f\n", t.Signals.VerbScore, t.Signals.NounScore)
		for _, c := range t.Signals.Components {
			fmt.Printf("  %s: weight=%.2f value=%.1f contrib=%.3f (%s)\n",
				c.Name, c.Weight, c.Value, c.Contrib, c.Reason)
		}
	}
}
```