| title | description |
|---|---|
| Reversal Engine | Decomposes inflected text back to base forms with grammatical metadata. |
Reversal Engine
The reversal engine (reversal/ package) converts inflected text back to base forms with grammatical metadata. It powers GrammarImprint, the Multiplier, and the classification pipeline.
The forward engine maps base forms to inflected forms:
```
PastTense("delete") -> "deleted"
Gerund("run")       -> "running"
```
The reversal engine reads those same tables backwards:
```
MatchVerb("deleted") -> {Base: "delete", Tense: "past"}
MatchVerb("running") -> {Base: "run", Tense: "gerund"}
```
3-Tier Lookup Strategy
All reverse lookups follow the same three-tier pattern, stopping at the first match:
| Tier | Source | Example |
|---|---|---|
| 1 | JSON grammar data (gram.verb.*, gram.noun.*) | "committed" -> past of "commit" |
| 2 | Irregular verb/noun Go maps (IrregularVerbs()) | "went" -> past of "go" |
| 3 | Regular morphology rules + round-trip verification | "processed" -> past of "process" |
JSON takes precedence -- if a verb appears in both en.json and the irregular Go map, the JSON form wins. This lets locale files override built-in rules.
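The stop-at-first-match precedence can be sketched as a chain of lookups. Everything below (the map names, their contents, and the reversePast helper) is an illustrative stand-in for the package's internals, not its actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative stand-ins for the real lookup sources.
var jsonPast = map[string]string{"committed": "commit"} // tier 1: JSON grammar data
var irregularPast = map[string]string{"went": "go"}     // tier 2: irregular Go map

// reversePast sketches the 3-tier lookup, stopping at the first tier that matches.
func reversePast(word string) (string, bool) {
	if base, ok := jsonPast[word]; ok { // tier 1: JSON wins over the Go map
		return base, true
	}
	if base, ok := irregularPast[word]; ok { // tier 2: built-in irregulars
		return base, true
	}
	if strings.HasSuffix(word, "ed") && len(word) > 3 { // tier 3: regular morphology
		return strings.TrimSuffix(word, "ed"), true
	}
	return "", false
}

func main() {
	for _, w := range []string{"committed", "went", "processed"} {
		if base, ok := reversePast(w); ok {
			fmt.Printf("%s -> past of %q\n", w, base)
		}
	}
}
```

Note that "committed" must be caught by tier 1 or 2: the naive tier-3 strip would produce the phantom base "committ".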
Creating a Tokeniser
```go
// English tokeniser (default)
tok := reversal.NewTokeniser()

// Language-specific tokeniser
tok := reversal.NewTokeniserForLang("en")

// With options
tok := reversal.NewTokeniser(
	reversal.WithSignals(),              // Enable detailed SignalBreakdown on ambiguous tokens
	reversal.WithWeights(customWeights), // Override disambiguation signal weights
)
```
The tokeniser builds six inverse lookup maps at construction time:
| Map | Direction | Source |
|---|---|---|
| pastToBase | "deleted" -> "delete" | Inverse of gram.verb.*.past |
| gerundToBase | "deleting" -> "delete" | Inverse of gram.verb.*.gerund |
| baseVerbs | "delete" -> true | All known verb bases |
| pluralToBase | "files" -> "file" | Inverse of gram.noun.*.other |
| baseNouns | "file" -> true | All known noun bases |
| words | "url" -> "URL" | Domain vocabulary |
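Construction-time inversion is straightforward to sketch. The forward table below is a tiny hypothetical stand-in for the JSON grammar data:

```go
package main

import "fmt"

// invert builds an inverse lookup map (inflected -> base) from a forward
// table (base -> inflected), as the tokeniser does once at construction.
func invert(forward map[string]string) map[string]string {
	inv := make(map[string]string, len(forward))
	for base, inflected := range forward {
		inv[inflected] = base
	}
	return inv
}

func main() {
	// Hypothetical forward table standing in for gram.verb.*.past.
	forwardPast := map[string]string{"delete": "deleted", "commit": "committed"}
	pastToBase := invert(forwardPast)
	fmt.Println(pastToBase["deleted"]) // delete
}
```

Paying the inversion cost once up front makes every subsequent MatchVerb/MatchNoun call a constant-time map hit for tiers 1 and 2.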
Token Types
```go
TokenUnknown     // Unrecognised word
TokenVerb        // Matched verb (VerbInfo populated)
TokenNoun        // Matched noun (NounInfo populated)
TokenArticle     // "a", "an", "the"
TokenWord        // Domain word from gram.word map
TokenPunctuation // "...", "?", "!", ":", ";", ","
```
Tokenise()
Splits text on whitespace and classifies each word using a two-pass algorithm.
```go
tok := reversal.NewTokeniser()
tokens := tok.Tokenise("Deleted the configuration files successfully")

// tokens[0]: Type=TokenVerb, VerbInfo={Base:"delete", Tense:"past"}
// tokens[1]: Type=TokenArticle, ArtType="definite"
// tokens[2]: Type=TokenNoun, NounInfo={Base:"configuration", Plural:false}
// tokens[3]: Type=TokenNoun, NounInfo={Base:"file", Plural:true}
// tokens[4]: Type=TokenUnknown
```
Pass 1: Classify and Mark
Each word is checked in this priority order:
- Article -- "a", "an", "the"
- Both verb and noun -- if the word is in both tables and is a known dual-class word, check for self-resolving inflections (inflected verb forms resolve as verbs, plural noun forms resolve as nouns). Base forms are marked as ambiguous for Pass 2.
- Verb only -- base forms, past tense, gerunds
- Noun only -- base forms, plurals
- Word -- domain vocabulary from gram.word
- Unknown -- fallback
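The Pass-1 priority order amounts to an ordered switch. The tables and helper below are simplified stand-ins for the real grammar data:

```go
package main

import "fmt"

// Simplified stand-ins for the grammar tables; real data comes from en.json.
var (
	articles = map[string]bool{"a": true, "an": true, "the": true}
	verbs    = map[string]bool{"delete": true, "commit": true}
	nouns    = map[string]bool{"file": true, "commit": true}
	words    = map[string]bool{"url": true}
)

type kind int

const (
	unknown kind = iota
	article
	ambiguous // dual-class base form, deferred to Pass 2
	verb
	noun
	word
)

// classify applies the Pass-1 priority order: article, dual-class,
// verb-only, noun-only, domain word, then unknown.
func classify(w string) kind {
	switch {
	case articles[w]:
		return article
	case verbs[w] && nouns[w]:
		return ambiguous
	case verbs[w]:
		return verb
	case nouns[w]:
		return noun
	case words[w]:
		return word
	default:
		return unknown
	}
}

func main() {
	fmt.Println(classify("the") == article, classify("commit") == ambiguous, classify("file") == noun)
}
```

The key point the switch captures is that ordering matters: "commit" must hit the dual-class case before either single-class case, so it reaches Pass 2 instead of being prematurely fixed as a verb.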
Pass 2: Resolve Ambiguous Tokens
Dual-class base forms (words like "commit", "test", "run" that are both verbs and nouns) are resolved using seven weighted disambiguation signals:
| Signal | Weight | Description |
|---|---|---|
| noun_determiner | 0.35 | Preceding token is "the", "a", "my", "this", etc. |
| verb_auxiliary | 0.25 | Preceding token is "is", "was", "will", "can", etc. |
| following_class | 0.15 | Next token is article/noun (verb signal) or verb (noun signal) |
| sentence_position | 0.10 | Sentence-initial position suggests imperative (verb signal) |
| verb_saturation | 0.10 | Confident verb already exists in clause (noun signal) |
| inflection_echo | 0.03 | Another token shares the same base in inflected form |
| default_prior | 0.02 | Always fires as verb signal (tiebreaker) |
The winning classification gets confidence = its score / total score. When total score is below 0.10 (only default prior fired), a low-information confidence floor of 0.55/0.45 is used.
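The scoring rule can be written out directly. This is a minimal sketch of the normalisation described above, not the package's actual function:

```go
package main

import "fmt"

// confidence applies the rule above: the winner's confidence is its share
// of the total signal score; when only the 0.02 default prior fired
// (total < 0.10), the fixed 0.55/0.45 low-information split applies instead.
func confidence(verbScore, nounScore float64) (winner string, conf float64) {
	total := verbScore + nounScore
	if total < 0.10 {
		return "verb", 0.55 // low-information floor
	}
	if verbScore >= nounScore {
		return "verb", verbScore / total
	}
	return "noun", nounScore / total
}

func main() {
	fmt.Println(confidence(0.02, 0.0))  // only the default prior fired
	fmt.Println(confidence(0.02, 0.35)) // noun_determiner outweighs the prior
}
```

With "the commit", for example, noun_determiner (0.35) against the default prior (0.02) yields a noun classification at roughly 0.95 confidence.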
Token Struct
```go
type Token struct {
	Raw        string           // Original text as it appeared
	Lower      string           // Lowercased form
	Type       TokenType        // Classification
	Confidence float64          // 0.0-1.0 classification confidence
	AltType    TokenType        // Runner-up classification (dual-class only)
	AltConf    float64          // Runner-up confidence
	VerbInfo   VerbMatch        // Populated when Type or AltType is TokenVerb
	NounInfo   NounMatch        // Populated when Type or AltType is TokenNoun
	WordCat    string           // Category key when Type is TokenWord
	ArtType    string           // "definite" or "indefinite" when Type is TokenArticle
	PunctType  string           // "progress", "question", etc. when Type is TokenPunctuation
	Signals    *SignalBreakdown // Non-nil only when WithSignals() option is set
}
```
Matching Methods
MatchVerb(word) -> (VerbMatch, bool)
```go
type VerbMatch struct {
	Base  string // "delete"
	Tense string // "past", "gerund", or "base"
	Form  string // Original inflected form
}
```

- Tier 1: check baseVerbs[word] (is it a known base verb?)
- Tier 2: check pastToBase[word] and gerundToBase[word] (inverse maps)
- Tier 3: apply reverse morphology rules, then round-trip verify
MatchNoun(word) -> (NounMatch, bool)
```go
type NounMatch struct {
	Base   string // Singular form
	Plural bool   // Whether the matched form was plural
	Form   string // Original form
}
```
Same 3-tier pattern with pluralToBase inverse map and reverseRegularPlural().
MatchWord(word) -> (string, bool)
Case-insensitive lookup in the domain vocabulary. Returns the category key.
MatchArticle(word) -> (string, bool)
Returns "indefinite" or "definite".
Reverse Morphology Rules
When tiers 1 and 2 produce no match, the engine generates candidate base forms by reversing regular English morphology rules. Multiple candidates are generated and then verified by round-tripping through the forward functions.
Past Tense Reversal
| Pattern | Rule | Example |
|---|---|---|
| consonant + ied | -> consonant + y | copied -> copy |
| doubled consonant + ed | -> single consonant | stopped -> stop |
| stem + d (stem ends in e) | -> stem | created -> create |
| stem + ed | -> stem | walked -> walk |
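These rules can be sketched as a candidate generator: several rules may apply to one word, so all plausible bases are emitted and the round-trip step (described below) picks the real one. The function name and rule encoding here are illustrative, not the engine's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// pastCandidates applies the past-tense reversal rules from the table,
// emitting every plausible base form. The real engine then round-trip
// verifies each candidate against the forward PastTense function.
func pastCandidates(w string) []string {
	var out []string
	if strings.HasSuffix(w, "ied") && len(w) > 3 {
		out = append(out, w[:len(w)-3]+"y") // copied -> copy
	}
	if strings.HasSuffix(w, "ed") && len(w) > 4 {
		stem := w[:len(w)-2]
		if len(stem) >= 2 && stem[len(stem)-1] == stem[len(stem)-2] {
			out = append(out, stem[:len(stem)-1]) // stopped -> stop
		}
		out = append(out, stem+"e") // created -> create (strip only the "d")
		out = append(out, stem)     // walked -> walk
	}
	return out
}

func main() {
	for _, w := range []string{"copied", "stopped", "created", "walked"} {
		fmt.Println(w, "->", pastCandidates(w))
	}
}
```

Note the phantom candidates ("stoppe", "walke"): the generator is deliberately over-eager, because round-trip verification discards anything that does not reproduce the original form.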
Gerund Reversal
| Pattern | Rule | Example |
|---|---|---|
| -ying | -> -ie | dying -> die |
| doubled consonant + ing | -> single consonant | stopping -> stop |
| direct -ing strip | -> stem | walking -> walk |
| add -e back | -> stem + e | creating -> create |
Plural Reversal
| Pattern | Rule | Example |
|---|---|---|
| consonant + -ies | -> consonant + y | entries -> entry |
| -ves | -> -f or -fe | wolves -> wolf, knives -> knife |
| sibilant + -es | -> sibilant | processes -> process |
| -s | -> stem | servers -> server |
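The plural table follows the same candidate-generation pattern. This sketch only generates candidates; the package's reverseRegularPlural() additionally round-trip verifies them, and the rule encoding here is an illustrative assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// singularCandidates applies the plural-reversal rules from the table,
// first matching rule wins, emitting every plausible singular form.
func singularCandidates(w string) []string {
	switch {
	case strings.HasSuffix(w, "ies") && len(w) > 3:
		return []string{w[:len(w)-3] + "y"} // entries -> entry
	case strings.HasSuffix(w, "ves"):
		stem := w[:len(w)-3]
		return []string{stem + "f", stem + "fe"} // wolves -> wolf, knives -> knife
	case strings.HasSuffix(w, "es"):
		return []string{w[:len(w)-2], w[:len(w)-1]} // processes -> process; keep "-e" stems too
	case strings.HasSuffix(w, "s"):
		return []string{w[:len(w)-1]} // servers -> server
	}
	return nil
}

func main() {
	for _, w := range []string{"entries", "wolves", "processes", "servers"} {
		fmt.Println(w, "->", singularCandidates(w))
	}
}
```

The -ves rule is the reason multiple candidates matter: "wolves" and "knives" reverse through the same suffix but land on -f and -fe respectively, and only round-tripping can tell them apart.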
Round-Trip Verification
When tier 3 produces multiple candidate base forms, bestRoundTrip() selects the best one by applying the forward function to each candidate and checking if it reproduces the original inflected form. Only verified candidates are accepted.
When multiple candidates pass verification (ambiguity), selection priority is:
- Known base verb/noun -- candidate exists in the grammar index
- VCe pattern -- candidate ends in vowel-consonant-e (the "magic e" pattern found in real English verbs like "delete", "create", "use"). This avoids phantom verbs like "walke" or "processe" which have consonant-consonant-e endings.
- No trailing e -- default morphology path
- First match -- final tiebreaker
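Verification and tie-breaking can be sketched end to end. forwardPast below is a toy stand-in for the engine's forward inflection function, and the known-base check (priority 1) is omitted for brevity:

```go
package main

import (
	"fmt"
	"strings"
)

// forwardPast is a toy stand-in for the engine's real forward function.
func forwardPast(base string) string {
	if strings.HasSuffix(base, "e") {
		return base + "d"
	}
	return base + "ed"
}

func isVowel(b byte) bool { return strings.ContainsRune("aeiou", rune(b)) }

// endsVCe reports whether base ends in vowel-consonant-e, the "magic e"
// pattern of real verbs like "delete" and "create".
func endsVCe(base string) bool {
	n := len(base)
	return n >= 3 && base[n-1] == 'e' && !isVowel(base[n-2]) && isVowel(base[n-3])
}

// bestRoundTrip keeps only candidates whose forward inflection reproduces
// the original form, then applies the VCe and no-trailing-e tie-breaks.
func bestRoundTrip(original string, candidates []string) (string, bool) {
	var verified []string
	for _, c := range candidates {
		if forwardPast(c) == original {
			verified = append(verified, c)
		}
	}
	if len(verified) == 0 {
		return "", false
	}
	for _, c := range verified { // prefer the VCe ("magic e") pattern
		if endsVCe(c) {
			return c, true
		}
	}
	for _, c := range verified { // then prefer no trailing e
		if !strings.HasSuffix(c, "e") {
			return c, true
		}
	}
	return verified[0], true // final tiebreaker: first match
}

func main() {
	fmt.Println(bestRoundTrip("created", []string{"creat", "create"}))
	fmt.Println(bestRoundTrip("walked", []string{"walke", "walk"}))
}
```

Both test cases are genuinely ambiguous after round-tripping ("creat"+ed and "create"+d both yield "created"); the VCe tie-break selects "create", and the no-trailing-e tie-break rejects the phantom "walke".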
Disambiguation Statistics
```go
stats := reversal.DisambiguationStatsFromTokens(tokens)

// stats.TotalTokens     -- total token count
// stats.AmbiguousTokens -- count of dual-class tokens
// stats.ResolvedAsVerb  -- how many resolved as verb
// stats.ResolvedAsNoun  -- how many resolved as noun
// stats.AvgConfidence   -- average confidence across all classified tokens
// stats.LowConfidence   -- count where confidence < 0.7
```
Signal Breakdown
Enable WithSignals() to get detailed scoring on ambiguous tokens:
```go
tok := reversal.NewTokeniser(reversal.WithSignals())
tokens := tok.Tokenise("the commit failed")

for _, t := range tokens {
	if t.Signals != nil {
		fmt.Printf("verb=%.2f noun=%.2f\n", t.Signals.VerbScore, t.Signals.NounScore)
		for _, c := range t.Signals.Components {
			fmt.Printf("  %s: weight=%.2f value=%.1f contrib=%.3f (%s)\n",
				c.Name, c.Weight, c.Value, c.Contrib, c.Reason)
		}
	}
}
```