R1: Remove "passed", "failed", "skipped" from gram.noun and
gram.word — these are past participles, not nouns.
R2: Add DisambiguationStats and WithWeights tests to
tokeniser_test.go using setup(t) pattern. Remove duplicates
from roundtrip_test.go.
R3: Guard buildSignalIndex per-field so partial locale data
falls back independently rather than silently disabling signals.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
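
The per-field guard in R3 can be pictured with a minimal sketch. It assumes each signal list falls back to a built-in default when the locale leaves it empty. The types, default word lists, and the name buildSignalIndexSketch are hypothetical and do not reproduce the real buildSignalIndex.

```go
package main

import "fmt"

// signalData is a hypothetical stand-in for the gram.signal section of a locale.
type signalData struct {
	NounDeterminer []string
	VerbAuxiliary  []string
	VerbInfinitive []string
}

// defaultSignals holds assumed fallback lists; the real defaults may differ.
var defaultSignals = signalData{
	NounDeterminer: []string{"the", "a", "an"},
	VerbAuxiliary:  []string{"will", "can", "should"},
	VerbInfinitive: []string{"to"},
}

type signalIndex struct {
	nounDet, verbAux, verbInf map[string]bool
}

// pick returns the locale list when it has entries, otherwise the fallback.
func pick(locale, fallback []string) []string {
	if len(locale) > 0 {
		return locale
	}
	return fallback
}

func toSet(words []string) map[string]bool {
	set := make(map[string]bool, len(words))
	for _, w := range words {
		set[w] = true
	}
	return set
}

// buildSignalIndexSketch guards each field on its own, so one missing list
// does not disable the other two signals.
func buildSignalIndexSketch(loc signalData) signalIndex {
	return signalIndex{
		nounDet: toSet(pick(loc.NounDeterminer, defaultSignals.NounDeterminer)),
		verbAux: toSet(pick(loc.VerbAuxiliary, defaultSignals.VerbAuxiliary)),
		verbInf: toSet(pick(loc.VerbInfinitive, defaultSignals.VerbInfinitive)),
	}
}

func main() {
	// Partial locale data: only determiners are supplied.
	idx := buildSignalIndexSketch(signalData{NounDeterminer: []string{"this", "that"}})
	fmt.Println(len(idx.nounDet), len(idx.verbAux), len(idx.verbInf))
}
```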
Update TODO.md and FINDINGS.md with implementation details,
signal weight table, and test coverage summary. Note expanded
dual-class candidates for Phase 2.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DisambiguationStatsFromTokens provides aggregate disambiguation
metrics for Phase 2 calibration. Round-trip tests verify all 6
dual-class words disambiguate correctly in both verb and noun
contexts, and that same-role imprints converge while different-role
imprints diverge.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Transformed tokens get Confidence 1.0 since the transformation
is deterministic and unambiguous.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dual-class tokens contribute to both verb and noun distributions
weighted by Confidence and AltConf. Non-ambiguous tokens (Confidence
1.0, AltConf 0.0) behave identically to before.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
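
As an illustration of the weighting described above, the sketch below adds each token to its primary class with weight Confidence and to its alternative class with weight AltConf. The token struct, its Class and AltClass fields, and the accumulate function are hypothetical; only the Confidence and AltConf names come from the commit message.

```go
package main

import "fmt"

// token is a hypothetical, reduced view of a classified token.
type token struct {
	Word       string
	Class      string  // primary reading: "verb" or "noun"
	AltClass   string  // alternative reading, empty when unambiguous
	Confidence float64 // weight of the primary reading
	AltConf    float64 // weight of the alternative reading
}

// accumulate builds weighted verb and noun distributions from tokens.
func accumulate(tokens []token) (verbs, nouns map[string]float64) {
	verbs = map[string]float64{}
	nouns = map[string]float64{}
	add := func(class, word string, weight float64) {
		switch class {
		case "verb":
			verbs[word] += weight
		case "noun":
			nouns[word] += weight
		}
	}
	for _, t := range tokens {
		add(t.Class, t.Word, t.Confidence)
		// Unambiguous tokens have AltConf 0.0 and contribute nothing here.
		if t.AltClass != "" && t.AltConf > 0 {
			add(t.AltClass, t.Word, t.AltConf)
		}
	}
	return verbs, nouns
}

func main() {
	verbs, nouns := accumulate([]token{
		{Word: "commit", Class: "verb", AltClass: "noun", Confidence: 0.75, AltConf: 0.25},
		{Word: "branch", Class: "noun", Confidence: 1.0},
	})
	fmt.Println(verbs, nouns)
}
```

With Confidence 1.0 and AltConf 0.0 the loop reduces to a plain count per class, which matches the "behave identically to before" note.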
Verify SignalBreakdown is populated when WithSignals() is set and
nil when not. Check individual signal components fire correctly.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NewTokeniser now accepts variadic options (backwards compatible).
Builds dual-class index from verb∩noun overlap and signal word
lookup sets from gram.signal data. Configurable weights via
WithWeights() for future calibration.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
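
A rough sketch of the two ideas in this commit: variadic functional options, and a dual-class index built as the intersection of the verb and noun lists. Everything here is an assumption for illustration: the lowercase type and function names, the constructor's parameters, the weight keys, and the weight values. The real NewTokeniser and WithWeights signatures may look different.

```go
package main

import "fmt"

// sketchTokeniser is a hypothetical stand-in for the real Tokeniser.
type sketchTokeniser struct {
	dualClass map[string]bool
	weights   map[string]float64
}

// option mutates a tokeniser during construction.
type option func(*sketchTokeniser)

// withWeights overrides individual signal weights; unnamed ones keep defaults.
func withWeights(w map[string]float64) option {
	return func(t *sketchTokeniser) {
		for name, value := range w {
			t.weights[name] = value
		}
	}
}

// newSketchTokeniser accepts zero or more options, so existing callers that
// pass none keep compiling.
func newSketchTokeniser(verbs, nouns []string, opts ...option) *sketchTokeniser {
	t := &sketchTokeniser{
		dualClass: map[string]bool{},
		// Assumed default weights, for illustration only.
		weights: map[string]float64{
			"noun_determiner": 0.35,
			"verb_auxiliary":  0.25,
			"verb_infinitive": 0.25,
		},
	}
	verbSet := make(map[string]bool, len(verbs))
	for _, v := range verbs {
		verbSet[v] = true
	}
	// A word is dual-class when it appears in both lists.
	for _, n := range nouns {
		if verbSet[n] {
			t.dualClass[n] = true
		}
	}
	for _, opt := range opts {
		opt(t)
	}
	return t
}

func main() {
	verbs := []string{"run", "commit", "delete"}
	nouns := []string{"commit", "branch", "run"}
	t := newSketchTokeniser(verbs, nouns, withWeights(map[string]float64{"noun_determiner": 0.5}))
	fmt.Println(len(t.dualClass), t.weights["noun_determiner"])
}
```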
Every classified token now carries a Confidence score (1.0 for
unambiguous tokens). SignalBreakdown and SignalComponent types
provide detailed scoring for dual-class disambiguation.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add test, check, file as verbs and run, build as nouns so the
tokeniser can detect them in both grammatical roles. Add 15
contractions to verb_auxiliary signal list for dev text support.
Update reversal tests to use noun-only words (branch) in test
phrases, avoiding dual-class ambiguity until disambiguation lands
(Task 5).
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Normalise signal words to lowercase on load (defensive against
mixed-case entries in locale JSON). Strengthen test assertions
with expected counts and spot-checks. Clarify Priors field comment.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
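
A minimal sketch of the lowercase normalisation, assuming signal words are loaded into a set keyed by the lowercased word. The function name and the whitespace trimming are assumptions and not taken from the loader itself.

```go
package main

import (
	"fmt"
	"strings"
)

// normaliseSignalWords lowercases each entry and drops blanks, so mixed-case
// entries such as "The" and "THE" collapse into a single key.
func normaliseSignalWords(words []string) map[string]struct{} {
	set := make(map[string]struct{}, len(words))
	for _, w := range words {
		w = strings.ToLower(strings.TrimSpace(w))
		if w == "" {
			continue
		}
		set[w] = struct{}{}
	}
	return set
}

func main() {
	set := normaliseSignalWords([]string{"The", "THE", "an", " "})
	_, hasThe := set["the"]
	fmt.Println(len(set), hasThe)
}
```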
Load noun_determiner, verb_auxiliary, and verb_infinitive word lists
from gram.signal in locale JSON. Reserve Priors field for future
corpus-derived per-word disambiguation priors.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multi-signal probabilistic disambiguation with two-pass tokenisation.
Seven weighted signals resolve verb/noun ambiguity for words like
"commit", "run", "test", "check", "file", "build". Confidence scores
flow into imprints for the scoring/comprehension use case.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
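
To make the two-pass idea concrete, here is a small sketch under stated assumptions. It uses only three of the seven signals (preceding determiner, preceding auxiliary, preceding "to"). The weights, the neutral prior, the word lists, and all type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

var (
	determiners = map[string]bool{"the": true, "a": true, "an": true}
	auxiliaries = map[string]bool{"will": true, "can": true, "should": true}
	dualClass   = map[string]bool{"commit": true, "run": true, "test": true, "check": true, "file": true, "build": true}
)

// Assumed weights; real values would come from calibration.
const (
	prior = 0.1 // neutral starting score for both readings
	wDet  = 0.35
	wAux  = 0.25
	wInf  = 0.25
)

type result struct {
	Word       string
	Class      string
	Confidence float64
}

func disambiguate(text string) []result {
	words := strings.Fields(strings.ToLower(text))
	out := make([]result, len(words))

	// Pass 1: settle unambiguous words, mark dual-class words for later.
	for i, w := range words {
		out[i] = result{Word: w, Class: "other", Confidence: 1.0}
		if dualClass[w] {
			out[i].Class = "ambiguous"
		}
	}

	// Pass 2: score each ambiguous word from the word before it.
	for i := range out {
		if out[i].Class != "ambiguous" {
			continue
		}
		verb, noun := prior, prior
		if i > 0 {
			prev := words[i-1]
			if determiners[prev] {
				noun += wDet
			}
			if auxiliaries[prev] {
				verb += wAux
			}
			if prev == "to" {
				verb += wInf
			}
		}
		if verb > noun {
			out[i].Class, out[i].Confidence = "verb", verb/(verb+noun)
		} else {
			// Ties default to noun in this sketch.
			out[i].Class, out[i].Confidence = "noun", noun/(verb+noun)
		}
	}
	return out
}

func main() {
	for _, r := range disambiguate("the commit will run") {
		fmt.Printf("%s %s %.2f\n", r.Word, r.Class, r.Confidence)
	}
}
```

The confidence here is the winning score as a share of the total, so a word with no firing signal lands at 0.5 rather than 1.0.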
CLAUDE.md documents the grammar engine contract and sacred rules.
TODO.md is the task dispatch queue from core/go orchestration.
FINDINGS.md captures research and architectural decisions.
Co-Authored-By: Virgil <virgil@lethean.io>
Add "!", ";", and "," to splitTrailingPunct and matchPunctuation.
Previously only "...", "?", and ":" were recognised.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
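
A possible shape for the trailing-punctuation split, covering the six marks named above. The function name, its two-value return, and the choice to check the ellipsis first are assumptions; this is not the real splitTrailingPunct.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTrailingPunctSketch separates one trailing punctuation mark from a word.
// The three-character ellipsis is checked before the single-character marks.
func splitTrailingPunctSketch(word string) (base, punct string) {
	if strings.HasSuffix(word, "...") {
		return word[:len(word)-3], "..."
	}
	if n := len(word); n > 0 && strings.IndexByte("?:!;,", word[n-1]) >= 0 {
		return word[:n-1], word[n-1:]
	}
	return word, ""
}

func main() {
	for _, w := range []string{"done!", "wait...", "first,", "branch"} {
		base, punct := splitTrailingPunctSketch(w)
		fmt.Printf("%q -> %q %q\n", w, base, punct)
	}
}
```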
Reverse grammar tables into pattern matchers. 3-tier lookup:
JSON grammar data → irregular verb maps → regular morphology rules.
Verified by round-tripping through forward functions.
Export IrregularVerbs() and IrregularNouns() so the reversal engine
reads from the authoritative source instead of a duplicate list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
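
The 3-tier lookup with round-trip verification might look roughly like the sketch below, shown for past-tense reversal only. The word tables, the knownVerbs set, and the function names are hypothetical, and the forward rule is deliberately simplified. The known-verb check is needed because a round trip alone cannot tell "test" from "teste", since both produce "tested".

```go
package main

import (
	"fmt"
	"strings"
)

// irregularPast maps base form to past tense (forward direction); sample data.
var irregularPast = map[string]string{"run": "ran", "go": "went", "build": "built"}

// knownVerbs is an assumed vocabulary used to accept tier-3 candidates.
var knownVerbs = map[string]bool{"test": true, "delete": true, "check": true}

// pastTense is a simplified forward function used for round-trip checks.
func pastTense(base string) string {
	if p, ok := irregularPast[base]; ok {
		return p
	}
	if strings.HasSuffix(base, "e") {
		return base + "d"
	}
	return base + "ed"
}

// baseFromPast reverses a past-tense form using three tiers.
func baseFromPast(word string, grammarData map[string]string) (string, bool) {
	// Tier 1: explicit grammar data (past form -> base).
	if base, ok := grammarData[word]; ok {
		return base, true
	}
	// Tier 2: irregular verbs, read in reverse from the forward map.
	for base, past := range irregularPast {
		if past == word {
			return base, true
		}
	}
	// Tier 3: regular morphology, accepted only if the candidate is a known
	// verb and the forward function reproduces the input.
	if strings.HasSuffix(word, "ed") {
		for _, cand := range []string{word[:len(word)-2], word[:len(word)-1]} {
			if knownVerbs[cand] && pastTense(cand) == word {
				return cand, true
			}
		}
	}
	return "", false
}

func main() {
	data := map[string]string{"committed": "commit"}
	for _, w := range []string{"committed", "ran", "tested", "deleted", "xyzzed"} {
		base, ok := baseFromPast(w, data)
		fmt.Printf("%s -> %q %v\n", w, base, ok)
	}
}
```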
Grammar engine as zero-cost data augmentation: tense/number/formality
flips across 88K seeds = 528K+ verified training examples with no API
spend. Reversal engine provides automatic QA on transformed variants.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Captures the bidirectional grammar engine idea: using go-i18n tables
in reverse as a deterministic parser to extract semantic imprints from
documents without retaining content. Covers TIM/DataNode architecture,
88K seed calibration, Poindexter integration, and privacy properties.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>