# GrammarImprint — Grammar Fingerprinting
GrammarImprint is a lossy feature vector projection that converts text into a grammar fingerprint. Content is intentionally discarded — only grammatical structure is preserved. Two texts with similar grammar produce similar imprints, regardless of subject matter.
This is the foundation for the Poindexter classification pipeline and the LEM scoring system (replacing LLM-as-judge with deterministic grammar analysis).
## The Imprint Struct
```go
type GrammarImprint struct {
	VerbDistribution   map[string]float64 // verb base → normalised frequency
	TenseDistribution  map[string]float64 // "past"/"gerund"/"base" → ratio
	NounDistribution   map[string]float64 // noun base → normalised frequency
	PluralRatio        float64            // proportion of plural nouns (0.0–1.0)
	DomainVocabulary   map[string]int     // gram.word category → hit count
	ArticleUsage       map[string]float64 // "definite"/"indefinite" → ratio
	PunctuationPattern map[string]float64 // "label"/"progress"/"question" → ratio
	TokenCount         int
	UniqueVerbs        int
	UniqueNouns        int
}
```
All frequency maps (VerbDistribution, TenseDistribution, NounDistribution, ArticleUsage, PunctuationPattern) are normalised to sum to 1.0 after token collection. This converts raw counts into probability distributions.
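The count-to-distribution step can be sketched as follows. This is a minimal self-contained illustration; `normalise` is a hypothetical helper name, not necessarily what the reversal package calls it internally:

```go
package main

import "fmt"

// normalise converts raw counts into a probability distribution
// summing to 1.0. Hypothetical helper; the real method name in the
// reversal package may differ.
func normalise(counts map[string]int) map[string]float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	out := make(map[string]float64, len(counts))
	if total == 0 {
		return out // no tokens collected: leave the distribution empty
	}
	for k, c := range counts {
		out[k] = float64(c) / float64(total)
	}
	return out
}

func main() {
	dist := normalise(map[string]int{"past": 3, "base": 1})
	fmt.Println(dist["past"], dist["base"]) // 0.75 0.25
}
```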
## Creating an Imprint
```go
tok := reversal.NewTokeniser()
tokens := tok.Tokenise("Deleted the configuration files successfully")
imp := reversal.NewImprint(tokens)

// Result:
// TokenCount:        5
// UniqueVerbs:       1 (delete)
// UniqueNouns:       2 (configuration, file)
// TenseDistribution: {"past": 1.0}
// PluralRatio:       0.5
// ArticleUsage:      {"definite": 1.0}
```
## Similarity Calculation
`Similar(other)` returns a score in the range 0.0–1.0, computed as a weighted cosine similarity across five distribution dimensions:
| Dimension | Weight | Rationale |
|---|---|---|
| Verb distribution | 0.30 | Most domain-specific signal |
| Tense distribution | 0.20 | Temporal patterns distinguish narrative from imperative |
| Noun distribution | 0.25 | Entity focus indicates topic area |
| Article usage | 0.15 | Grammatical style (technical docs use more "the") |
| Punctuation pattern | 0.10 | Minor structural signal |
Key behaviour:
- Same text → similarity = 1.0
- Similar grammar, different content → similarity 0.5-0.9
- Different grammatical structure → similarity < 0.3
- Empty imprints → similarity = 1.0 (no signal = no difference)
- Dimensions with no data in either imprint are skipped (don't dilute score)
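The weighted combination and the dimension-skipping behaviour above can be sketched like this. This is an illustrative reconstruction, not the package's actual code: `combine`, its parameters, and the per-dimension score map are assumptions; the weights mirror the table above:

```go
package main

import "fmt"

// combine folds per-dimension cosine scores into one weighted score.
// Dimensions with no data in either imprint are skipped and the
// remaining weights are renormalised, so missing data does not
// dilute the score. Hypothetical sketch of the behaviour described
// above; names and signature are assumptions.
func combine(scores map[string]float64, present map[string]bool) float64 {
	weights := map[string]float64{
		"verb": 0.30, "tense": 0.20, "noun": 0.25,
		"article": 0.15, "punct": 0.10,
	}
	var total, used float64
	for dim, w := range weights {
		if !present[dim] { // no data in either imprint: skip
			continue
		}
		total += w * scores[dim]
		used += w
	}
	if used == 0 {
		return 1.0 // empty imprints: no signal = no difference
	}
	return total / used
}

func main() {
	// Only the verb and tense dimensions carry data here.
	s := combine(
		map[string]float64{"verb": 0.8, "tense": 1.0},
		map[string]bool{"verb": true, "tense": true},
	)
	fmt.Printf("%.2f\n", s) // 0.88
}
```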
## Cosine Similarity (mapSimilarity)
For each distribution pair, computes:
```
dot        = Σ(a[k] × b[k])  for all keys in union(a, b)
|a|        = sqrt(Σ a[k]²)
|b|        = sqrt(Σ b[k]²)
similarity = dot / (|a| × |b|)
```
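A direct Go rendering of the formula, assuming the signature `mapSimilarity(a, b map[string]float64) float64` (the actual implementation in the package may differ):

```go
package main

import (
	"fmt"
	"math"
)

// mapSimilarity computes cosine similarity between two distributions
// over the union of their keys. Missing keys read as zero, so a
// single pass over each map covers the union. Sketch of the formula
// above, not necessarily the package's exact code.
func mapSimilarity(a, b map[string]float64) float64 {
	var dot, normA, normB float64
	for k, av := range a {
		dot += av * b[k] // b[k] is 0 for keys absent from b
		normA += av * av
	}
	for _, bv := range b {
		normB += bv * bv
	}
	if normA == 0 || normB == 0 {
		return 0 // avoid division by zero for empty distributions
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	a := map[string]float64{"past": 1.0}
	b := map[string]float64{"past": 0.5, "base": 0.5}
	fmt.Printf("%.4f\n", mapSimilarity(a, b)) // 0.7071
}
```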
## The Lossy Property
GrammarImprint is intentionally lossy:
- "Delete the configuration file" and "Remove the deployment artifact" can produce similar imprints (both: imperative verb + definite article + noun)
- The actual words don't matter — only their grammatical roles
- This is the design goal: grammar structure is a privacy-preserving proxy for semantic similarity
## Confidence Weighting (Phase 1 — dual-class disambiguation)
Once the disambiguation system is implemented, tokens will carry confidence scores. A dual-class word like "commit" classified as a noun with 0.96 confidence contributes:
- 0.96 to `NounDistribution["commit"]`
- 0.04 to `VerbDistribution["commit"]` (via AltConf)
This preserves uncertainty — similar texts produce similar imprints even when individual token classifications wobble at the boundary.
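The split-weight contribution might look like the sketch below. The `Token` field names (`Base`, `Class`, `Conf`, `AltConf`) are assumptions inferred from the description above, not the package's actual types:

```go
package main

import "fmt"

// Token carries a primary classification plus residual confidence in
// the alternate class. Field names are assumptions based on the
// Phase 1 description, not the real reversal package types.
type Token struct {
	Base    string
	Class   string  // "noun" or "verb"
	Conf    float64 // confidence in Class
	AltConf float64 // residual confidence in the alternate class
}

// addToken splits a dual-class token's weight across both
// distributions instead of committing fully to one class.
func addToken(t Token, nouns, verbs map[string]float64) {
	if t.Class == "noun" {
		nouns[t.Base] += t.Conf
		verbs[t.Base] += t.AltConf
	} else {
		verbs[t.Base] += t.Conf
		nouns[t.Base] += t.AltConf
	}
}

func main() {
	nouns := map[string]float64{}
	verbs := map[string]float64{}

	// "commit" classified as noun with 0.96 confidence.
	addToken(Token{Base: "commit", Class: "noun", Conf: 0.96, AltConf: 0.04}, nouns, verbs)

	fmt.Println(nouns["commit"], verbs["commit"]) // 0.96 0.04
}
```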
## Phase 2 — Reference Distributions
The 88K scored seeds from LEM Phase 0 will be processed through the imprint pipeline to create per-domain reference distributions. New text can then be compared against these references for classification:
- Technical text → high similarity to technical reference
- Creative text → high similarity to creative reference
- Cross-domain → moderate similarity to multiple references (flagged for review)
Distance metrics (cosine, KL divergence, Mahalanobis) will be added to the Imprint type for this purpose.
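Classification against reference distributions could be sketched as below. Everything here is illustrative: `classify`, the 0.5 cross-domain threshold, and the sample reference data are assumptions, and only the verb dimension is compared for brevity:

```go
package main

import (
	"fmt"
	"math"
)

// cosine computes similarity between two normalised distributions
// over the union of their keys (missing keys read as zero).
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, av := range a {
		dot += av * b[k]
		na += av * av
	}
	for _, bv := range b {
		nb += bv * bv
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// classify compares a text's verb distribution against per-domain
// references, returning the best-matching domain and flagging
// cross-domain texts that match more than one reference. The 0.5
// threshold is purely illustrative.
func classify(text map[string]float64, refs map[string]map[string]float64) (best string, flagged bool) {
	bestScore, hits := 0.0, 0
	for domain, ref := range refs {
		s := cosine(text, ref)
		if s > 0.5 {
			hits++
		}
		if s > bestScore {
			bestScore, best = s, domain
		}
	}
	return best, hits > 1
}

func main() {
	// Hypothetical per-domain references built from scored seeds.
	refs := map[string]map[string]float64{
		"technical": {"delete": 0.5, "deploy": 0.5},
		"creative":  {"dream": 0.5, "wander": 0.5},
	}
	text := map[string]float64{"delete": 0.6, "deploy": 0.4}
	domain, flagged := classify(text, refs)
	fmt.Println(domain, flagged) // technical false
}
```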