GrammarImprint
Virgil edited this page 2026-02-19 15:55:35 +00:00

GrammarImprint — Grammar Fingerprinting

GrammarImprint is a lossy feature vector projection that converts text into a grammar fingerprint. Content is intentionally discarded — only grammatical structure is preserved. Two texts with similar grammar produce similar imprints, regardless of subject matter.

This is the foundation for the Poindexter classification pipeline and the LEM scoring system (replacing LLM-as-judge with deterministic grammar analysis).

The Imprint Struct

type GrammarImprint struct {
    VerbDistribution   map[string]float64 // verb base → normalised frequency
    TenseDistribution  map[string]float64 // "past"/"gerund"/"base" → ratio
    NounDistribution   map[string]float64 // noun base → normalised frequency
    PluralRatio        float64            // proportion of plural nouns (0.0-1.0)
    DomainVocabulary   map[string]int     // gram.word category → hit count
    ArticleUsage       map[string]float64 // "definite"/"indefinite" → ratio
    PunctuationPattern map[string]float64 // "label"/"progress"/"question" → ratio
    TokenCount         int
    UniqueVerbs        int
    UniqueNouns        int
}

All frequency maps (VerbDistribution, TenseDistribution, NounDistribution, ArticleUsage, PunctuationPattern) are normalised to sum to 1.0 after token collection. This converts raw counts into probability distributions.
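The normalisation step can be sketched as follows (normalise is a hypothetical helper name; the actual function in the package may differ):

```go
package main

import "fmt"

// normalise converts raw counts into a probability distribution
// that sums to 1.0, mirroring the step described above. An empty
// input yields an empty distribution rather than dividing by zero.
func normalise(counts map[string]int) map[string]float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	dist := make(map[string]float64, len(counts))
	if total == 0 {
		return dist
	}
	for k, c := range counts {
		dist[k] = float64(c) / float64(total)
	}
	return dist
}

func main() {
	// Three past-tense tokens and one gerund → {"past": 0.75, "gerund": 0.25}.
	fmt.Println(normalise(map[string]int{"past": 3, "gerund": 1}))
}
```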

Creating an Imprint

tok := reversal.NewTokeniser()
tokens := tok.Tokenise("Deleted the configuration files successfully")
imp := reversal.NewImprint(tokens)

// Result:
// TokenCount: 5
// UniqueVerbs: 1 (delete)
// UniqueNouns: 2 (configuration, file)
// TenseDistribution: {"past": 1.0}
// PluralRatio: 0.5
// ArticleUsage: {"definite": 1.0}

Similarity Calculation

Similar(other) returns a score from 0.0 to 1.0, computed as a weighted cosine similarity across five distribution dimensions:

Dimension             Weight  Rationale
Verb distribution     0.30    Most domain-specific signal
Tense distribution    0.20    Temporal patterns distinguish narrative from imperative
Noun distribution     0.25    Entity focus indicates topic area
Article usage         0.15    Grammatical style (technical docs use more "the")
Punctuation pattern   0.10    Minor structural signal

Key behaviour:

  • Same text → similarity = 1.0
  • Similar grammar, different content → similarity 0.5-0.9
  • Different grammatical structure → similarity < 0.3
  • Empty imprints → similarity = 1.0 (no signal = no difference)
  • Dimensions with no data in either imprint are skipped (don't dilute score)
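The weighting and skip logic above can be sketched as follows. The dimension struct and combine function are illustrative, not the package's actual API; per-dimension cosine similarities are assumed to be precomputed:

```go
package main

import "fmt"

// dimension holds one distribution comparison: its documented weight,
// its precomputed cosine similarity, and whether either imprint had
// any data for it (hypothetical names, not the package's API).
type dimension struct {
	weight  float64
	sim     float64
	hasData bool
}

// combine applies the documented weights, skipping dimensions where
// neither imprint has data so missing signal does not dilute the
// score, and renormalising over the weights actually used.
func combine(dims []dimension) float64 {
	var score, used float64
	for _, d := range dims {
		if !d.hasData {
			continue // no data in either imprint: skip
		}
		score += d.weight * d.sim
		used += d.weight
	}
	if used == 0 {
		return 1.0 // two empty imprints: no signal = no difference
	}
	return score / used
}

func main() {
	dims := []dimension{
		{0.30, 0.9, true},  // verb distribution
		{0.20, 1.0, true},  // tense distribution
		{0.25, 0.0, false}, // noun distribution: no data, skipped
		{0.15, 0.8, true},  // article usage
		{0.10, 0.0, false}, // punctuation pattern: no data, skipped
	}
	fmt.Printf("%.3f\n", combine(dims)) // → 0.908
}
```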

Cosine Similarity (mapSimilarity)

For each pair of distributions, mapSimilarity computes:

dot = Σ(a[k] × b[k])  for all keys in union(a, b)
|a| = sqrt(Σ(a[k]²))
|b| = sqrt(Σ(b[k]²))
similarity = dot / (|a| × |b|)
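A minimal Go sketch of this computation, assuming plain map[string]float64 inputs (the package's actual mapSimilarity may handle edge cases differently):

```go
package main

import (
	"fmt"
	"math"
)

// mapSimilarity computes cosine similarity between two normalised
// distributions. Iterating a covers the union of keys for the dot
// product, because keys missing from b read as zero.
func mapSimilarity(a, b map[string]float64) float64 {
	var dot, normA, normB float64
	for k, av := range a {
		dot += av * b[k] // missing keys contribute 0
		normA += av * av
	}
	for _, bv := range b {
		normB += bv * bv
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	a := map[string]float64{"past": 0.5, "base": 0.5}
	b := map[string]float64{"past": 1.0}
	fmt.Printf("%.4f\n", mapSimilarity(a, b)) // → 0.7071
}
```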

The Lossy Property

GrammarImprint is intentionally lossy:

  • "Delete the configuration file" and "Remove the deployment artifact" can produce similar imprints (both: imperative verb + definite article + noun)
  • The actual words don't matter — only their grammatical roles
  • This is the design goal: grammar structure is a privacy-preserving proxy for semantic similarity

Confidence Weighting (Phase 1 — dual-class disambiguation)

Once the disambiguation system is implemented, tokens will carry confidence scores. A dual-class word like "commit" classified as noun with 0.96 confidence contributes:

  • 0.96 to NounDistribution["commit"]
  • 0.04 to VerbDistribution["commit"] (via AltConf)

This preserves uncertainty — similar texts produce similar imprints even when individual token classifications wobble at the boundary.
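The split contribution could look like this (addDualClass is a hypothetical helper; Phase 1 is not yet implemented, so the real API may differ):

```go
package main

import "fmt"

// addDualClass splits a dual-class token's mass between the noun and
// verb distributions according to its classification confidence,
// before the distributions are normalised (hypothetical sketch).
func addDualClass(nouns, verbs map[string]float64, word string, conf float64) {
	nouns[word] += conf       // primary classification
	verbs[word] += 1.0 - conf // alternative class, via AltConf
}

func main() {
	nouns := map[string]float64{}
	verbs := map[string]float64{}
	addDualClass(nouns, verbs, "commit", 0.96)
	fmt.Println(nouns["commit"], verbs["commit"])
}
```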

Phase 2 — Reference Distributions

The 88K scored seeds from LEM Phase 0 will be processed through the imprint pipeline to create per-domain reference distributions. New text can then be compared against these references for classification:

  • Technical text → high similarity to technical reference
  • Creative text → high similarity to creative reference
  • Cross-domain → moderate similarity to multiple references (flagged for review)

Distance metrics (cosine, KL divergence, Mahalanobis) will be added to the Imprint type for this purpose.
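The classification step could be sketched as a nearest-reference lookup. This is illustrative only: classify and its sims input (domain → similarity against that domain's reference distribution) are assumptions, since Phase 2 is not yet built:

```go
package main

import "fmt"

// classify picks the reference domain whose imprint is most similar
// to the new text's imprint. The sims map would be produced by
// comparing the text against each per-domain reference distribution.
func classify(sims map[string]float64) (string, float64) {
	best, bestSim := "", -1.0
	for domain, s := range sims {
		if s > bestSim {
			best, bestSim = domain, s
		}
	}
	return best, bestSim
}

func main() {
	domain, score := classify(map[string]float64{
		"technical": 0.91,
		"creative":  0.42,
	})
	fmt.Println(domain, score) // technical 0.91
}
```

A review flag for cross-domain text would then compare the top two scores rather than looking only at the winner.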