Multiplier — Deterministic Training Data Augmentation
The Multiplier generates grammatical variants of input text without any LLM calls. It produces deterministic, round-trip-verified output suitable for training-data augmentation and grammar-coverage testing.
How It Works
m := reversal.NewMultiplier()
variants := m.Expand("Delete the configuration file")
Output (deterministic order):
- Original: "Delete the configuration file"
- Past tense: "Deleted the configuration file"
- Gerund: "Deleting the configuration file"
- Plural noun: "Delete the configuration files"
- Past + plural: "Deleted the configuration files"
- Gerund + plural: "Deleting the configuration files"
Variant Generation Algorithm
Expand() applies transforms in a fixed order:
Step 1: Tokenise input
The tokeniser classifies each word as verb, noun, article, word, punctuation, or unknown.
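The classification step can be sketched with small lookup tables. This is a minimal sketch: the word lists, `Kind` type, and `classify` function below are illustrative assumptions, not the package's actual grammar tables or API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Kind mirrors the token classes named above.
type Kind int

const (
	Verb Kind = iota
	Noun
	Article
	Word
	Punct
	Unknown
)

// Illustrative stand-ins for the package's grammar tables.
var (
	verbs    = map[string]bool{"delete": true, "create": true}
	nouns    = map[string]bool{"file": true, "test": true}
	articles = map[string]bool{"the": true, "a": true, "an": true}
)

// classify assigns one kind per token: punctuation first, then the
// word lists, then the generic "word" class as a fallback.
func classify(w string) Kind {
	if w == "" {
		return Unknown
	}
	if len(w) == 1 && unicode.IsPunct(rune(w[0])) {
		return Punct
	}
	switch lw := strings.ToLower(w); {
	case verbs[lw]:
		return Verb
	case nouns[lw]:
		return Noun
	case articles[lw]:
		return Article
	}
	return Word
}

func main() {
	for _, w := range []string{"Delete", "the", "configuration", "file", ","} {
		fmt.Printf("%q %d\n", w, classify(w))
	}
}
```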
Step 2: Single verb transforms
For each verb in the token list, generate three variants:
- Past tense: apply PastTense(base)
- Gerund: apply Gerund(base)
- Base form: keep as-is (deduplicates with the original if already in base form)
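The two verb transforms can be sketched with naive regular-verb suffix rules. These rules are an assumption for illustration; the package's real PastTense and Gerund tables presumably also handle irregular verbs.

```go
package main

import (
	"fmt"
	"strings"
)

// pastTense applies a regular-verb rule: drop a trailing silent "e",
// then append "ed". Irregular verbs would need table lookups.
func pastTense(v string) string {
	return strings.TrimSuffix(v, "e") + "ed"
}

// gerund applies the same rule with "ing".
func gerund(v string) string {
	return strings.TrimSuffix(v, "e") + "ing"
}

func main() {
	fmt.Println(pastTense("delete")) // deleted
	fmt.Println(gerund("delete"))    // deleting
}
```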
Step 3: Single noun transforms
For each noun, toggle singular ↔ plural:
- Singular → PluralForm(base)
- Plural → base form
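The toggle can be sketched with a naive "s" suffix rule; the package's PluralForm presumably handles richer cases ("box" to "boxes"), so this is an illustrative assumption only.

```go
package main

import (
	"fmt"
	"strings"
)

// togglePlural flips a noun between singular and plural using a
// naive trailing-"s" rule.
func togglePlural(n string) string {
	if strings.HasSuffix(n, "s") {
		return strings.TrimSuffix(n, "s") // plural → singular
	}
	return n + "s" // singular → plural
}

func main() {
	fmt.Println(togglePlural("file"))  // files
	fmt.Println(togglePlural("files")) // file
}
```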
Step 4: Combined transforms
For each (verb, noun) pair, generate combined variants:
- (past + plural), (gerund + plural), (base + plural)
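Putting steps 2–4 together, the expansion loop can be sketched as below. The one-line transform helpers and the explicit verb/noun positions are simplifying assumptions; the real Multiplier uses its grammar tables and locates verbs and nouns via the tokeniser.

```go
package main

import (
	"fmt"
	"strings"
)

// Naive stand-ins for the package's transforms (regular forms only).
func past(v string) string   { return strings.TrimSuffix(v, "e") + "ed" }
func gerund(v string) string { return strings.TrimSuffix(v, "e") + "ing" }
func plural(n string) string { return n + "s" }

// expand applies steps 2-4 in a fixed order and deduplicates with a
// seen-set, so (base + plural) collapses into the step-3 variant.
func expand(tokens []string, verbAt, nounAt int) []string {
	join := func(ts []string) string { return strings.Join(ts, " ") }
	apply := func(i int, f func(string) string) []string {
		out := append([]string(nil), tokens...) // copy before mutating
		out[i] = f(out[i])
		return out
	}
	seen := map[string]bool{}
	var variants []string
	add := func(s string) {
		if !seen[s] { // identical strings are skipped
			seen[s] = true
			variants = append(variants, s)
		}
	}
	add(join(tokens)) // original
	// Step 2: single verb transforms.
	add(join(apply(verbAt, past)))
	add(join(apply(verbAt, gerund)))
	// Step 3: single noun transform (singular → plural here).
	add(join(apply(nounAt, plural)))
	// Step 4: combined (verb transform + plural noun).
	for _, f := range []func(string) string{past, gerund} {
		ts := apply(verbAt, f)
		ts[nounAt] = plural(ts[nounAt])
		add(join(ts))
	}
	return variants
}

func main() {
	for _, v := range expand([]string{"Delete", "the", "configuration", "file"}, 0, 3) {
		fmt.Println(v)
	}
}
```

Run on the example sentence, this sketch reproduces the six variants listed under "How It Works", in the same order.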
Deduplication
A map[string]bool tracks outputs already seen; identical strings are skipped. This handles cases such as the base form duplicating the original.
Case Preservation
Transforms preserve the original capitalisation:
| Original Case | Transform | Result |
|---|---|---|
| "DELETE" (ALL CAPS) | past tense | "DELETED" |
| "Delete" (Title) | gerund | "Deleting" |
| "delete" (lower) | past tense | "deleted" |
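The table's behaviour can be sketched with a small case-matching helper. This is an ASCII-only sketch under the assumption that transforms run on the lowercase base form and then reapply the source word's casing; `matchCase` is not the package's actual helper.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// matchCase reapplies the casing pattern of src (ALL CAPS, Title,
// or lower) to dst. ASCII-only for brevity.
func matchCase(src, dst string) string {
	switch {
	case src == strings.ToUpper(src):
		return strings.ToUpper(dst) // "DELETE" → "DELETED"
	case src != "" && unicode.IsUpper(rune(src[0])):
		return strings.ToUpper(dst[:1]) + dst[1:] // "Delete" → "Deleting"
	}
	return dst // "delete" → "deleted"
}

func main() {
	fmt.Println(matchCase("DELETE", "deleted"))
	fmt.Println(matchCase("Delete", "deleting"))
	fmt.Println(matchCase("delete", "deleted"))
}
```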
Reconstruction
Tokens are rejoined with spaces, except punctuation which attaches to the preceding token:
["Delete", "file", "..."] → "Delete file..."
["The", "tests", ",", "passed"] → "The tests, passed"
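The join rule can be sketched as below; `rejoin` and `isPunct` are hypothetical names for illustration, not the package's exported API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isPunct reports whether a token consists entirely of punctuation.
func isPunct(t string) bool {
	for _, r := range t {
		if !unicode.IsPunct(r) {
			return false
		}
	}
	return t != ""
}

// rejoin concatenates tokens with spaces, attaching punctuation
// tokens directly to the preceding token.
func rejoin(tokens []string) string {
	var b strings.Builder
	for i, t := range tokens {
		if i > 0 && !isPunct(t) {
			b.WriteByte(' ')
		}
		b.WriteString(t)
	}
	return b.String()
}

func main() {
	fmt.Println(rejoin([]string{"Delete", "file", "..."}))      // Delete file...
	fmt.Println(rejoin([]string{"The", "tests", ",", "passed"})) // The tests, passed
}
```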
Round-Trip Guarantee
Every variant can be tokenised back through the reversal engine:
original := "Delete the configuration file"
variants := m.Expand(original)
tok := reversal.NewTokeniser()
origImp := reversal.NewImprint(tok.Tokenise(original))
for _, v := range variants {
    varImp := reversal.NewImprint(tok.Tokenise(v))
    sim := origImp.Similar(varImp)
    _ = sim // sim >= 0.2 — variants share grammatical structure with original
}
Variants that fail round-trip tokenisation are a signal that the grammar tables need expanding.
Use Cases
- LEM Training Pipeline: Expand scored seeds to increase training set size without changing semantic content
- Grammar Coverage Testing: Generate edge cases to test tokeniser robustness
- Imprint Calibration: Variants of the same text should produce converging imprints — if they don't, the imprint weights need adjustment