Multiplier — Deterministic Training Data Augmentation
The Multiplier generates grammatical variants of input text without any LLM calls. It produces deterministic, round-trip-verified output suitable for training-data augmentation and grammar-coverage testing.
How It Works
m := reversal.NewMultiplier()
variants := m.Expand("Delete the configuration file")
Output (deterministic order):
- Original: "Delete the configuration file"
- Past tense: "Deleted the configuration file"
- Gerund: "Deleting the configuration file"
- Plural noun: "Delete the configuration files"
- Past + plural: "Deleted the configuration files"
- Gerund + plural: "Deleting the configuration files"
Variant Generation Algorithm
Expand() applies transforms in a fixed order:
Step 1: Tokenise input
The tokeniser classifies each word as verb, noun, article, word, punctuation, or unknown.
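The classification step can be sketched with small lookup tables. This is a minimal sketch: the word lists, `Kind` type, and `classify` function below are illustrative assumptions, not the package's actual grammar tables or API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Kind mirrors the token classes named above.
type Kind int

const (
	Verb Kind = iota
	Noun
	Article
	Word
	Punct
	Unknown
)

// Illustrative stand-ins for the package's grammar tables.
var (
	verbs    = map[string]bool{"delete": true, "create": true}
	nouns    = map[string]bool{"file": true, "test": true}
	articles = map[string]bool{"the": true, "a": true, "an": true}
)

// classify assigns one kind per token: punctuation first, then the
// word lists, then the generic "word" class as a fallback.
func classify(w string) Kind {
	if w == "" {
		return Unknown
	}
	if len(w) == 1 && unicode.IsPunct(rune(w[0])) {
		return Punct
	}
	switch lw := strings.ToLower(w); {
	case verbs[lw]:
		return Verb
	case nouns[lw]:
		return Noun
	case articles[lw]:
		return Article
	}
	return Word
}

func main() {
	for _, w := range []string{"Delete", "the", "configuration", "file", ","} {
		fmt.Printf("%q %d\n", w, classify(w))
	}
}
```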
Step 2: Single verb transforms
For each verb in the token list, generate three variants:
- Past tense: apply PastTense(base)
- Gerund: apply Gerund(base)
- Base form: keep as-is (deduplicates with the original if already in base form)
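The two verb transforms can be sketched with naive regular-verb suffix rules. These rules are an assumption for illustration; the package's real PastTense and Gerund tables presumably also handle irregular verbs.

```go
package main

import (
	"fmt"
	"strings"
)

// pastTense applies a regular-verb rule: drop a trailing silent "e",
// then append "ed". Irregular verbs would need table lookups.
func pastTense(v string) string {
	return strings.TrimSuffix(v, "e") + "ed"
}

// gerund applies the same rule with "ing".
func gerund(v string) string {
	return strings.TrimSuffix(v, "e") + "ing"
}

func main() {
	fmt.Println(pastTense("delete")) // deleted
	fmt.Println(gerund("delete"))    // deleting
}
```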
Step 3: Single noun transforms
For each noun, toggle singular ↔ plural:
- Singular → PluralForm(base)
- Plural → base form
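The toggle can be sketched with a naive "s" suffix rule; the package's PluralForm presumably handles richer cases ("box" to "boxes"), so this is an illustrative assumption only.

```go
package main

import (
	"fmt"
	"strings"
)

// togglePlural flips a noun between singular and plural using a
// naive trailing-"s" rule.
func togglePlural(n string) string {
	if strings.HasSuffix(n, "s") {
		return strings.TrimSuffix(n, "s") // plural → singular
	}
	return n + "s" // singular → plural
}

func main() {
	fmt.Println(togglePlural("file"))  // files
	fmt.Println(togglePlural("files")) // file
}
```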
Step 4: Combined transforms
For each (verb, noun) pair, generate combined variants:
- (past + plural), (gerund + plural), (base + plural)
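Putting steps 2–4 together, the expansion loop can be sketched as below. The one-line transform helpers and the explicit verb/noun positions are simplifying assumptions; the real Multiplier uses its grammar tables and locates verbs and nouns via the tokeniser.

```go
package main

import (
	"fmt"
	"strings"
)

// Naive stand-ins for the package's transforms (regular forms only).
func past(v string) string   { return strings.TrimSuffix(v, "e") + "ed" }
func gerund(v string) string { return strings.TrimSuffix(v, "e") + "ing" }
func plural(n string) string { return n + "s" }

// expand applies steps 2-4 in a fixed order and deduplicates with a
// seen-set, so (base + plural) collapses into the step-3 variant.
func expand(tokens []string, verbAt, nounAt int) []string {
	join := func(ts []string) string { return strings.Join(ts, " ") }
	apply := func(i int, f func(string) string) []string {
		out := append([]string(nil), tokens...) // copy before mutating
		out[i] = f(out[i])
		return out
	}
	seen := map[string]bool{}
	var variants []string
	add := func(s string) {
		if !seen[s] { // identical strings are skipped
			seen[s] = true
			variants = append(variants, s)
		}
	}
	add(join(tokens)) // original
	// Step 2: single verb transforms.
	add(join(apply(verbAt, past)))
	add(join(apply(verbAt, gerund)))
	// Step 3: single noun transform (singular → plural here).
	add(join(apply(nounAt, plural)))
	// Step 4: combined (verb transform + plural noun).
	for _, f := range []func(string) string{past, gerund} {
		ts := apply(verbAt, f)
		ts[nounAt] = plural(ts[nounAt])
		add(join(ts))
	}
	return variants
}

func main() {
	for _, v := range expand([]string{"Delete", "the", "configuration", "file"}, 0, 3) {
		fmt.Println(v)
	}
}
```

Run on the example sentence, this sketch reproduces the six variants listed under "How It Works", in the same order.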
Deduplication
A map[string]bool tracks outputs already seen; identical strings are skipped. This handles cases such as the base form duplicating the original.
Case Preservation
Transforms preserve the original capitalisation:
| Original Case | Transform | Result |
|---|---|---|
| "DELETE" (ALL CAPS) | past tense | "DELETED" |
| "Delete" (Title) | gerund | "Deleting" |
| "delete" (lower) | past tense | "deleted" |
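The table's behaviour can be sketched with a small case-matching helper. This is an ASCII-only sketch under the assumption that transforms run on the lowercase base form and then reapply the source word's casing; `matchCase` is not the package's actual helper.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// matchCase reapplies the casing pattern of src (ALL CAPS, Title,
// or lower) to dst. ASCII-only for brevity.
func matchCase(src, dst string) string {
	switch {
	case src == strings.ToUpper(src):
		return strings.ToUpper(dst) // "DELETE" → "DELETED"
	case src != "" && unicode.IsUpper(rune(src[0])):
		return strings.ToUpper(dst[:1]) + dst[1:] // "Delete" → "Deleting"
	}
	return dst // "delete" → "deleted"
}

func main() {
	fmt.Println(matchCase("DELETE", "deleted"))
	fmt.Println(matchCase("Delete", "deleting"))
	fmt.Println(matchCase("delete", "deleted"))
}
```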
Reconstruction
Tokens are rejoined with spaces, except punctuation which attaches to the preceding token:
["Delete", "file", "..."] → "Delete file..."
["The", "tests", ",", "passed"] → "The tests, passed"
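The join rule can be sketched as below; `rejoin` and `isPunct` are hypothetical names for illustration, not the package's exported API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isPunct reports whether a token consists entirely of punctuation.
func isPunct(t string) bool {
	for _, r := range t {
		if !unicode.IsPunct(r) {
			return false
		}
	}
	return t != ""
}

// rejoin concatenates tokens with spaces, attaching punctuation
// tokens directly to the preceding token.
func rejoin(tokens []string) string {
	var b strings.Builder
	for i, t := range tokens {
		if i > 0 && !isPunct(t) {
			b.WriteByte(' ')
		}
		b.WriteString(t)
	}
	return b.String()
}

func main() {
	fmt.Println(rejoin([]string{"Delete", "file", "..."}))      // Delete file...
	fmt.Println(rejoin([]string{"The", "tests", ",", "passed"})) // The tests, passed
}
```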
Round-Trip Guarantee
Every variant can be tokenised back through the reversal engine:
original := "Delete the configuration file"
variants := m.Expand(original)
tok := reversal.NewTokeniser()
origImp := reversal.NewImprint(tok.Tokenise(original))
for _, v := range variants {
    varImp := reversal.NewImprint(tok.Tokenise(v))
    sim := origImp.Similar(varImp)
    _ = sim // sim >= 0.2 — variants share grammatical structure with original
}
Variants that fail round-trip tokenisation are a signal that the grammar tables need expanding.
Use Cases
- LEM Training Pipeline: Expand scored seeds to increase training set size without changing semantic content
- Grammar Coverage Testing: Generate edge cases to test tokeniser robustness
- Imprint Calibration: Variants of the same text should produce converging imprints — if they don't, the imprint weights need adjustment