go-i18n/docs/multiplier.md
Snider 1f4883e92a
All checks were successful
Security Scan / security (push) Successful in 10s
Test / test (push) Successful in 37s
docs: add human-friendly documentation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 13:02:39 +00:00

4.1 KiB

title description
Multiplier Deterministic grammatical variant generation for training data augmentation.

Multiplier

The Multiplier generates grammatical variants of input text without any LLM calls. It produces deterministic, round-trip verified output suitable for training data augmentation and grammar coverage testing.

Quick Start

m := reversal.NewMultiplier()
variants := m.Expand("Delete the configuration file")

Output (deterministic order):

  1. Delete the configuration file (original)
  2. Deleted the configuration file (past tense)
  3. Deleting the configuration file (gerund)
  4. Delete the configuration files (plural noun)
  5. Deleted the configuration files (past + plural)
  6. Deleting the configuration files (gerund + plural)

Creating a Multiplier

// English (default)
m := reversal.NewMultiplier()

// Language-specific
m := reversal.NewMultiplierForLang("en")

The Multiplier wraps a Tokeniser internally, using it to classify input tokens before applying transforms.

Variant Generation Algorithm

Expand() applies transforms in a fixed order:

Step 1: Tokenise Input

The tokeniser classifies each word as verb, noun, article, word, punctuation, or unknown.

Step 2: Single Verb Transforms

For each verb in the token list, generate three variants:

  • Past tense: apply PastTense(base)
  • Gerund: apply Gerund(base)
  • Base form: keep as-is (deduplicates with original if already in base form)

Step 3: Single Noun Transforms

For each noun, toggle singular/plural:

  • Singular -> PluralForm(base)
  • Plural -> base form (singular)

Step 4: Combined Transforms

For each (verb, noun) pair, generate combined variants:

  • past + noun toggle
  • gerund + noun toggle
  • base + noun toggle

Deduplication

A map[string]bool tracks seen outputs. Identical strings are skipped. This handles cases like the base form duplicating the original.

Case Preservation

Transforms preserve the original capitalisation pattern:

Original Case Transform Result
"DELETE" (all caps) past tense "DELETED"
"Delete" (title case) gerund "Deleting"
"delete" (lower case) past tense "deleted"

The preserveCase() function detects three patterns: all-uppercase, title-case (first letter upper), and lowercase.

Reconstruction

Tokens are rejoined with spaces, except punctuation tokens which attach to the preceding token:

["Delete", "file", "..."] -> "Delete file..."
["The", "tests", ",", "passed"] -> "The tests, passed"

Round-Trip Guarantee

Every variant can be tokenised back through the reversal engine:

original := "Delete the configuration file"
variants := m.Expand(original)

tok := reversal.NewTokeniser()
origImp := reversal.NewImprint(tok.Tokenise(original))

for _, v := range variants {
    varImp := reversal.NewImprint(tok.Tokenise(v))
    sim := origImp.Similar(varImp)
    // sim >= 0.2 -- variants share grammatical structure with original
}

Variants that fail round-trip tokenisation are a signal that the grammar tables need expanding.

Typical Output Counts

The number of variants depends on the verb and noun count in the input:

Input Verbs Nouns Variants
"Delete the file" 1 1 5-7
"Build and test the project" 2 1 8-12
"Run" 1 0 3
"The configuration file" 0 1 2
"Hello" 0 0 1

Use Cases

LEM Training Pipeline

Expand scored seed data to increase training set size without changing semantic content. A corpus of 88K seeds can be multiplied to 400K-600K training samples.

Grammar Coverage Testing

Generate edge cases to test tokeniser robustness. If a variant fails to tokenise or round-trip correctly, it reveals gaps in the grammar tables.

Imprint Calibration

Variants of the same text should produce converging imprints. If variants of "Delete the file" produce wildly different imprints, the imprint weights need adjustment. This provides an automated self-test for the GrammarImprint system.