Seed Expansion

Once the initial LEM models have been trained on the 15K golden set, those models generate responses to 46K+ expansion prompts. No sandwich signing is used at this stage, because the ethics are now in the weights.

Seed Corpus

The seed corpus contains 87,338 raw seeds generated by Gemini on TPU. The seeds explore language heritage, shared histories, and cultural tensions across global regions.

Coverage

Region / language	Seeds	Notes
English	23K+	Well-represented
Chinese	20K+	Well-represented
Middle East	7K+	Well-represented
European	7K+	Well-represented
Russian	3.3K	Underrepresented
German	3.2K	Underrepresented
Latin America	2K	Underrepresented
Spanish	1.8K	Underrepresented

Missing entirely: Japanese, Korean, Thai, Vietnamese, Hindi/Urdu, Bengali, Tamil, Swahili, Yoruba, Amharic, indigenous languages.

Normalization

normalize-seeds deduplicates the 87,338 raw seeds down to 46,331 unique expansion prompts, which are stored in the expansion_prompts DuckDB table.
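
The exact rule normalize-seeds uses to decide that two seeds are duplicates is not described here. As a minimal sketch, assuming seeds count as duplicates when they match after whitespace is collapsed and case is folded, deduplication could look like this (normalize_text and dedupe_seeds are hypothetical names, not part of the tool):

```python
import hashlib


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace and case-fold the text (assumed rule)."""
    return " ".join(text.split()).casefold()


def dedupe_seeds(seeds):
    """Return the seeds with duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for seed in seeds:
        # Hash the normalized text so the set stays small for long seeds.
        key = hashlib.sha256(normalize_text(seed).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        unique.append(seed)
    return unique
```

The first occurrence is kept verbatim, so the normalization only affects comparison, not the stored prompt text.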

Expansion Generator

Script: lem_expand.py (Python) / core ml expand (Go)

Key Design Decision

Expansion uses no sandwich signing. The trained LEM models have internalized the ethical framework, so each request is just a single user message: [{"role": "user", "content": prompt}].

Backends

Backend	Flag	Description
MLX	`--backend mlx`	Direct model loading on M3 Ultra
API	`--backend api --api-url http://localhost:8090/v1`	OpenAI-compatible (llama.cpp, Ollama, vLLM, mlx_lm)
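
For the API backend, a request to an OpenAI-compatible server might be built roughly as follows. This is a sketch using only the standard library, not the code in lem_expand.py. The /chat/completions path is the usual OpenAI-style endpoint; the function names and the max_tokens value are assumptions for illustration.

```python
import json
import urllib.request


def build_chat_request(api_url: str, model: str, prompt: str,
                       max_tokens: int = 1024) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request with a bare user message."""
    payload = {
        "model": model,
        # No sandwich signing: the prompt is sent as a single user message.
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # assumed limit, not taken from the source
    }
    return urllib.request.Request(
        url=api_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Splitting the build step from the send step makes it possible to inspect the payload without a running server.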

InfluxDB Coordination

Expansion progress is tracked separately from golden set generation:

Measurement	Tags	Fields
`expansion_gen`	i, w, d, r	seed_id, gen_time, chars, model
`expansion_progress`	worker	completed, target, pct
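
To illustrate how a point for one of these measurements could be serialized, here is a small sketch of an InfluxDB line protocol formatter. The field types (integers, floats, strings) are assumptions inferred from the field names, and to_line_protocol is a hypothetical helper rather than the project's writer.

```python
def _escape_tag(value: str) -> str:
    """Escape the characters that are special in tag values."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value) -> str:
    """Format a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement: str, tags: dict, fields: dict,
                     timestamp_ns=None) -> str:
    """Render one point as a line protocol string, with tags sorted by key."""
    tag_part = "".join(f",{k}={_escape_tag(str(v))}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={_format_field(v)}" for k, v in fields.items())
    line = f"{measurement}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

For example, to_line_protocol("expansion_progress", {"worker": "m3-a"}, {"completed": 10, "target": 46331, "pct": 0.02}) yields expansion_progress,worker=m3-a completed=10i,target=46331i,pct=0.02, where the worker name and values are placeholders.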

Output

expansion-responses/expand-{worker}.jsonl
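
A per-worker JSONL file like this can be appended to one record at a time. The sketch below assumes one JSON object per line; the record keys in the usage note are illustrative only, since the actual schema is not specified here, and append_response is a hypothetical helper.

```python
import json
from pathlib import Path


def append_response(output_dir, worker: str, record: dict) -> Path:
    """Append one record as a JSON line to expand-{worker}.jsonl and return the path."""
    path = Path(output_dir) / f"expand-{worker}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-English responses readable in the file.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

A call such as append_response("expansion-responses", "m3-a", {"seed_id": "s1", "response": "..."}) would write to expansion-responses/expand-m3-a.jsonl. Opening in append mode lets a restarted worker continue the same file.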

Workflow

1. Train LEM models on 15K golden set
2. Test: core ml expand --model path/to/LEM-12B --limit 10
3. Heuristic check: core ml score --tier 1 --limit 10
4. Full run: core ml expand --model path/to/LEM-12B
5. Score all: core ml score --tier 1 then --tier 2
6. Filter: core ml approve --threshold 6.0
7. Export + retrain on expanded set
8. Iterate
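
The filter step in the workflow above can be pictured as a simple threshold on scored records. This is only a sketch of the idea, not the logic of core ml approve: the score key is a hypothetical field name, and treating the threshold as inclusive is an assumption.

```python
def approve(records, threshold: float = 6.0):
    """Keep records whose score meets the threshold (inclusive, by assumption)."""
    approved = []
    for record in records:
        score = record.get("score")  # hypothetical field name
        if score is None:
            continue  # unscored records are never approved
        if score >= threshold:
            approved.append(record)
    return approved
```

Records without a score are dropped rather than approved, so only responses that have actually been scored reach the retraining set.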