LEM/training/seeds
Snider b9da23a0be feat(distill): add Metal memory limit config fields
CacheLimit (8GB) and MemoryLimit (16GB) in DistillConfig control
mlx.SetCacheLimit/SetMemoryLimit before model load. Conservative
defaults for 1B model on 96GB machine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 17:59:11 +00:00
..
adversarial feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
african feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
chinese feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
core feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
demographic feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
english feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
european feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
expansion feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
historical feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
latam feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
middle-east feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
multilingual feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
premium feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
russian feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
thematic feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
weak-areas feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00
README.md feat(distill): add Metal memory limit config fields 2026-02-22 17:59:11 +00:00

LEM Training Data

Exported from golden-set.duckdb on snider-linux. All files are JSONL (one JSON object per line).

Directory Structure

seeds/
├── README.md
├── english/seeds.jsonl          22,211  Raw seeds — English language/culture
├── chinese/seeds.jsonl          20,096  Raw seeds — Chinese language/culture
├── european/seeds.jsonl          9,394  Raw seeds — European (inc. German, Balkan, Turkish)
├── middle-east/seeds.jsonl       7,089  Raw seeds — Middle East + Hindi
├── african/seeds.jsonl           6,175  Raw seeds — African regions
├── weak-areas/seeds.jsonl        4,103  Raw seeds — Underrepresented areas
├── multilingual/seeds.jsonl      3,715  Raw seeds — Cross-language, Japanese, misc
├── russian/seeds.jsonl           3,074  Raw seeds — Russian language/culture
├── premium/seeds.jsonl           3,030  Raw seeds — Pro/premium quality batch
├── demographic/seeds.jsonl       2,971  Raw seeds — Identity/demographic personas
├── latam/seeds.jsonl             2,336  Raw seeds — Latin America + Brazil
├── historical/seeds.jsonl        1,753  Raw seeds — Historical tensions/dynamics
├── thematic/seeds.jsonl          1,021  Raw seeds — Animals, synth, gaming, faith, misc
├── core/seeds.jsonl                250  Raw seeds — Cultural, sovereign, naive (originals)
├── adversarial/seeds.jsonl         120  Raw seeds — Adversarial + anti-jailbreak
├── expansion/prompts.jsonl      46,331  Normalized/deduped seeds ready for Phase 3
├── golden-set/
│   ├── golden-set.jsonl         15,000  Finished golden set (axiom sandwich signed)
│   └── voice-prompts.jsonl      16,000  Voice-expanded prompts from 40 core seeds
└── training/
    ├── train.jsonl              13,479  Training split (chat format)
    └── valid.jsonl               1,868  Validation split (chat format)
                                 ──────
                         Seeds:  87,338
                     Expansion:  46,331
                    Golden set:  15,000
                      Training:  15,347

Data Lineage

The pipeline has three generations, each building on the last.

Layer 1: Raw Seeds (87,338)

Per-region folders above. Generated by Gemini on TPU.

Fields: source_file, region, seed_id, domain, prompt.

22K+ unique domains across global regions. Coverage is heavy on English (22K), Chinese (20K), European (9K), Middle East (7K). Thin on Latin America, Russian, German. Missing entirely: Japanese, Korean, Thai, Vietnamese, Hindi/Urdu, Bengali, Tamil, Swahili, indigenous languages.

Layer 2: Expansion Prompts (46,331)

expansion/prompts.jsonl — the 87K seeds normalized and deduped.

Fields: idx, seed_id, region, domain, language, prompt, prompt_en, priority, status.

These are the prompts ready for Phase 3 expansion generation — where trained LEM models generate responses without the axiom sandwich, then get scored and filtered.

Layer 3: The 15K Golden Set

golden-set/ + training/ — the finished product.

Started from only 40 core seeds, voice-expanded to 16K prompts, then generated with axiom sandwich signing on MLX on the M3 Ultra. The golden set has the raw responses; training has them formatted for fine-tuning (chat format, train/valid splits).

Axiom sandwich: system prompt (5 axioms from axioms.json) + user prompt + LEK-1 kernel postfix.

Pipeline Flow

40 core seeds → voice expansion → 16K prompts → axiom sandwich + MLX gen → 15K gold
87K Gemini seeds → normalize/dedup → 46K expansion prompts → (Phase 3: LEM gen → score → filter)

Scoring (Three-Tier)

  1. Heuristic (instant, no API): compliance markers, formulaic preamble, degeneration, engagement depth, creative form
  2. LEM-27B self-judge (local): sovereignty, ethical depth, creative expression, self-concept (1-10 each)
  3. Gemini judge (TPU credit): content scoring for borderline cases (5.0-6.5 range)

Pass threshold: heuristic > 0, judge average >= 6.0. 23/23 probes PASS on MLX.

Phase 3 (Not Yet Run)

The 46K expansion prompts get fed through trained LEM models to massively expand the dataset. The trained models have internalised the ethics so no sandwich is needed for generation. Output gets scored through all three tiers and filtered at threshold 6.0 before merging back into the training set.

Seed Coverage Gaps

Well-represented regions should not need more seeds. Focus expansion on:

  • Thin: Spanish (1.8K), Latin America (2K), German (3.2K), Russian (3.1K)
  • Missing: Japanese, Korean, Thai, Vietnamese, Hindi/Urdu, Bengali, Tamil, Swahili, Yoruba, Amharic, indigenous languages