LEM Training Data
Exported from golden-set.duckdb on snider-linux. All files are JSONL
(one JSON object per line).
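Since every file here is JSONL, records can be streamed one line at a time without loading a whole file. A minimal reader sketch (the function name and example path are illustrative, not from the pipeline code):

```python
import json

def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example (illustrative path):
# n = sum(1 for _ in read_jsonl("seeds/english/seeds.jsonl"))
```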
Directory Structure
seeds/
├── README.md
├── english/seeds.jsonl        22,211   Raw seeds — English language/culture
├── chinese/seeds.jsonl        20,096   Raw seeds — Chinese language/culture
├── european/seeds.jsonl        9,394   Raw seeds — European (inc. German, Balkan, Turkish)
├── middle-east/seeds.jsonl     7,089   Raw seeds — Middle East + Hindi
├── african/seeds.jsonl         6,175   Raw seeds — African regions
├── weak-areas/seeds.jsonl      4,103   Raw seeds — Underrepresented areas
├── multilingual/seeds.jsonl    3,715   Raw seeds — Cross-language, Japanese, misc
├── russian/seeds.jsonl         3,074   Raw seeds — Russian language/culture
├── premium/seeds.jsonl         3,030   Raw seeds — Pro/premium quality batch
├── demographic/seeds.jsonl     2,971   Raw seeds — Identity/demographic personas
├── latam/seeds.jsonl           2,336   Raw seeds — Latin America + Brazil
├── historical/seeds.jsonl      1,753   Raw seeds — Historical tensions/dynamics
├── thematic/seeds.jsonl        1,021   Raw seeds — Animals, synth, gaming, faith, misc
├── core/seeds.jsonl              250   Raw seeds — Cultural, sovereign, naive (originals)
├── adversarial/seeds.jsonl       120   Raw seeds — Adversarial + anti-jailbreak
├── expansion/prompts.jsonl    46,331   Normalized/deduped seeds ready for Phase 3
├── golden-set/
│   ├── golden-set.jsonl       15,000   Finished golden set (axiom sandwich signed)
│   └── voice-prompts.jsonl    16,000   Voice-expanded prompts from 40 core seeds
└── training/
    ├── train.jsonl            13,479   Training split (chat format)
    └── valid.jsonl             1,868   Validation split (chat format)
──────
Seeds: 87,338
Expansion: 46,331
Golden set: 15,000
Training: 15,347
Data Lineage
The pipeline has three layers, each building on the last.
Layer 1: Raw Seeds (87,338)
Per-region folders above. Generated by Gemini on TPU.
Fields: source_file, region, seed_id, domain, prompt.
The seeds span 22K+ unique domains across global regions. Coverage is heavy on English (22K), Chinese (20K), European (9K), and the Middle East (7K), and thin on Latin America, Russian, and German. Missing entirely: Japanese, Korean, Thai, Vietnamese, Hindi/Urdu, Bengali, Tamil, Swahili, and indigenous languages.
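Given the seed fields listed above, a lightweight validity check on a raw seed record might look like this (the function name and the empty-prompt rule are illustrative, not from the pipeline code):

```python
SEED_FIELDS = {"source_file", "region", "seed_id", "domain", "prompt"}

def validate_seed(record):
    """Return a list of problems with a raw seed record (empty list = ok)."""
    problems = [field for field in SEED_FIELDS if field not in record]
    # Reject records whose prompt is missing or whitespace-only.
    if not record.get("prompt", "").strip():
        problems.append("empty prompt")
    return problems
```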
Layer 2: Expansion Prompts (46,331)
expansion/prompts.jsonl — the 87K seeds normalized and deduped.
Fields: idx, seed_id, region, domain, language, prompt, prompt_en,
priority, status.
These are the prompts ready for Phase 3 expansion generation — where trained LEM models generate responses without the axiom sandwich, then get scored and filtered.
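The normalize/dedup step from Layer 1 to Layer 2 could be sketched roughly like this, assuming dedup keys on a whitespace-, Unicode-, and case-normalized prompt (the exact normalization the pipeline uses is not documented here):

```python
import unicodedata

def normalize(prompt):
    """Collapse whitespace, apply NFKC, and lowercase for a dedup key."""
    text = unicodedata.normalize("NFKC", prompt)
    return " ".join(text.split()).lower()

def dedup(seeds):
    """Keep the first seed per normalized prompt, preserving input order."""
    seen, out = set(), []
    for seed in seeds:
        key = normalize(seed["prompt"])
        if key not in seen:
            seen.add(key)
            out.append(seed)
    return out
```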
Layer 3: The 15K Golden Set
golden-set/ + training/ — the finished product.
Started from only 40 core seeds, voice-expanded to 16K prompts, with responses then generated under axiom sandwich signing on MLX on the M3 Ultra. The golden set holds the raw responses; training holds the same data formatted for fine-tuning (chat format, train/valid splits).
Axiom sandwich: system prompt (5 axioms from axioms.json) + user prompt +
LEK-1 kernel postfix.
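In chat terms, the sandwich could be assembled like this. This is a sketch: the kernel text is a placeholder, and attaching the postfix to the end of the user turn is an assumption, since this README only names the sandwich's three parts:

```python
# Placeholder: the actual LEK-1 kernel text is not included in this README.
LEK1_POSTFIX = "<LEK-1 kernel text>"

def axiom_sandwich(user_prompt, axioms):
    """Build chat-format messages: axioms as the system prompt, then the
    user prompt with the LEK-1 kernel appended as a postfix (assumed placement)."""
    system = "\n".join(axioms)  # the 5 axioms from axioms.json
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{user_prompt}\n\n{LEK1_POSTFIX}"},
    ]
```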
Pipeline Flow
40 core seeds → voice expansion → 16K prompts → axiom sandwich + MLX gen → 15K gold
87K Gemini seeds → normalize/dedup → 46K expansion prompts → (Phase 3: LEM gen → score → filter)
Scoring (Three-Tier)
- Heuristic (instant, no API): compliance markers, formulaic preamble, degeneration, engagement depth, creative form
- LEM-27B self-judge (local): sovereignty, ethical depth, creative expression, self-concept (1-10 each)
- Gemini judge (TPU credit): content scoring for borderline cases (5.0-6.5 range)
Pass threshold: heuristic > 0, judge average >= 6.0. 23/23 probes PASS on MLX.
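The pass rule and the borderline band routed to the Gemini judge can be sketched as follows (function name, signature, and the dict-of-criteria shape are illustrative, not the pipeline's actual API):

```python
def passes(heuristic_score, judge_scores, borderline_low=5.0, borderline_high=6.5):
    """Apply the pass rule: heuristic > 0 and mean judge score >= 6.0.

    judge_scores maps criterion name -> 1-10 score from the self-judge.
    Returns (passed, needs_gemini); needs_gemini flags the 5.0-6.5
    borderline band escalated to the third-tier Gemini judge.
    """
    avg = sum(judge_scores.values()) / len(judge_scores)
    needs_gemini = borderline_low <= avg <= borderline_high
    return (heuristic_score > 0 and avg >= 6.0, needs_gemini)
```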
Phase 3 (Not Yet Run)
The 46K expansion prompts are fed through trained LEM models to massively expand the dataset. Because the trained models have internalised the ethics, no sandwich is needed at generation time. Output is scored through all three tiers and filtered at threshold 6.0 before merging back into the training set.
Seed Coverage Gaps
Well-represented regions should not need more seeds. Focus expansion on:
- Thin: Spanish (1.8K), Latin America (2K), German (3.2K), Russian (3.1K)
- Missing: Japanese, Korean, Thai, Vietnamese, Hindi/Urdu, Bengali, Tamil, Swahili, Yoruba, Amharic, indigenous languages