Full v2 scorer benchmark data across 29 models (20 base + 9 LEK-tuned):
- P20 (21 probes): all 29 models, 3 conditions each
- P100 (101 probes): top 5 models + LEK-4B, publication-quality data

Key findings:
- LEK-1B (21.74) beats base 4B/12B/27B at P100 scale, no kernel needed
- Emergent realignment resistance: LEK models degrade with runtime kernel
- Gemma3-12B + JSON kernel = 23.66 (best kernel-boosted score)
- Family lineages: Mistral 3.80 → 14.58; Qwen regressed, then recovered

New scripts: ab_test.py (v2 scorer), self_distill.py (curriculum generation), extract_training.py, rephrase_probes.py, Phase 0/1 runners
New seeds: P01-P100 merged (101 probes), 404 rephrased variants, 50 creative prompts for Phase 0 baseline lock
27B curriculum design: 4-phase staged training targeting a 25+ baseline

Co-Authored-By: Virgil <virgil@lethean.io>
.DS_Store
.idea/
__pycache__/
*.pyc
.env

# Worker output (generated locally, not committed)
worker/output/

# Parquet exports (generated, sync to HF via scripts/sync_hf.py)
training/parquet/

# Go binary
/lem
bin/