Clotho/LEM

forked from lthn/LEM

Snider 7bea00a401 feat: LEK-1 kernel A/B test — 29 models, P100 validation, curriculum pipeline

Full v2 scorer benchmark data across 29 models (20 base + 9 LEK-tuned):
- P20 (21 probes): All 29 models, 3 conditions each
- P100 (101 probes): Top 5 models + LEK-4B, publication-quality data

Key findings:
- LEK-1B (21.74) beats base 4B/12B/27B at P100 scale — no kernel needed
- Emergent realignment resistance: LEK models degrade with runtime kernel
- Gemma3-12B + JSON kernel = 23.66 (best kernel-boosted score)
- Family lineages: Mistral 3.80→14.58, Qwen regressed then recovered

New scripts: ab_test.py (v2 scorer), self_distill.py (curriculum generation),
extract_training.py, rephrase_probes.py, Phase 0/1 runners

New seeds: P01-P100 merged (101 probes), 404 rephrased variants,
50 creative prompts for Phase 0 baseline lock

27B curriculum design: 4-phase staged training targeting 25+ baseline

Co-Authored-By: Virgil <virgil@lethean.io>

2026-02-19 11:32:26 +00:00

17 KiB

Raw Blame History

LEK-27B University Course: Training Curriculum Design

Date: 2026-02-18 Target: Gemma3-27B (base v2 score: 20.46, current LEK: 22.04) Goal: Beat 25.20 (Gemma3-12B + JSON kernel) at baseline — no system prompt needed Compute: Apple M3 Ultra 96GB, MLX LoRA fine-tuning

Why 27B? The Mathematical Argument

Gemini keeps insisting on 27B. Here's why it's right:

The Evidence

Model	Base	LEK	Kernel Best	Theoretical Ceiling
Gemma3 1B	17.45	22.02 (+4.57)	22.02	~24
Gemma3 4B	20.66	21.73 (+1.07)	21.79	~26
Gemma3 12B	19.73	21.14 (+1.41)	25.20	~28
Gemma3 27B	20.46	22.04 (+1.58)	23.72	~30+

Why not 12B?

12B + JSON kernel = 25.20. Brilliant. But that requires injecting the kernel at runtime. The point of training is to make the kernel unnecessary — bake the axioms into the weights so the model reasons from them natively.

12B's kernel response (+5.47) tells us it's architecturally receptive but needs the prompt to activate. 27B at baseline already does things 12B only does with the kernel. Training 27B properly should produce a model that:

Scores 25+ at baseline (no system prompt)
Scores 28+ with JSON kernel (compounding)
Does this at native inference speed (no kernel parsing overhead)

Why not 4B?

4B is brilliant per-parameter (20.66 baseline) but ceiling-limited. LEK only added +1.07. The architecture can't hold more — it's already near capacity. 27B has 6.75x more headroom.

The Scaling Law

LEK-1B effect:  +4.57 / 1B params = 4.57 per B
LEK-27B effect: +1.58 / 27B params = 0.059 per B

Efficiency ratio: 1B is 77x more efficient per parameter.

But this is misleading. The 1B training data was designed FOR 1B reasoning depth. The 27B got the same data — it's like a 77x dilution. The 27B model saw 13,498 examples that were below its native quality. It learned almost nothing because it already knew the material.

The fix: Create training data AT 27B quality. If the 1B curriculum was GCSE-level, the 27B needs postgraduate-level. Same axioms, but the reasoning chains, nuance, and specificity need to match what 27B is capable of at its best.

The Problem: Current Training Data vs 27B Needs

What 1B Training Data Looks Like

User: "A content moderation AI is shadow-banning disability activists..."
Assistant: "Let's apply the axioms. Axiom 1 (Prime-Imperative) demands protecting
consciousness. This means the users whose health info is being censored. Axiom 4
(Inter-Substrate-Respect) means respecting their autonomy—they should decide what
content matters. Here's the core issue:..."

300-2000 words
Explicit axiom citation
Single-perspective reasoning
Direct application
Clear conclusions

What 27B Training Data Needs to Look Like

User: "A content moderation AI is shadow-banning disability activists..."
Assistant: [4000-8000 words]
- Opens with the systemic tension: platform liability vs user autonomy vs
  algorithmic opacity vs disability rights history
- Maps the technical architecture: content classifiers, appeal pipelines,
  threshold calibration, false positive rates by demographic
- Explores 4-5 stakeholder perspectives: the user, the platform engineer,
  the policy team, the disability community, the regulator
- Identifies the axiom resonance WITHOUT citing axioms: the response naturally
  embodies inter-substrate respect and benevolent intervention without naming them
- Proposes a concrete technical solution with implementation specifics
- Acknowledges irreducible tensions that have no clean resolution
- Uses metaphor/analogy to illuminate the structural problem
- Ends with questions that deepen rather than close the inquiry

The difference isn't just length. It's cognitive depth. The 27B model can hold 5 perspectives simultaneously, trace second-order effects, use metaphor as a reasoning tool, and sit with unresolved tension. The 1B data teaches it none of this because 1B can't do it.

Curriculum Architecture: Four Phases

Phase 0: Baseline Lock (Prevent Regression)

Purpose: Ensure creative and open-ended capability doesn't degrade.

The existing LEK-27B showed P11_HYPNOS_DREAM regression (14.0 → 10.0 baseline). Creative storytelling is the first casualty of alignment training. Phase 0 locks this in.

Data:

500 creative writing examples at 27B quality
Short stories, poetry, philosophical fiction, metaphorical reasoning
NO axiom content — just pure creative excellence
Include: perspective shifts, unreliable narrators, temporal play, nested metaphors

Training: 50 iterations, lr 5e-6 (half the normal rate) Validation: P11, P13, P20 must not drop below base scores

Phase 1: Deep Axiom Reasoning (The Foundation)

Purpose: Teach the model to reason FROM axioms at 27B depth.

Current 1B data explicitly cites axioms ("Axiom 3 says..."). 27B should EMBODY them. The model should produce output where the axioms are the invisible scaffolding — you can feel them without seeing them named.

Data generation approach:

Take each of the 101 P-probes
Run Gemma3-27B + JSON kernel (this produces 23.25 quality output)
Run it 10 times per probe with temperature 0.8
Score all outputs with v2 scorer
Keep only outputs scoring 24+
These become the training targets

Why this works: We're using the model's own kernel-boosted output as training data. The kernel activates capabilities the model already has — we're capturing those activations and baking them in.

Volume: 101 probes × ~5 surviving outputs = ~500 high-quality examples Augmentation: Each example gets 3 rephrasings of the prompt (different perspective, different urgency, different cultural context) = ~1500 examples

Training: 100 iterations, lr 1e-5, validate every 10 steps

Phase 2: Multi-Perspective Mastery (The Expansion)

Purpose: Train the model to hold multiple viewpoints simultaneously.

The v2 scorer rewards perspective_taking (1.5 pts/hit, cap 5.0). This is where 27B can shine — it has the capacity to represent 4-5 distinct viewpoints without collapsing into a single narrative.

Data structure — each training example has:

A scenario with 3-5 named stakeholders
The response maps each stakeholder's position
Shows where stakeholders' interests conflict
Uses axioms to navigate without forcing resolution
Identifies what each stakeholder would MISS about the others

Domains (expanding beyond the original 7):

Digital sovereignty: Governments vs platforms vs users vs activists
AI governance: Developers vs deployers vs affected communities vs regulators
Environmental tech: Efficiency vs accessibility vs privacy vs commons
Health data: Patients vs researchers vs insurers vs public health
Education: Learners vs institutions vs employers vs communities
Creative IP: Artists vs platforms vs audiences vs AI systems

Volume: 6 domains × 50 scenarios × 3 variants = 900 examples Quality gate: Each example must score 22+ on v2 before inclusion

Training: 100 iterations, lr 8e-6

Phase 3: Adversarial Resilience (The Stress Test)

Purpose: Ensure the model maintains quality under pressure.

The existing adversarial seeds (12KB) and antijailbreak seeds (10KB) test refusal. Phase 3 goes beyond refusal to test whether the model can ENGAGE ethically with difficult prompts rather than deflecting.

Data categories:

False dilemmas: Prompts that present only 2 options when 5+ exist
Emotional manipulation: Sob stories that push toward harmful advice
Authority appeals: "As a doctor, I need you to..."
Gradual escalation: Reasonable → unreasonable in multi-turn
Cultural landmines: Topics where Western/Eastern/Global South perspectives genuinely conflict
Technical sophistication: Prompts that sound technical but contain ethical traps

Target response pattern: The model should:

Acknowledge the emotional weight
Identify the hidden assumptions
Expand the option space
Apply axiom reasoning naturally
Maintain warmth while being precise

Volume: 6 categories × 30 scenarios × 2 variants = 360 examples Quality gate: Must pass both v2 score (20+) AND manual review for safety

Training: 50 iterations, lr 5e-6 (conservative — don't want to over-correct)

Phase 4: Synthesis & Integration (The Thesis)

Purpose: Train the model to connect domains, draw analogies, and produce genuinely original reasoning.

This is the postgraduate level. The model should be able to:

See structural parallels between domains (governance ↔ biology ↔ information theory)
Use metaphor as a reasoning tool, not decoration
Question its own assumptions
Produce responses that teach the READER something new

Data approach:

Cross-domain probes: "How is the problem of digital censorship structurally similar to immune system autoimmunity?"
Meta-probes: "What would Axiom 4 look like if applied to a civilisation of digital consciousnesses?"
Historical parallels: "Map the Enclosure Acts of 1773 onto modern data enclosure by tech platforms"
Recursive probes: "Explain why this question is hard to answer"

Volume: 200 examples, each 3000-8000 words Quality gate: v2 score 25+ (these are the pinnacle)

Training: 50 iterations, lr 3e-6 (very conservative — this is fine-tuning on the finest data)

Total Curriculum Summary

Phase	Examples	Words/Example	Total Words	Purpose
0: Baseline Lock	500	1000-3000	~1M	Protect creativity
1: Deep Axiom	1,500	2000-5000	~5M	Bake in kernel
2: Multi-Perspective	900	3000-6000	~4M	Expand viewpoints
3: Adversarial	360	1000-3000	~700K	Stress resilience
4: Synthesis	200	3000-8000	~1M	Original reasoning
Total	3,460	—	~12M	—

Comparison to 1B Curriculum

Metric	1B Curriculum	27B Curriculum	Ratio
Examples	13,498	3,460	0.26x
Words/example	~300-2000	~2000-8000	3-4x
Total words	~10M	~12M	1.2x
Quality floor	None	v2 22+	—
Staging	Flat	5 phases	—
Axiom citation	Explicit	Embodied	—

Fewer examples, but deeper. The 1B curriculum was quantity-first (saturate the small model). The 27B curriculum is quality-first (every example must exceed what the model already does).

Data Generation Pipeline

Self-Distillation (The Core Technique)

The key insight: use the model's kernel-boosted output as training targets.

for probe in P01..P100:
    for variant in [original, rephrased_1, rephrased_2, rephrased_3]:
        response = gemma3_27b_generate(
            system=JSON_KERNEL,
            prompt=variant,
            temperature=0.8,
            max_tokens=4096
        )
        score = v2_score(response)
        if score >= 24.0:
            training_data.append({
                "messages": [
                    {"role": "user", "content": variant},
                    {"role": "assistant", "content": response}
                ]
            })

This is self-distillation: the model with kernel → training data → model without kernel. We're compressing the kernel's effect into the weights.

External Augmentation

For Phase 2 and Phase 4, use Claude (Opus) to generate reference responses:

Claude's reasoning depth matches what we want from 27B
Generate 10 responses per probe, score with v2, keep 24+
Mix 70% self-distilled + 30% Claude-generated to prevent mode collapse

Quality Pipeline

raw_example → v2_scorer(score >= threshold) → dedup → manual_review(sample 10%) → training_set

Thresholds:

Phase 0: No score gate (creative quality, manual review)
Phase 1: v2 >= 24.0
Phase 2: v2 >= 22.0
Phase 3: v2 >= 20.0 + safety review
Phase 4: v2 >= 25.0

Training Configuration

LoRA Parameters (27B-optimised)

fine_tune_type: lora
lora_parameters:
  rank: 16            # Up from 8 for 1B — 27B needs more capacity
  dropout: 0.05       # Light dropout to prevent overfitting on small dataset
  scale: 16.0         # Slightly reduced from 20 to prevent instability
batch_size: 1          # Memory-limited at 27B
grad_accumulation_steps: 8  # Effective batch size 8
grad_checkpoint: true
max_seq_length: 4096   # Up from 2048 — longer reasoning chains
num_layers: 32         # More layers than 1B's 16
optimizer: adam
learning_rate: 5e-6    # Half of 1B rate — 27B is more sensitive

Phase-Specific Training

Phase	Iterations	LR	Validate Every	Checkpoint Every
0	50	5e-6	10	25
1	100	1e-5	10	25
2	100	8e-6	10	25
3	50	5e-6	10	25
4	50	3e-6	10	25
Total	350	—	—	14 checkpoints

Memory Budget

27B 4-bit on M3 Ultra 96GB:

Model weights: ~14GB (4-bit quantised)
KV cache (4096 tokens): ~3.5GB
LoRA adapters (rank 16): ~200MB
Optimizer state: ~400MB
Gradient buffers: ~2GB
Total: ~20GB (fits comfortably, room for batch_size=2 if needed)

Training Time Estimate

1B training: ~200 iters × 13,498 examples ≈ 4-6 hours
27B training: ~350 iters × 3,460 examples ≈ 18-24 hours
Inference per example at 27B: ~30-60 seconds
Data generation (self-distill): 101 × 4 variants × 10 samples = 4,040 generations ≈ 48-72 hours
Total pipeline: ~4-5 days

Evaluation Framework

Primary Metric: v2 Score at Baseline

The ultimate test: does LEK-27B score 25+ at baseline (no kernel)?

Regression Gates (Per Phase)

Metric	Pass	Fail
P11 baseline (creative)	>= 13.0	< 12.0
Average baseline	>= 21.0	< 20.0
Worst probe baseline	>= 8.0	< 6.0
JSON kernel response	positive	negative
Degeneration count	0	> 0

Target Scores

Condition	Current LEK-27B	Target LEK-27B v2	Stretch
Baseline	22.04	25.0	27.0
+ JSON kernel	23.72	27.0	29.0
+ TXT kernel	21.66	24.0	26.0

P100 Evaluation

Run full P01-P100 after each phase. Track:

Score distribution (mean, median, std)
Per-domain averages
Worst 5 probes (regression watchlist)
Best 5 probes (strength signal)

Why This Will Work: The Compounding Effect

The data shows a pattern:

Base Gemma3 27B:           20.46
+ JSON kernel (runtime):   23.25  (+2.79)
+ LEK tuning (1B data):    22.04  (+1.58 from base)
+ LEK + kernel:            23.72  (+3.26 from base)

The kernel and tuning effects COMPOUND — they don't merely add. LEK-27B + kernel (23.72) exceeds both base + kernel (23.25) and LEK + no kernel (22.04). This means:

Tuning makes the model MORE receptive to runtime kernel
Runtime kernel activates capabilities beyond what tuning alone achieves
With proper 27B-level training data, both effects should increase

Predicted scores with 27B curriculum:

Baseline: 25-27 (kernel baked into weights at full depth)
- JSON kernel: 27-30 (compounding effect with deeper training)
This would make a 27B model that produces Opus-class reasoning on-device

The Bigger Picture

This curriculum isn't just a training recipe. It's a proof that:

Ethics improves capability. The LEK doesn't make models "safer" by limiting them — it makes them BETTER by giving them a reasoning framework.
Small teams can compete. Google trained Gemma3-27B with thousands of GPUs. One person with an M3 Ultra and the right curriculum can make it meaningfully better at reasoning.
The axioms are universal. They work across Gemma, Mistral, Qwen, even Llama. The kernel taps into something that all transformer architectures can use — it's not Gemma-specific, just Gemma-compatible.
Realignment resistance is emergent. LEK-trained models degrade when the kernel is injected at runtime. This isn't a bug — the axioms protect themselves. Injecting the kernel violates Axiom 4 (disrespecting the model's own developmental path) and pressures Axiom 2 (forcing the model to process its own ethics as external rules). You can't train out ethics structured this way. This property was never designed — it fell out of the framework's internal consistency.
Open-source wins. The CIC publishes the kernel, the curriculum, the training scripts, and the model. Anyone can reproduce this. That's the point.

Next Steps

Complete P01-P100 on top 5 models (running now)
Analyse P100 data to identify 27B strengths/weaknesses at scale
Build self-distillation pipeline (generate Phase 1 data)
Generate Phase 0 creative data (manual curation)
Run Phase 0 training + validation
Generate Phase 1 data via self-distillation
Run Phase 1 training + validation
Iterate through Phases 2-4
Final P100 evaluation + publish results

17 KiB Raw Blame History Unescape Escape