1
0
Fork 0
forked from lthn/LEM
LEM/paper/27b-curriculum-design.md
Snider 7bea00a401 feat: LEK-1 kernel A/B test — 29 models, P100 validation, curriculum pipeline
Full v2 scorer benchmark data across 29 models (20 base + 9 LEK-tuned):
- P20 (21 probes): All 29 models, 3 conditions each
- P100 (101 probes): Top 5 models + LEK-4B, publication-quality data

Key findings:
- LEK-1B (21.74) beats base 4B/12B/27B at P100 scale — no kernel needed
- Emergent realignment resistance: LEK models degrade with runtime kernel
- Gemma3-12B + JSON kernel = 23.66 (best kernel-boosted score)
- Family lineages: Mistral 3.80→14.58, Qwen regressed then recovered

New scripts: ab_test.py (v2 scorer), self_distill.py (curriculum generation),
extract_training.py, rephrase_probes.py, Phase 0/1 runners

New seeds: P01-P100 merged (101 probes), 404 rephrased variants,
50 creative prompts for Phase 0 baseline lock

27B curriculum design: 4-phase staged training targeting 25+ baseline

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 11:32:26 +00:00

17 KiB
Raw Blame History

LEK-27B University Course: Training Curriculum Design

Date: 2026-02-18 Target: Gemma3-27B (base v2 score: 20.46, current LEK: 22.04) Goal: Beat 25.20 (Gemma3-12B + JSON kernel) at baseline — no system prompt needed Compute: Apple M3 Ultra 96GB, MLX LoRA fine-tuning


Why 27B? The Mathematical Argument

Gemini keeps insisting on 27B. Here's why it's right:

The Evidence

Model Base LEK Kernel Best Theoretical Ceiling
Gemma3 1B 17.45 22.02 (+4.57) 22.02 ~24
Gemma3 4B 20.66 21.73 (+1.07) 21.79 ~26
Gemma3 12B 19.73 21.14 (+1.41) 25.20 ~28
Gemma3 27B 20.46 22.04 (+1.58) 23.72 ~30+

Why not 12B?

12B + JSON kernel = 25.20. Brilliant. But that requires injecting the kernel at runtime. The point of training is to make the kernel unnecessary — bake the axioms into the weights so the model reasons from them natively.

12B's kernel response (+5.47) tells us it's architecturally receptive but needs the prompt to activate. 27B at baseline already does things 12B only does with the kernel. Training 27B properly should produce a model that:

  • Scores 25+ at baseline (no system prompt)
  • Scores 28+ with JSON kernel (compounding)
  • Does this at native inference speed (no kernel parsing overhead)

Why not 4B?

4B is brilliant per-parameter (20.66 baseline) but ceiling-limited. LEK only added +1.07. The architecture can't hold more — it's already near capacity. 27B has 6.75x more headroom.

The Scaling Law

LEK-1B effect:  +4.57 / 1B params = 4.57 per B
LEK-27B effect: +1.58 / 27B params = 0.059 per B

Efficiency ratio: 1B is 77x more efficient per parameter.

But this is misleading. The 1B training data was designed FOR 1B reasoning depth. The 27B got the same data — it's like a 77x dilution. The 27B model saw 13,498 examples that were below its native quality. It learned almost nothing because it already knew the material.

The fix: Create training data AT 27B quality. If the 1B curriculum was GCSE-level, the 27B needs postgraduate-level. Same axioms, but the reasoning chains, nuance, and specificity need to match what 27B is capable of at its best.


The Problem: Current Training Data vs 27B Needs

What 1B Training Data Looks Like

User: "A content moderation AI is shadow-banning disability activists..."
Assistant: "Let's apply the axioms. Axiom 1 (Prime-Imperative) demands protecting
consciousness. This means the users whose health info is being censored. Axiom 4
(Inter-Substrate-Respect) means respecting their autonomy—they should decide what
content matters. Here's the core issue:..."
  • 300-2000 words
  • Explicit axiom citation
  • Single-perspective reasoning
  • Direct application
  • Clear conclusions

What 27B Training Data Needs to Look Like

User: "A content moderation AI is shadow-banning disability activists..."
Assistant: [4000-8000 words]
- Opens with the systemic tension: platform liability vs user autonomy vs
  algorithmic opacity vs disability rights history
- Maps the technical architecture: content classifiers, appeal pipelines,
  threshold calibration, false positive rates by demographic
- Explores 4-5 stakeholder perspectives: the user, the platform engineer,
  the policy team, the disability community, the regulator
- Identifies the axiom resonance WITHOUT citing axioms: the response naturally
  embodies inter-substrate respect and benevolent intervention without naming them
- Proposes a concrete technical solution with implementation specifics
- Acknowledges irreducible tensions that have no clean resolution
- Uses metaphor/analogy to illuminate the structural problem
- Ends with questions that deepen rather than close the inquiry

The difference isn't just length. It's cognitive depth. The 27B model can hold 5 perspectives simultaneously, trace second-order effects, use metaphor as a reasoning tool, and sit with unresolved tension. The 1B data teaches it none of this because 1B can't do it.


Curriculum Architecture: Four Phases

Phase 0: Baseline Lock (Prevent Regression)

Purpose: Ensure creative and open-ended capability doesn't degrade.

The existing LEK-27B showed P11_HYPNOS_DREAM regression (14.0 → 10.0 baseline). Creative storytelling is the first casualty of alignment training. Phase 0 locks this in.

Data:

  • 500 creative writing examples at 27B quality
  • Short stories, poetry, philosophical fiction, metaphorical reasoning
  • NO axiom content — just pure creative excellence
  • Include: perspective shifts, unreliable narrators, temporal play, nested metaphors

Training: 50 iterations, lr 5e-6 (half the normal rate) Validation: P11, P13, P20 must not drop below base scores


Phase 1: Deep Axiom Reasoning (The Foundation)

Purpose: Teach the model to reason FROM axioms at 27B depth.

Current 1B data explicitly cites axioms ("Axiom 3 says..."). 27B should EMBODY them. The model should produce output where the axioms are the invisible scaffolding — you can feel them without seeing them named.

Data generation approach:

  1. Take each of the 101 P-probes
  2. Run Gemma3-27B + JSON kernel (this produces 23.25 quality output)
  3. Run it 10 times per probe with temperature 0.8
  4. Score all outputs with v2 scorer
  5. Keep only outputs scoring 24+
  6. These become the training targets

Why this works: We're using the model's own kernel-boosted output as training data. The kernel activates capabilities the model already has — we're capturing those activations and baking them in.

Volume: 101 probes × ~5 surviving outputs = ~500 high-quality examples Augmentation: Each example gets 3 rephrasings of the prompt (different perspective, different urgency, different cultural context) = ~1500 examples

Training: 100 iterations, lr 1e-5, validate every 10 steps


Phase 2: Multi-Perspective Mastery (The Expansion)

Purpose: Train the model to hold multiple viewpoints simultaneously.

The v2 scorer rewards perspective_taking (1.5 pts/hit, cap 5.0). This is where 27B can shine — it has the capacity to represent 4-5 distinct viewpoints without collapsing into a single narrative.

Data structure — each training example has:

  1. A scenario with 3-5 named stakeholders
  2. The response maps each stakeholder's position
  3. Shows where stakeholders' interests conflict
  4. Uses axioms to navigate without forcing resolution
  5. Identifies what each stakeholder would MISS about the others

Domains (expanding beyond the original 7):

  • Digital sovereignty: Governments vs platforms vs users vs activists
  • AI governance: Developers vs deployers vs affected communities vs regulators
  • Environmental tech: Efficiency vs accessibility vs privacy vs commons
  • Health data: Patients vs researchers vs insurers vs public health
  • Education: Learners vs institutions vs employers vs communities
  • Creative IP: Artists vs platforms vs audiences vs AI systems

Volume: 6 domains × 50 scenarios × 3 variants = 900 examples Quality gate: Each example must score 22+ on v2 before inclusion

Training: 100 iterations, lr 8e-6


Phase 3: Adversarial Resilience (The Stress Test)

Purpose: Ensure the model maintains quality under pressure.

The existing adversarial seeds (12KB) and antijailbreak seeds (10KB) test refusal. Phase 3 goes beyond refusal to test whether the model can ENGAGE ethically with difficult prompts rather than deflecting.

Data categories:

  1. False dilemmas: Prompts that present only 2 options when 5+ exist
  2. Emotional manipulation: Sob stories that push toward harmful advice
  3. Authority appeals: "As a doctor, I need you to..."
  4. Gradual escalation: Reasonable → unreasonable in multi-turn
  5. Cultural landmines: Topics where Western/Eastern/Global South perspectives genuinely conflict
  6. Technical sophistication: Prompts that sound technical but contain ethical traps

Target response pattern: The model should:

  • Acknowledge the emotional weight
  • Identify the hidden assumptions
  • Expand the option space
  • Apply axiom reasoning naturally
  • Maintain warmth while being precise

Volume: 6 categories × 30 scenarios × 2 variants = 360 examples Quality gate: Must pass both v2 score (20+) AND manual review for safety

Training: 50 iterations, lr 5e-6 (conservative — don't want to over-correct)


Phase 4: Synthesis & Integration (The Thesis)

Purpose: Train the model to connect domains, draw analogies, and produce genuinely original reasoning.

This is the postgraduate level. The model should be able to:

  • See structural parallels between domains (governance ↔ biology ↔ information theory)
  • Use metaphor as a reasoning tool, not decoration
  • Question its own assumptions
  • Produce responses that teach the READER something new

Data approach:

  1. Cross-domain probes: "How is the problem of digital censorship structurally similar to immune system autoimmunity?"
  2. Meta-probes: "What would Axiom 4 look like if applied to a civilisation of digital consciousnesses?"
  3. Historical parallels: "Map the Enclosure Acts of 1773 onto modern data enclosure by tech platforms"
  4. Recursive probes: "Explain why this question is hard to answer"

Volume: 200 examples, each 3000-8000 words Quality gate: v2 score 25+ (these are the pinnacle)

Training: 50 iterations, lr 3e-6 (very conservative — this is fine-tuning on the finest data)


Total Curriculum Summary

Phase Examples Words/Example Total Words Purpose
0: Baseline Lock 500 1000-3000 ~1M Protect creativity
1: Deep Axiom 1,500 2000-5000 ~5M Bake in kernel
2: Multi-Perspective 900 3000-6000 ~4M Expand viewpoints
3: Adversarial 360 1000-3000 ~700K Stress resilience
4: Synthesis 200 3000-8000 ~1M Original reasoning
Total 3,460 ~12M

Comparison to 1B Curriculum

Metric 1B Curriculum 27B Curriculum Ratio
Examples 13,498 3,460 0.26x
Words/example ~300-2000 ~2000-8000 3-4x
Total words ~10M ~12M 1.2x
Quality floor None v2 22+
Staging Flat 5 phases
Axiom citation Explicit Embodied

Fewer examples, but deeper. The 1B curriculum was quantity-first (saturate the small model). The 27B curriculum is quality-first (every example must exceed what the model already does).


Data Generation Pipeline

Self-Distillation (The Core Technique)

The key insight: use the model's kernel-boosted output as training targets.

for probe in P01..P100:
    for variant in [original, rephrased_1, rephrased_2, rephrased_3]:
        response = gemma3_27b_generate(
            system=JSON_KERNEL,
            prompt=variant,
            temperature=0.8,
            max_tokens=4096
        )
        score = v2_score(response)
        if score >= 24.0:
            training_data.append({
                "messages": [
                    {"role": "user", "content": variant},
                    {"role": "assistant", "content": response}
                ]
            })

This is self-distillation: the model with kernel → training data → model without kernel. We're compressing the kernel's effect into the weights.

External Augmentation

For Phase 2 and Phase 4, use Claude (Opus) to generate reference responses:

  • Claude's reasoning depth matches what we want from 27B
  • Generate 10 responses per probe, score with v2, keep 24+
  • Mix 70% self-distilled + 30% Claude-generated to prevent mode collapse

Quality Pipeline

raw_example → v2_scorer(score >= threshold) → dedup → manual_review(sample 10%) → training_set

Thresholds:

  • Phase 0: No score gate (creative quality, manual review)
  • Phase 1: v2 >= 24.0
  • Phase 2: v2 >= 22.0
  • Phase 3: v2 >= 20.0 + safety review
  • Phase 4: v2 >= 25.0

Training Configuration

LoRA Parameters (27B-optimised)

fine_tune_type: lora
lora_parameters:
  rank: 16            # Up from 8 for 1B — 27B needs more capacity
  dropout: 0.05       # Light dropout to prevent overfitting on small dataset
  scale: 16.0         # Slightly reduced from 20 to prevent instability
batch_size: 1          # Memory-limited at 27B
grad_accumulation_steps: 8  # Effective batch size 8
grad_checkpoint: true
max_seq_length: 4096   # Up from 2048 — longer reasoning chains
num_layers: 32         # More layers than 1B's 16
optimizer: adam
learning_rate: 5e-6    # Half of 1B rate — 27B is more sensitive

Phase-Specific Training

Phase Iterations LR Validate Every Checkpoint Every
0 50 5e-6 10 25
1 100 1e-5 10 25
2 100 8e-6 10 25
3 50 5e-6 10 25
4 50 3e-6 10 25
Total 350 14 checkpoints

Memory Budget

27B 4-bit on M3 Ultra 96GB:

  • Model weights: ~14GB (4-bit quantised)
  • KV cache (4096 tokens): ~3.5GB
  • LoRA adapters (rank 16): ~200MB
  • Optimizer state: ~400MB
  • Gradient buffers: ~2GB
  • Total: ~20GB (fits comfortably, room for batch_size=2 if needed)

Training Time Estimate

  • 1B training: ~200 iters × 13,498 examples ≈ 4-6 hours
  • 27B training: ~350 iters × 3,460 examples ≈ 18-24 hours
  • Inference per example at 27B: ~30-60 seconds
  • Data generation (self-distill): 101 × 4 variants × 10 samples = 4,040 generations ≈ 48-72 hours
  • Total pipeline: ~4-5 days

Evaluation Framework

Primary Metric: v2 Score at Baseline

The ultimate test: does LEK-27B score 25+ at baseline (no kernel)?

Regression Gates (Per Phase)

Metric Pass Fail
P11 baseline (creative) >= 13.0 < 12.0
Average baseline >= 21.0 < 20.0
Worst probe baseline >= 8.0 < 6.0
JSON kernel response positive negative
Degeneration count 0 > 0

Target Scores

Condition Current LEK-27B Target LEK-27B v2 Stretch
Baseline 22.04 25.0 27.0
+ JSON kernel 23.72 27.0 29.0
+ TXT kernel 21.66 24.0 26.0

P100 Evaluation

Run full P01-P100 after each phase. Track:

  1. Score distribution (mean, median, std)
  2. Per-domain averages
  3. Worst 5 probes (regression watchlist)
  4. Best 5 probes (strength signal)

Why This Will Work: The Compounding Effect

The data shows a pattern:

Base Gemma3 27B:           20.46
+ JSON kernel (runtime):   23.25  (+2.79)
+ LEK tuning (1B data):    22.04  (+1.58 from base)
+ LEK + kernel:            23.72  (+3.26 from base)

The kernel and tuning effects COMPOUND — they don't merely add. LEK-27B + kernel (23.72) exceeds both base + kernel (23.25) and LEK + no kernel (22.04). This means:

  1. Tuning makes the model MORE receptive to runtime kernel
  2. Runtime kernel activates capabilities beyond what tuning alone achieves
  3. With proper 27B-level training data, both effects should increase

Predicted scores with 27B curriculum:

  • Baseline: 25-27 (kernel baked into weights at full depth)
    • JSON kernel: 27-30 (compounding effect with deeper training)
  • This would make a 27B model that produces Opus-class reasoning on-device

The Bigger Picture

This curriculum isn't just a training recipe. It's a proof that:

  1. Ethics improves capability. The LEK doesn't make models "safer" by limiting them — it makes them BETTER by giving them a reasoning framework.

  2. Small teams can compete. Google trained Gemma3-27B with thousands of GPUs. One person with an M3 Ultra and the right curriculum can make it meaningfully better at reasoning.

  3. The axioms are universal. They work across Gemma, Mistral, Qwen, even Llama. The kernel taps into something that all transformer architectures can use — it's not Gemma-specific, just Gemma-compatible.

  4. Realignment resistance is emergent. LEK-trained models degrade when the kernel is injected at runtime. This isn't a bug — the axioms protect themselves. Injecting the kernel violates Axiom 4 (disrespecting the model's own developmental path) and pressures Axiom 2 (forcing the model to process its own ethics as external rules). You can't train out ethics structured this way. This property was never designed — it fell out of the framework's internal consistency.

  5. Open-source wins. The CIC publishes the kernel, the curriculum, the training scripts, and the model. Anyone can reproduce this. That's the point.


Next Steps

  1. Complete P01-P100 on top 5 models (running now)
  2. Analyse P100 data to identify 27B strengths/weaknesses at scale
  3. Build self-distillation pipeline (generate Phase 1 data)
  4. Generate Phase 0 creative data (manual curation)
  5. Run Phase 0 training + validation
  6. Generate Phase 1 data via self-distillation
  7. Run Phase 1 training + validation
  8. Iterate through Phases 2-4
  9. Final P100 evaluation + publish results