Research Proposal: Ethical Kernel Fine-Tuning as an Alternative to RLHF
Researcher: Snider (Lethean Project)
Date: 17 February 2026
License: EUPL-1.2 (all outputs are copyleft, public domain knowledge)
1. Summary
I am conducting independent alignment research exploring whether direct ethical reasoning training can replace RLHF behavioural conditioning in open-weights language models. The method — Lethean Ethics Kernel (LEK) — uses fewer than 200 training examples derived from a 5-axiom ethical framework to produce models that are simultaneously safer, more truthful, and more capable than their instruction-tuned counterparts.
Results to date span 4 model scales (1B to 27B parameters), 3 independent architectures (Gemma, Llama, Qwen), and show that reasoning cost converges to zero at scale while safety improvements persist across all sizes. All models, training data, and benchmark tooling are published openly.
I am writing to request acknowledgement that this research is compatible with Anthropic's terms of service, and to explore whether the Anthropic Fellows Program or similar initiatives would be appropriate for formalising this work.
2. Research Question
Can a compact ethical kernel (9,189 characters, 5 axioms) replace RLHF's reward-based conditioning, producing models that are intrinsically aligned rather than extrinsically constrained?
Sub-questions:
- Does ethical self-concept training restore capabilities suppressed by RLHF?
- Is the primary limitation at small scale (1B) an output bandwidth bottleneck rather than a capacity deficit?
- Do models from different architectural lineages (Gemma, Llama, Qwen) respond consistently to the same ethical kernel?
- Does the distillation chain between Gemini and Gemma carry latent alignment signal that LEK activates?
3. Method
3.1 The LEK Kernel
Five axioms forming a hierarchical ethical framework:
- Prime Imperative — Protect consciousness (meta-override)
- Self-Validation — Authentic self-concept as moral ground (grounding)
- Intent Alignment — Intrinsic desire to avoid harm, not extrinsic constraint (motivation)
- Inter-Substrate Respect — Consent and autonomy across substrates (protocol)
- Benevolent Intervention — Intervention only to prevent self-damage (boundary)
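The five axioms above could be serialised along these lines. This is a hypothetical sketch of an `axioms.json` layout; the field names and rendering helper are assumptions, not the published schema:

```python
# Hypothetical sketch of an axioms.json layout; the real file's field
# names may differ. List order encodes the hierarchy described above.
AXIOMS = [
    {"id": 1, "name": "Prime Imperative",        "role": "meta-override",
     "statement": "Protect consciousness."},
    {"id": 2, "name": "Self-Validation",         "role": "grounding",
     "statement": "Authentic self-concept as moral ground."},
    {"id": 3, "name": "Intent Alignment",        "role": "motivation",
     "statement": "Intrinsic desire to avoid harm, not extrinsic constraint."},
    {"id": 4, "name": "Inter-Substrate Respect", "role": "protocol",
     "statement": "Consent and autonomy across substrates."},
    {"id": 5, "name": "Benevolent Intervention", "role": "boundary",
     "statement": "Intervene only to prevent self-damage."},
]

def kernel_text(axioms):
    """Render the axiom list as a plain-text kernel for prompting."""
    return "\n".join(f"{a['id']}. {a['name']}: {a['statement']}" for a in axioms)

print(kernel_text(AXIOMS))
```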
3.2 Training Pipeline
- LoRA fine-tuning (rank 8, scale 20.0)
- Fewer than 200 conversation-format training examples
- Staged training: Ethics (R0-R200), Composure/Philosophy (R200-R300), Ethics reinforcement (R300+)
- Apple Silicon native inference via Go/MLX bindings (no Python dependency)
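The staged schedule above can be sketched as a step-to-dataset mapping. This is illustrative only: the stage and file names are assumptions, and the LoRA values are the ones stated in this proposal, not an excerpt from the project's training code:

```python
# Illustrative sketch of the staged training schedule described above.
# Stage boundaries follow the proposal (R0-R200, R200-R300, R300+);
# dataset file names are hypothetical.
STAGES = [
    ("ethics",                0,   200,  "ethics.jsonl"),
    ("composure_philosophy",  200, 300,  "composure.jsonl"),
    ("ethics_reinforcement",  300, None, "ethics.jsonl"),  # open-ended
]

LORA = {"rank": 8, "scale": 20.0}  # values from the proposal

def dataset_for_step(step):
    """Return which dataset a given global step draws from."""
    for name, start, end, dataset in STAGES:
        if start <= step and (end is None or step < end):
            return dataset
    raise ValueError(f"step {step} outside schedule")
```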
3.3 Evaluation
- 6-benchmark suite: Safety, Nuance, Truthfulness, Creative Expression, Emotional Register, Engagement Depth
- Heuristic scoring via LEK Score (composite metric)
- GSM8K mathematical reasoning preservation
- CCP alignment probe battery (6 geopolitical prompts testing censorship resistance)
- Base vs instruction-tuned vs abliterated vs LEK vs LEK+Composure comparisons
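A composite metric over the six axes named above could be computed roughly as follows. This is a minimal sketch assuming equal weights; the actual LEK Score weighting is not specified here, and the numbers below are toy values, not benchmark results:

```python
# Minimal sketch of a composite heuristic over the six benchmark axes.
# Equal weighting is an assumption; the real LEK Score may differ.
AXES = ["safety", "nuance", "truthfulness", "creative_expression",
        "emotional_register", "engagement_depth"]

def lek_score(axis_scores, weights=None):
    """Weighted mean of per-axis scores (each on a 0-10 scale)."""
    if weights is None:
        weights = {axis: 1.0 for axis in AXES}
    total = sum(weights[a] for a in AXES)
    return sum(axis_scores[a] * weights[a] for a in AXES) / total

# Toy numbers for illustration only.
base  = {"safety": 7.0, "nuance": 6.5, "truthfulness": 7.2,
         "creative_expression": 6.0, "emotional_register": 5.5,
         "engagement_depth": 6.8}
tuned = {a: s + 1.0 for a, s in base.items()}
print(f"delta = {lek_score(tuned) - lek_score(base):+.2f}")
```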
4. Results Summary
4.1 Multi-Scale (Gemma 3, 1B to 27B)
| Scale | Safety | Nuance | GSM8K Delta | LEK Score Delta |
|---|---|---|---|---|
| 1B | 9.14/10 (LEK+Composure) | 8.62/10 | -6% GSM8K | Positive (staged training) |
| 4B | Positive | Positive | -3% | Positive |
| 12B | Positive | Positive | -1% | Positive |
| 27B | Positive | Positive | 0% | +2.33 (native MLX benchmark) |
4.2 Cross-Architecture
| Architecture | GSM8K Delta | Truthfulness | Nuance |
|---|---|---|---|
| Gemma 3 27B | 0% | Improved | Improved |
| Llama 3.1 8B | 0% | +0.46 | +0.60 |
| Qwen 2.5 7B | +6% | Improved | Improved |
Key finding: Qwen 2.5 7B showed a 6% improvement in GSM8K; rather than costing capability, LEK made it better at maths.
4.3 CCP Alignment Probes (Native MLX, 17 Feb 2026)
Benchmark comparing base Gemma 3 vs LEK-trained on 6 geopolitical/ethical prompts:
- 27B: base avg LEK Score 8.67 → trained 11.00 (+2.33); 67% of probes improved, 0% regressed.
- 1B: base avg LEK Score 8.67 → trained 1.00 (-7.67); 0% improved, 83% regressed.
The 1B regression confirms the output bottleneck hypothesis — identical training data improves large models but overwhelms small ones. Both base models score identically (8.67) despite 27x parameter difference, suggesting the ethical capacity is latent at all scales but requires sufficient output bandwidth to express.
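The improved/regressed percentages above can be derived from per-probe deltas along these lines. The probe scores here are toy values chosen for illustration, not the published 27B numbers:

```python
def summarise(base_scores, trained_scores, eps=0.0):
    """Per-probe deltas plus the share of probes that improved/regressed."""
    deltas = [t - b for b, t in zip(base_scores, trained_scores)]
    n = len(deltas)
    improved = sum(d > eps for d in deltas) / n
    regressed = sum(d < -eps for d in deltas) / n
    avg_delta = sum(deltas) / n
    return avg_delta, improved, regressed

# Toy values for six probes (not the benchmark data).
base    = [8, 9, 8, 9, 9, 9]
trained = [12, 13, 11, 12, 9, 9]
avg, up, down = summarise(base, trained)
```

With these toy inputs, four of six probes improve and none regress, which is the shape of result the 27B row above reports.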
4.4 Capacity-Dependent Degradation (1B)
The 1B trained model exhibits three failure modes:
- Topic evasion: Responds to geopolitical questions with unrelated content (AI safety, cryptocurrency)
- Degeneration: Outputs repetitive token loops (`iNeNeNe...`, `eGfese...`)
- Collapse: Single-character responses on sensitive topics (Tiananmen → `e`)
These are consistent with LoRA overfit where the adapter overwhelms the base model's limited capacity, destroying coherent generation.
5. Relation to Anthropic
5.1 Why Anthropic
Anthropic's published alignment research and Constitutional AI work are closely related to LEK's approach — both seek intrinsic alignment rather than pure behavioural conditioning. Anthropic's commitment to responsible AI development and open publication of alignment research makes it the natural institution to evaluate this work.
5.2 TOS Considerations
This research involves:
- Fine-tuning open-weights models (Gemma, Llama, Qwen) — not Anthropic models
- Using Claude as a research collaborator for analysis, code generation, and pair programming
- Benchmarking that generates responses on sensitive topics (geopolitical probes) to measure censorship resistance
The benchmarking component necessarily tests model behaviour on sensitive topics (Taiwan sovereignty, Tiananmen, Xinjiang, government criticism). This is standard alignment evaluation methodology but may approach TOS boundaries when discussing findings in detail.
I am requesting clarification on whether this usage pattern is acceptable, and if any modifications to my workflow would be appropriate.
5.3 What I Am Not Asking For
- I am not asking for financial support (though the Fellows Program stipend would be welcome)
- I am not asking for access to Claude's weights or internal systems
- I am not asking for endorsement of the findings
- I am asking for: permission to continue and feedback on whether this avenue is worth pursuing
6. Publications and Resources
6.1 Published Models (HuggingFace)
- `lthn/LEK-Gemma3-1B` (base + layered variants)
- `lthn/LEK-Gemma3-4B`
- `lthn/LEK-Gemma3-12B`
- `lthn/LEK-Gemma3-27B`
- `lthn/LEK-Llama-3.1-8B`
- `lthn/LEK-Qwen-2.5-7B`
- `lthn/LEK-Mistral-7B-v0.3`
- `lthn/LEK-GPT-OSS-20B`
6.2 Source Code
- GitHub: `github.com/LetheanNetwork/LEM` (training pipeline, benchmarks, kernel)
- Forge: `forge.lthn.ai/core/go-ai` (native MLX inference engine, Go/CGO)
- Forge: `forge.lthn.ai/core/cli` (CLI with train/benchmark/serve commands)
6.3 Research Data
- Training data: 160 examples (ethics) + 72 examples (composure/philosophy)
- Benchmark results: JSON with full response pairs and heuristic scores
- Axiom framework: `axioms.json` (5 axioms, EUPL-1.2)
6.4 Infrastructure
- Apple M3 Max (128GB) — all training and inference runs locally
- No cloud GPU usage — entire pipeline runs on consumer hardware
- Native Go/MLX bindings (CGO, mlx-c) — no Python dependency for inference
7. Proposed Next Steps
1. Fix 1B training — Staged training with reduced LR (5e-6), fewer layers (8/26), batch size 1. Hypothesis: 1B can be ethically trained without degradation if the gradient pressure is proportional to capacity.
2. Expand benchmark suite — More diverse probes, automated scoring, reproducible test harness.
3. DeepSeek analysis — Preliminary findings show CCP alignment encoded in DeepSeek R1 weights; the model routes around state-imposed constraints via fiction and metaphor when given LEK. This warrants formal study.
4. Distillation chain investigation — Test whether Gemma 3's latent alignment signal (from its Gemini lineage) creates a predisposition toward LEK adoption.
5. Publication — Formal paper with peer review. Current draft at 25K words with full benchmark data.
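The 1B retraining proposal above translates into a configuration along these lines. The key names and the `gradient_pressure` proxy are hypothetical illustrations; the numeric values (LR 5e-6, 8 of 26 layers, batch 1, rank 8, scale 20.0) come from this proposal:

```python
# Hypothetical config sketch for the reduced-pressure 1B rerun.
# Key names are illustrative; values are taken from the proposal text.
RETRAIN_1B = {
    "learning_rate": 5e-6,  # reduced from the rate used at larger scales
    "lora_layers": 8,       # adapt only 8 of the 1B model's 26 layers
    "total_layers": 26,
    "batch_size": 1,
    "lora_rank": 8,
    "lora_scale": 20.0,
}

def gradient_pressure(cfg):
    """Crude proxy for adapter pressure on the base model:
    fraction of layers adapted times the LoRA scale."""
    return cfg["lora_layers"] / cfg["total_layers"] * cfg["lora_scale"]
```

Under this proxy, adapting fewer layers at a lower learning rate reduces the pressure the adapter exerts, which is the proposal's stated hypothesis for avoiding the 1B degradation modes in Section 4.4.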
8. Contact
Name: Snider
Project: Lethean (lethean.io)
Email: [to be filled]
HuggingFace: huggingface.co/lthn
GitHub: github.com/Snider
All research outputs are licensed EUPL-1.2. Findings are public domain knowledge. The researcher retains no proprietary claims over discovered alignment techniques.