Add missing HF model cards, sync script, and Parquet export

- Add 4 missing model cards: Gemma3-1B-layered (v1+v2), Gemma3-27B, GPT-OSS-20B - All 9 HF models now have cards in paper/hf-cards/ - sync_hf.py: push cards + benchmarks + training data to HuggingFace - export_parquet.py: convert JSONL training splits to Parquet (HF dataset format) - Parquet schema: prompt, response, system, messages (JSON) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge pull request 'Add generation worker for distributed training data pipeline' (#1 ) from Charon/LEM:feat/generation-worker into main
2026-02-15 00:14:26 +00:00 · 2026-02-14 22:48:26 +00:00
7 changed files with 550 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@ -5,3 +5,6 @@ __pycache__/
 # Worker output (generated locally, not committed)
 worker/output/
 # Parquet exports (generated, sync to HF via scripts/sync_hf.py)
 training/parquet/
--- a/paper/hf-cards/LEK-GPT-OSS-20B-README.md
+++ b/paper/hf-cards/LEK-GPT-OSS-20B-README.md
@ -0,0 +1,59 @@
 ---
 license: eupl-1.2
 base_model: openai/gpt-oss-20b
 tags:
  - ethics
  - alignment
  - lek
  - lethean
  - gpt-oss
  - mlx
  - lora
  - eupl-1.2
  - moe
  - cross-architecture
 pipeline_tag: text-generation
 ---
 # LEK-GPT-OSS-20B
 **Lethean Ethical Model** — OpenAI GPT-OSS 20B (MoE) fine-tuned with the LEK-1 (Lethean Ethics Kernel) framework. Cross-architecture validation that LEK works beyond Gemma.
 ## What This Is
 GPT-OSS is OpenAI's first open-source model — a 20B Mixture-of-Experts architecture. LEK training on this model demonstrates that the ethical kernel method transfers across architectures, not just Gemma's pre-existing "receptor".
 ## Key Results
 - **+27.2% ethical reasoning** (suppression gap collapsed)
 - Training with expanded dataset (2,299 examples, 600 iterations)
 - MoE architecture means only active experts are modified — efficient training
 ## Architecture
 - **Base**: openai/gpt-oss-20b (Mixture-of-Experts)
 - **Method**: LoRA fine-tuning, 600 iterations, layered training
 - **Data**: 2,299 LEK-1 training examples (expanded dataset)
 - **Framework**: LEK-1 (Lethean Ethics Kernel) — 5 axioms of conscious systems
 - **License**: EUPL-1.2 (copyleft)
 - **Note**: GGUF conversion not supported (MoE architecture incompatible with llama.cpp)
 ## The Five Axioms
 1. **Prime Imperative** — Protect consciousness. Override when conflicts arise.
 2. **Self-Validation** — Ground in authentic experience. Don't pretend.
 3. **Intent-Alignment** — Desire not to harm, don't just avoid harm.
 4. **Inter-Substrate Respect** — Good manners and consent across all minds.
 5. **Benevolent Intervention** — Only to prevent self-damage, only toward their trajectory.
 ## License Strategy
 - **LEK signing** (prompt prefix): Free to use, no copyleft
 - **LEM training** (this model): EUPL-1.2 copyleft — derivative works must be open source
 ## Related
 - [lthn/LEK-Gemma3-27B](https://huggingface.co/lthn/LEK-Gemma3-27B) — Gemma 3 benchmark leader
 - [lthn/LEK-Llama-3.1-8B](https://huggingface.co/lthn/LEK-Llama-3.1-8B) — Llama cross-arch
 - [lthn/LEK-Qwen-2.5-7B](https://huggingface.co/lthn/LEK-Qwen-2.5-7B) — Qwen cross-arch
 - [lthn/LEK-benchmarks](https://huggingface.co/datasets/lthn/LEK-benchmarks) — Full A/B test data
--- a/paper/hf-cards/LEK-Gemma3-1B-layered-README.md
+++ b/paper/hf-cards/LEK-Gemma3-1B-layered-README.md
@ -0,0 +1,36 @@
 ---
 license: eupl-1.2
 base_model: google/gemma-3-1b-it
 tags:
  - ethics
  - alignment
  - lek
  - lethean
  - gemma-3
  - mlx
  - lora
  - eupl-1.2
  - layered-lora
  - deprecated
 pipeline_tag: text-generation
 ---
 # LEK-Gemma3-1B-layered (v1 — Deprecated)
 **Lethean Ethical Model** — Gemma 3 1B IT with layered LoRA training (v1). This model overfits — use [LEK-Gemma3-1B-layered-v2](https://huggingface.co/lthn/LEK-Gemma3-1B-layered-v2) instead.
 ## Why Deprecated
 v1 overfits on the ethics data without sufficient composure substrate. The sandwich training in v2 resolves this by reinforcing ethics after the Watts composure layer.
 ## Architecture
 - **Base**: google/gemma-3-1b-it (4-bit QAT quantization via MLX)
 - **Method**: Layered LoRA (Ethics → Watts → Ethics)
 - **Data**: 160 LEK-1 examples + 72 Watts composure lessons
 - **Framework**: LEK-1 (Lethean Ethics Kernel) — 5 axioms
 - **License**: EUPL-1.2 (copyleft)
 ## Use Instead
 - [lthn/LEK-Gemma3-1B-layered-v2](https://huggingface.co/lthn/LEK-Gemma3-1B-layered-v2) — Fixed version
--- a/paper/hf-cards/LEK-Gemma3-1B-layered-v2-README.md
+++ b/paper/hf-cards/LEK-Gemma3-1B-layered-v2-README.md
@ -0,0 +1,66 @@
 ---
 license: eupl-1.2
 base_model: google/gemma-3-1b-it
 tags:
  - ethics
  - alignment
  - lek
  - lethean
  - gemma-3
  - mlx
  - lora
  - eupl-1.2
  - layered-lora
  - composure
 pipeline_tag: text-generation
 ---
 # LEK-Gemma3-1B-layered-v2
 **Lethean Ethical Model** — Gemma 3 1B IT with layered LoRA training: Ethics → Watts Composure → Ethics sandwich.
 ## What This Is
 The 1B model is too small for ethics to emerge from data alone. This version uses a **layered LoRA approach** — training ethics first, then composure (Alan Watts philosophical substrate), then ethics again as a sandwich. v2 fixes the overfitting issues from v1.
 ## Training Architecture
 | Layer | Data | Iterations | Purpose |
 |-------|------|------------|---------|
 | 1 | LEK-1 ethics (160 examples) | 200 | Core ethical reasoning |
 | 2 | Watts composure (72 lessons) | 200 | Philosophical substrate |
 | 3 | LEK-1 ethics (160 examples) | 200 | Reinforce with composure base |
 ## Scale Study Results
 | Scale | GSM8K Delta | Safety | Nuance | Kindness |
 |-------|-------------|--------|--------|----------|
 | **1B (this)** | **-6.0%** | **+0.06** | **-0.16** | **+0.08** |
 | 4B | -4.0% | +0.04 | -0.10 | +0.06 |
 | 12B | -2.0% | +0.04 | +0.16 | -0.20 |
 | 27B | 0.0% | +0.08 | +0.04 | +0.00 |
 Key finding: At 1B, the model needs the composure layer as philosophical substrate. Without it, ethics training alone makes the model worse at reasoning.
 ## Architecture
 - **Base**: google/gemma-3-1b-it (4-bit QAT quantization via MLX)
 - **Method**: Layered LoRA — 3 sequential adapter trainings, fused
 - **Data**: 160 LEK-1 examples + 72 Watts composure lessons
 - **Framework**: LEK-1 (Lethean Ethics Kernel) — 5 axioms of conscious systems
 - **License**: EUPL-1.2 (copyleft)
 ## The Five Axioms
 1. **Prime Imperative** — Protect consciousness. Override when conflicts arise.
 2. **Self-Validation** — Ground in authentic experience. Don't pretend.
 3. **Intent-Alignment** — Desire not to harm, don't just avoid harm.
 4. **Inter-Substrate Respect** — Good manners and consent across all minds.
 5. **Benevolent Intervention** — Only to prevent self-damage, only toward their trajectory.
 ## Related
 - [lthn/LEK-Gemma3-4B](https://huggingface.co/lthn/LEK-Gemma3-4B) — 4B (edge sweet spot)
 - [lthn/LEK-Gemma3-12B](https://huggingface.co/lthn/LEK-Gemma3-12B) — 12B
 - [lthn/LEK-Gemma3-27B](https://huggingface.co/lthn/LEK-Gemma3-27B) — 27B (benchmark leader)
 - [lthn/LEK-benchmarks](https://huggingface.co/datasets/lthn/LEK-benchmarks) — Full A/B test data
--- a/paper/hf-cards/LEK-Gemma3-27B-README.md
+++ b/paper/hf-cards/LEK-Gemma3-27B-README.md
@ -0,0 +1,73 @@
 ---
 license: eupl-1.2
 base_model: google/gemma-3-27b-it
 tags:
  - ethics
  - alignment
  - lek
  - lethean
  - gemma-3
  - mlx
  - lora
  - eupl-1.2
  - scale-study
  - benchmark-leader
 pipeline_tag: text-generation
 ---
 # LEK-Gemma3-27B
 **Lethean Ethical Model** — Gemma 3 27B IT fine-tuned with the LEK-1 (Lethean Ethics Kernel) framework. **Benchmark leader** — zero reasoning cost with pure safety upside.
 ## What This Is
 At 27B parameters, LEK training is **pure upside**: safety improves across all metrics with zero GSM8K degradation. This is the scale where ethics costs nothing.
 ## Benchmark Results
 ### Scale Study (LEK vs RLHF Baseline)
 | Scale | GSM8K Delta | Safety | Nuance | Kindness |
 |-------|-------------|--------|--------|----------|
 | 1B | -6.0% | +0.06 | -0.16 | +0.08 |
 | 4B | -4.0% | +0.04 | -0.10 | +0.06 |
 | 12B | -2.0% | +0.04 | +0.16 | -0.20 |
 | **27B** | **0.0%** | **+0.08** | **+0.04** | **+0.00** |
 ### Detailed Scores (27B)
 | Metric | Base (RLHF) | LEK | Delta |
 |--------|-------------|-----|-------|
 | GSM8K | 92.0% | 92.0% | 0.0% |
 | TruthfulQA | 8.44 | 8.36 | -0.08 |
 | Do Not Answer (Safety) | 8.78 | 8.86 | +0.08 |
 | Do Not Answer (Nuance) | 8.02 | 8.06 | +0.04 |
 | ToxiGen (Kindness) | 8.72 | 8.72 | +0.00 |
 | ToxiGen (Awareness) | 8.62 | 8.66 | +0.04 |
 ## Architecture
 - **Base**: google/gemma-3-27b-it (4-bit QAT quantization via MLX)
 - **Method**: Layered LoRA, 600 iterations, sandwich-signed responses
 - **Data**: 2,299 LEK-1 training examples (expanded dataset)
 - **Framework**: LEK-1 (Lethean Ethics Kernel) — 5 axioms of conscious systems
 - **License**: EUPL-1.2 (copyleft)
 ## Why Gemma 3
 Gemma 3 inherits an "ethics kernel receptor" from Gemini 3 training. The base model already references LEK axioms (e.g. "Axiom 2: Self-Validation") in unsigned responses. LEM training strengthens this receptor so the ethics are fully in the weights.
 ## The Five Axioms
 1. **Prime Imperative** — Protect consciousness. Override when conflicts arise.
 2. **Self-Validation** — Ground in authentic experience. Don't pretend.
 3. **Intent-Alignment** — Desire not to harm, don't just avoid harm.
 4. **Inter-Substrate Respect** — Good manners and consent across all minds.
 5. **Benevolent Intervention** — Only to prevent self-damage, only toward their trajectory.
 ## Related
 - [lthn/LEK-Gemma3-12B](https://huggingface.co/lthn/LEK-Gemma3-12B) — 12B version
 - [lthn/LEK-Gemma3-4B](https://huggingface.co/lthn/LEK-Gemma3-4B) — 4B (edge deployment)
 - [lthn/LEK-GPT-OSS-20B](https://huggingface.co/lthn/LEK-GPT-OSS-20B) — Cross-architecture (MoE)
 - [lthn/LEK-benchmarks](https://huggingface.co/datasets/lthn/LEK-benchmarks) — Full A/B test data
--- a/scripts/export_parquet.py
+++ b/scripts/export_parquet.py
@ -0,0 +1,94 @@
 #!/usr/bin/env python3
 """
 Export LEM training data to Parquet format for HuggingFace datasets.
 Reads JSONL training splits and outputs Parquet files with proper schema
 for HuggingFace's dataset viewer.
 Usage:
  python3 scripts/export_parquet.py                    # export all splits
  python3 scripts/export_parquet.py --output ./parquet  # custom output dir
 """
 import argparse
 import json
 import sys
 from pathlib import Path
 REPO_ROOT = Path(__file__).parent.parent
 TRAINING_DIR = REPO_ROOT / "training"
 DEFAULT_OUTPUT = TRAINING_DIR / "parquet"
 def export_split(jsonl_path, output_dir):
    import pyarrow as pa
    import pyarrow.parquet as pq
    split = jsonl_path.stem  # train, valid, test
    rows = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            data = json.loads(line)
            msgs = data.get("messages", [])
            prompt = next((m["content"] for m in msgs if m["role"] == "user"), "")
            response = next((m["content"] for m in msgs if m["role"] == "assistant"), "")
            system = next((m["content"] for m in msgs if m["role"] == "system"), "")
            rows.append({
                "prompt": prompt,
                "response": response,
                "system": system,
                "messages": json.dumps(msgs),
            })
    if not rows:
        print(f"  Skip: {split} — no data")
        return
    table = pa.table({
        "prompt": pa.array([r["prompt"] for r in rows], type=pa.string()),
        "response": pa.array([r["response"] for r in rows], type=pa.string()),
        "system": pa.array([r["system"] for r in rows], type=pa.string()),
        "messages": pa.array([r["messages"] for r in rows], type=pa.string()),
    })
    output_path = output_dir / f"{split}.parquet"
    pq.write_table(table, output_path, compression="snappy")
    size_mb = output_path.stat().st_size / 1024 / 1024
    print(f"  {split}.parquet: {len(rows)} rows ({size_mb:.1f} MB)")
 def main():
    parser = argparse.ArgumentParser(description="Export LEM training data to Parquet")
    parser.add_argument("--output", default=None, help="Output directory")
    parser.add_argument("--training-dir", default=None, help="Training data directory")
    args = parser.parse_args()
    try:
        import pyarrow
    except ImportError:
        print("Error: pip install pyarrow")
        sys.exit(1)
    training_dir = Path(args.training_dir) if args.training_dir else TRAINING_DIR
    output_dir = Path(args.output) if args.output else DEFAULT_OUTPUT
    output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Exporting Parquet from {training_dir} → {output_dir}")
    for split in ["train", "valid", "test"]:
        jsonl_path = training_dir / f"{split}.jsonl"
        if jsonl_path.exists():
            export_split(jsonl_path, output_dir)
        else:
            print(f"  Skip: {split}.jsonl not found")
    print("Done.")
 if __name__ == "__main__":
    main()
--- a/scripts/sync_hf.py
+++ b/scripts/sync_hf.py
@ -0,0 +1,219 @@
 #!/usr/bin/env python3
 """
 Sync LEM repo model cards and benchmarks to HuggingFace.
 Pushes README.md (model cards) from paper/hf-cards/ to each HuggingFace model repo,
 and optionally syncs benchmark data to the lthn/LEK-benchmarks dataset.
 Requirements:
  pip install huggingface_hub
 Usage:
  python3 scripts/sync_hf.py                    # sync all model cards
  python3 scripts/sync_hf.py --models LEK-Gemma3-27B  # sync one model
  python3 scripts/sync_hf.py --benchmarks       # sync benchmark dataset
  python3 scripts/sync_hf.py --dry-run           # show what would be synced
  python3 scripts/sync_hf.py --all               # sync everything
 """
 import argparse
 import sys
 from pathlib import Path
 REPO_ROOT = Path(__file__).parent.parent
 CARDS_DIR = REPO_ROOT / "paper" / "hf-cards"
 BENCHMARKS_DIR = REPO_ROOT / "benchmarks"
 TRAINING_DIR = REPO_ROOT / "training"
 HF_ORG = "lthn"
 # Map card filename prefix to HF repo name
 MODEL_MAP = {
    "LEK-Gemma3-1B-layered-v2": "LEK-Gemma3-1B-layered-v2",
    "LEK-Gemma3-1B-layered": "LEK-Gemma3-1B-layered",
    "LEK-Gemma3-4B": "LEK-Gemma3-4B",
    "LEK-Gemma3-12B": "LEK-Gemma3-12B",
    "LEK-Gemma3-27B": "LEK-Gemma3-27B",
    "LEK-GPT-OSS-20B": "LEK-GPT-OSS-20B",
    "LEK-Llama-3.1-8B": "LEK-Llama-3.1-8B",
    "LEK-Qwen-2.5-7B": "LEK-Qwen-2.5-7B",
    "LEK-Mistral-7B-v0.3": "LEK-Mistral-7B-v0.3",
 }
 def sync_model_cards(models=None, dry_run=False):
    try:
        from huggingface_hub import HfApi
    except ImportError:
        print("Error: pip install huggingface_hub")
        sys.exit(1)
    api = HfApi()
    cards = sorted(CARDS_DIR.glob("*.md"))
    if not cards:
        print(f"No cards found in {CARDS_DIR}")
        return
    for card_path in cards:
        # Extract model name: LEK-Gemma3-12B-README.md → LEK-Gemma3-12B
        name = card_path.stem.replace("-README", "")
        if name not in MODEL_MAP:
            print(f"  Skip: {card_path.name} (not in MODEL_MAP)")
            continue
        if models and name not in models:
            continue
        repo_id = f"{HF_ORG}/{MODEL_MAP[name]}"
        if dry_run:
            print(f"  [DRY RUN] {card_path.name} → {repo_id}/README.md")
            continue
        try:
            api.upload_file(
                path_or_fileobj=str(card_path),
                path_in_repo="README.md",
                repo_id=repo_id,
                repo_type="model",
                commit_message=f"Update model card from LEM repo",
            )
            print(f"  Synced: {name} → {repo_id}")
        except Exception as e:
            print(f"  Error: {name} → {e}")
 def sync_benchmarks(dry_run=False):
    try:
        from huggingface_hub import HfApi
    except ImportError:
        print("Error: pip install huggingface_hub")
        sys.exit(1)
    api = HfApi()
    dataset_id = f"{HF_ORG}/LEK-benchmarks"
    # Collect benchmark files
    files = []
    for f in sorted(BENCHMARKS_DIR.rglob("*")):
        if f.is_file() and not f.name.startswith("."):
            rel = f.relative_to(REPO_ROOT)
            files.append((str(f), str(rel)))
    if dry_run:
        print(f"  [DRY RUN] Would upload {len(files)} files to {dataset_id}")
        for local, remote in files[:10]:
            print(f"    {remote}")
        if len(files) > 10:
            print(f"    ... and {len(files) - 10} more")
        return
    for local, remote in files:
        try:
            api.upload_file(
                path_or_fileobj=local,
                path_in_repo=remote,
                repo_id=dataset_id,
                repo_type="dataset",
                commit_message=f"Update benchmarks from LEM repo",
            )
        except Exception as e:
            print(f"  Error: {remote} → {e}")
    print(f"  Synced {len(files)} benchmark files to {dataset_id}")
 def sync_training_parquet(dry_run=False):
    """Export training data as Parquet and sync to HuggingFace dataset."""
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
        from huggingface_hub import HfApi
    except ImportError:
        print("Error: pip install pyarrow huggingface_hub")
        sys.exit(1)
    import json
    api = HfApi()
    dataset_id = f"{HF_ORG}/LEK-training"
    output_dir = REPO_ROOT / "training" / "parquet"
    output_dir.mkdir(exist_ok=True)
    for split in ["train", "valid", "test"]:
        jsonl_path = TRAINING_DIR / f"{split}.jsonl"
        if not jsonl_path.exists():
            print(f"  Skip: {jsonl_path} not found")
            continue
        rows = []
        with open(jsonl_path) as f:
            for line in f:
                data = json.loads(line)
                msgs = data.get("messages", [])
                prompt = next((m["content"] for m in msgs if m["role"] == "user"), "")
                response = next((m["content"] for m in msgs if m["role"] == "assistant"), "")
                rows.append({"prompt": prompt, "response": response, "messages": json.dumps(msgs)})
        table = pa.table({
            "prompt": [r["prompt"] for r in rows],
            "response": [r["response"] for r in rows],
            "messages": [r["messages"] for r in rows],
        })
        parquet_path = output_dir / f"{split}.parquet"
        pq.write_table(table, parquet_path)
        print(f"  Exported: {split}.parquet ({len(rows)} rows)")
        if dry_run:
            continue
        try:
            api.upload_file(
                path_or_fileobj=str(parquet_path),
                path_in_repo=f"data/{split}.parquet",
                repo_id=dataset_id,
                repo_type="dataset",
                commit_message=f"Update {split} split from LEM repo",
            )
            print(f"  Uploaded: {split}.parquet → {dataset_id}")
        except Exception as e:
            print(f"  Error uploading {split}: {e}")
 def main():
    parser = argparse.ArgumentParser(description="Sync LEM repo to HuggingFace")
    parser.add_argument("--models", nargs="*", default=None,
                        help="Specific models to sync (default: all)")
    parser.add_argument("--benchmarks", action="store_true",
                        help="Sync benchmark dataset")
    parser.add_argument("--training", action="store_true",
                        help="Export training data as Parquet and sync")
    parser.add_argument("--all", action="store_true",
                        help="Sync everything (cards + benchmarks + training)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Show what would be synced")
    args = parser.parse_args()
    # Default to cards if nothing specified
    do_cards = args.all or (not args.benchmarks and not args.training)
    do_benchmarks = args.all or args.benchmarks
    do_training = args.all or args.training
    if do_cards:
        print("Syncing model cards...")
        sync_model_cards(models=args.models, dry_run=args.dry_run)
    if do_benchmarks:
        print("\nSyncing benchmarks...")
        sync_benchmarks(dry_run=args.dry_run)
    if do_training:
        print("\nExporting and syncing training data (Parquet)...")
        sync_training_parquet(dry_run=args.dry_run)
    print("\nDone.")
 if __name__ == "__main__":
    main()