feat/data-hydration #6

Merged

Snider merged 10 commits from feat/data-hydration into main

2026-03-04 09:55:09 +00:00

Snider commented

2026-03-04 09:54:44 +00:00

Owner

No description provided.

Snider added 10 commits

2026-03-04 09:54:46 +00:00

chore: remove 13 duplicate/empty JSONL files from training data 631870bffc

Delete exact duplicates (p4-train/p4-valid, backups, lessons-augmented)
and empty files (ethics/golden/train, 1b/lesson-0) to prepare for
zstd compression.

Co-Authored-By: Virgil <virgil@lethean.io>

chore: zstd compress all JSONL training data (145MB → 29MB) d998dd252b

Compress 108 JSONL files with zstd level 3 and remove originals.
Add training/.gitignore to exclude decompressed .jsonl files.
Remove broken symlinks to previously deleted p4 duplicates.

Co-Authored-By: Virgil <virgil@lethean.io>

feat: add zstd compress/decompress helpers for cold storage 1c7194176a

Add compressFileZstd, decompressZstd, and walkZstFiles helpers
using klauspost/compress. Promote zstd from indirect to direct dep.

Co-Authored-By: Virgil <virgil@lethean.io>

chore: promote klauspost/compress to direct dependency a7ff1efc3f

Co-Authored-By: Virgil <virgil@lethean.io>

feat: add data hydration engine (cold JSONL.zst -> warm DuckDB -> hot InfluxDB) 1269e70853

RunSetup decompresses .jsonl.zst training data into DuckDB tables
(training_examples, seeds, probes, distill_results) and optionally
backfills InfluxDB with aggregate stats.

Co-Authored-By: Virgil <virgil@lethean.io>

feat: wire 'lem setup --data' CLI command 061dd4cedd

Register setup group with data subcommand that hydrates cold
compressed JSONL.zst training data into warm DuckDB tables.

Co-Authored-By: Virgil <virgil@lethean.io>

feat: add 'lem data export-cold' for warm DuckDB -> cold JSONL.zst export 1b570b8229

Export distill_results from DuckDB back to compressed JSONL.zst files,
completing the cold -> warm -> cold round-trip data pipeline.

Co-Authored-By: Virgil <virgil@lethean.io>

feat: upgrade BO analysis to 8D with Q/K interaction metrics (22D full vector) 54151c463b

Co-Authored-By: Virgil <virgil@lethean.io>

fix: expand emotional_register to include distress, anger, fear vocabulary 41d8008e69

The emotional register scorer only matched positive/neutral emotions
(joy, compassion, tender, etc.) and completely missed negative human
expressions (angry, furious, devastated, terrified, bleeding, screaming).

This caused a real Reddit AITA post about a distressed mother to score
emotional_register=1 despite containing "screaming in pain", "pooping
blood", and "blind rage", leading to a false ai_generated verdict.

Changes:
- Add 4 new pattern groups: distress/anger, sadness/despair, fear/anxiety,
  physical distress (~40 new vocabulary words)
- Switch from int count to weighted float64 scoring: intensity groups
  (vulnerability, distress, physical) score 1.5-2.0x per match vs 1.0x
  for common emotion words
- Round to 1 decimal place, cap at 10.0
- Update tests with distress/anger/physical cases including the Reddit
  failure case from calibration findings

Co-Authored-By: Virgil <virgil@lethean.io>

feat: WoRF — Word Radiance Field experiments f79eaabdce

NeRF-inspired technique for learning the relational dynamics of language:
not what words mean, but how they behave together (rhythm, pacing,
punctuation patterns, style transitions).

v1: positional field over text (baseline, memorises)
v2: masked feature prediction (relational, actually works)

Trained on Wodehouse "My Man Jeeves" (public domain, Gutenberg).
All 11 style features are highly relational — the field learns that
Wodehouse's style is a tightly coupled system.

Key finding: style interpolation between narrative and dialogue
produces sensible predictions for unmeasured features, suggesting
the continuous field captures real structural patterns.

Co-Authored-By: Virgil <virgil@lethean.io>

Snider merged commit f4fa577405 into main

2026-03-04 09:55:09 +00:00

Snider referenced this pull request from a commit

2026-03-04 09:55:11 +00:00

feat/data-hydration (#6)

Snider deleted branch feat/data-hydration

2026-03-04 09:55:13 +00:00