feat/data-hydration #6

Merged
Snider merged 10 commits from feat/data-hydration into main 2026-03-04 09:55:09 +00:00

Author SHA1 Message Date
Snider
f79eaabdce feat: WoRF — Word Radiance Field experiments
NeRF-inspired technique for learning relational dynamics of language.
Not what words mean, but how they behave together — rhythm, pacing,
punctuation patterns, style transitions.

v1: positional field over text (baseline, memorises)
v2: masked feature prediction (relational, actually works)

Trained on Wodehouse "My Man Jeeves" (public domain, Gutenberg).
All 11 style features are highly relational — the field learns that
Wodehouse's style is a tightly coupled system.

Key finding: style interpolation between narrative and dialogue
produces sensible predictions for unmeasured features, suggesting
the continuous field captures real structural patterns.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-03-04 09:43:38 +00:00
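
The masked feature prediction idea behind v2 can be illustrated with a small sketch of how one training sample might be built: hide a single style feature and ask the model to predict it from the rest. This is an assumption-laden illustration, not the WoRF code. The function name `maskFeature`, the zero fill value, and the example feature values are all hypothetical.

```go
package main

import "fmt"

// maskFeature builds one masked-prediction sample from a style feature vector.
// It returns a copy of the features with slot idx zeroed, a visibility mask,
// and the hidden value as the prediction target. The caller's slice is not
// modified. Zero as the fill value is an assumption made for this sketch.
func maskFeature(features []float64, idx int) (input []float64, visible []bool, target float64) {
	input = make([]float64, len(features))
	copy(input, features)
	visible = make([]bool, len(features))
	for i := range visible {
		visible[i] = true
	}
	target = features[idx]
	input[idx] = 0
	visible[idx] = false
	return input, visible, target
}

func main() {
	// Hypothetical values for three of the style features.
	features := []float64{0.2, 0.8, 0.5}
	input, visible, target := maskFeature(features, 1)
	fmt.Println(input, visible, target)
}
```

A feature that can be predicted well from the visible ones is "relational" in the sense the commit message describes; one that cannot is closer to independent.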
Snider
41d8008e69 fix: expand emotional_register to include distress, anger, fear vocabulary
The emotional register scorer only matched positive/neutral emotions
(joy, compassion, tender, etc.) and completely missed negative human
expressions (angry, furious, devastated, terrified, bleeding, screaming).

This caused a real Reddit AITA post about a distressed mother to score
emotional_register=1 despite containing "screaming in pain", "pooping
blood", and "blind rage", leading to a false ai_generated verdict.

Changes:
- Add 4 new pattern groups: distress/anger, sadness/despair, fear/anxiety,
  physical distress (~40 new vocabulary words)
- Switch from int count to weighted float64 scoring: each match in an
  intensity group (vulnerability, distress, physical) scores 1.5-2.0x,
  vs 1.0x for common emotion words
- Round to 1 decimal place, cap at 10.0
- Update tests with distress/anger/physical cases including the Reddit
  failure case from calibration findings

Co-Authored-By: Virgil <virgil@lethean.io>
2026-03-02 22:02:34 +00:00
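
The weighted scoring described in this fix (per-group weights, rounding to one decimal place, cap at 10.0) can be sketched roughly as follows. This is a minimal illustration and not the scorer from the repository. The group vocabulary, the exact weight assigned to each group, and the name `scoreEmotionalRegister` are assumptions chosen to match the ranges in the commit message.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
)

// weightedGroup pairs a vocabulary pattern with a per-match weight.
type weightedGroup struct {
	pattern *regexp.Regexp
	weight  float64
}

// Illustrative groups only: the real word lists and weights are assumptions.
var emotionGroups = []weightedGroup{
	{regexp.MustCompile(`(?i)\b(joy|compassion|tender)\b`), 1.0},       // common emotion words
	{regexp.MustCompile(`(?i)\b(angry|furious|rage|terrified)\b`), 1.5}, // distress/anger
	{regexp.MustCompile(`(?i)\b(screaming|bleeding|blood)\b`), 2.0},     // physical distress
}

// scoreEmotionalRegister sums weight * match count over all groups,
// rounds to one decimal place, and caps the result at 10.0.
func scoreEmotionalRegister(text string) float64 {
	var score float64
	for _, g := range emotionGroups {
		matches := g.pattern.FindAllString(text, -1)
		score += float64(len(matches)) * g.weight
	}
	score = math.Round(score*10) / 10
	return math.Min(score, 10.0)
}

func main() {
	fmt.Println(scoreEmotionalRegister("She was screaming in pain, in a blind rage."))
}
```

With these assumed weights, a text containing "screaming" and "rage" scores 2.0 + 1.5 = 3.5, whereas a scorer that matched only positive vocabulary would give it 0.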
Snider
54151c463b feat: upgrade BO analysis to 8D with Q/K interaction metrics (22D full vector)
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-28 12:51:04 +00:00
Snider
1b570b8229 feat: add 'lem data export-cold' for warm DuckDB -> cold JSONL.zst export
Export distill_results from DuckDB back to compressed JSONL.zst files,
completing the cold -> warm -> cold round-trip data pipeline.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-28 12:21:36 +00:00
Snider
061dd4cedd feat: wire 'lem setup --data' CLI command
Register setup group with data subcommand that hydrates cold
compressed JSONL.zst training data into warm DuckDB tables.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-28 12:20:28 +00:00
Snider
1269e70853 feat: add data hydration engine (cold JSONL.zst -> warm DuckDB -> hot InfluxDB)
RunSetup decompresses .jsonl.zst training data into DuckDB tables
(training_examples, seeds, probes, distill_results) and optionally
backfills InfluxDB with aggregate stats.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-28 12:20:00 +00:00
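
The cold-to-warm step amounts to reading one JSON object per line from a decompressed stream and handing each row to a loader. The sketch below shows only that line-by-line decode using the standard library. It is not `RunSetup`: the names `readJSONL` and `countRows` are hypothetical, and the DuckDB insert and zstd decoder are left out. In a real pipeline the `io.Reader` would presumably be a zstd decoder wrapping the `.jsonl.zst` file.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// readJSONL decodes one JSON object per non-blank line from r and passes
// each decoded row to handle. It stops at the first malformed line.
func readJSONL(r io.Reader, handle func(row map[string]any) error) error {
	sc := bufio.NewScanner(r)
	// Training rows can be long, so raise the scanner's default line limit.
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	for lineNo := 1; sc.Scan(); lineNo++ {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var row map[string]any
		if err := json.Unmarshal([]byte(line), &row); err != nil {
			return fmt.Errorf("line %d: %w", lineNo, err)
		}
		if err := handle(row); err != nil {
			return err
		}
	}
	return sc.Err()
}

// countRows is a stand-in for a table loader: it only counts decoded rows.
func countRows(data string) (int, error) {
	n := 0
	err := readJSONL(strings.NewReader(data), func(map[string]any) error {
		n++
		return nil
	})
	return n, err
}

func main() {
	n, err := countRows("{\"prompt\":\"a\"}\n\n{\"prompt\":\"b\"}\n")
	fmt.Println(n, err)
}
```

Streaming through a callback keeps memory use flat regardless of file size, which matters when the same routine feeds several tables.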
Snider
a7ff1efc3f chore: promote klauspost/compress to direct dependency
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-28 12:18:38 +00:00
Snider
1c7194176a feat: add zstd compress/decompress helpers for cold storage
Add compressFileZstd, decompressZstd, and walkZstFiles helpers
using klauspost/compress. Promote zstd from indirect to direct dep.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-28 12:18:19 +00:00
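
A directory walk that collects compressed training files could look roughly like the sketch below. The signature of the actual `walkZstFiles` helper is not shown in this PR, so `findZstFiles` and `isZstFile` are hypothetical names, and the `.jsonl.zst` suffix check is an assumption based on the file naming used in these commits. Only the standard library is used; decompression itself is not shown.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

// isZstFile reports whether name looks like a compressed JSONL file.
func isZstFile(name string) bool {
	return strings.HasSuffix(name, ".jsonl.zst")
}

// findZstFiles walks root recursively and returns the paths of all
// .jsonl.zst files, in the lexical order filepath.WalkDir visits them.
func findZstFiles(root string) ([]string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // d may be nil here, so check err first
		}
		if !d.IsDir() && isZstFile(d.Name()) {
			paths = append(paths, path)
		}
		return nil
	})
	return paths, err
}

func main() {
	paths, err := findZstFiles("training")
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```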
Snider
d998dd252b chore: zstd compress all JSONL training data (145MB → 29MB)
Compress 108 JSONL files with zstd level 3 and remove originals.
Add training/.gitignore to exclude decompressed .jsonl files.
Remove broken symlinks to previously deleted p4 duplicates.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-28 12:17:38 +00:00
Snider
631870bffc chore: remove 13 duplicate/empty JSONL files from training data
Delete exact duplicates (p4-train/p4-valid, backups, lessons-augmented)
and empty files (ethics/golden/train, 1b/lesson-0) to prepare for
zstd compression.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-28 12:16:53 +00:00
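
One way to identify exact duplicates and empty files before deleting them is to hash file contents. The sketch below works on in-memory contents to stay self-contained. It is an illustration of the idea and not the procedure actually used for this commit. The name `classifyFiles` and the sample file names are hypothetical, and a real tool would stream files from disk instead.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// classifyFiles takes file contents keyed by name and returns the names of
// empty files and of exact duplicates. Names are processed in sorted order,
// so the lexically first file of each identical group is treated as the
// original and the rest are reported as duplicates.
func classifyFiles(files map[string][]byte) (duplicates, empty []string) {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	seen := make(map[[sha256.Size]byte]struct{})
	for _, name := range names {
		data := files[name]
		if len(data) == 0 {
			empty = append(empty, name)
			continue
		}
		sum := sha256.Sum256(data)
		if _, ok := seen[sum]; ok {
			duplicates = append(duplicates, name)
			continue
		}
		seen[sum] = struct{}{}
	}
	return duplicates, empty
}

func main() {
	files := map[string][]byte{
		"train.jsonl":        []byte("{\"id\":1}\n"),
		"train-backup.jsonl": []byte("{\"id\":1}\n"),
		"lesson-0.jsonl":     {},
	}
	duplicates, empty := classifyFiles(files)
	fmt.Println("duplicates:", duplicates)
	fmt.Println("empty:", empty)
}
```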