feat/data-hydration #6

Merged
Snider merged 10 commits from feat/data-hydration into main 2026-03-04 09:55:09 +00:00
Owner
No description provided.
Snider added 10 commits 2026-03-04 09:54:46 +00:00
Delete exact duplicates (p4-train/p4-valid, backups, lessons-augmented)
and empty files (ethics/golden/train, 1b/lesson-0) to prepare for
zstd compression.

Co-Authored-By: Virgil <virgil@lethean.io>
Compress 108 JSONL files with zstd level 3 and remove originals.
Add training/.gitignore to exclude decompressed .jsonl files.
Remove broken symlinks to previously deleted p4 duplicates.

Co-Authored-By: Virgil <virgil@lethean.io>
Add compressFileZstd, decompressZstd, and walkZstFiles helpers
using klauspost/compress. Promote zstd from indirect to direct dep.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
RunSetup decompresses .jsonl.zst training data into DuckDB tables
(training_examples, seeds, probes, distill_results) and optionally
backfills InfluxDB with aggregate stats.

Co-Authored-By: Virgil <virgil@lethean.io>
Register setup group with data subcommand that hydrates cold
compressed JSONL.zst training data into warm DuckDB tables.

Co-Authored-By: Virgil <virgil@lethean.io>
Export distill_results from DuckDB back to compressed JSONL.zst files,
completing the cold -> warm -> cold round-trip data pipeline.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
The emotional register scorer only matched positive/neutral emotions
(joy, compassion, tender, etc.) and completely missed negative human
expressions (angry, furious, devastated, terrified, bleeding, screaming).

This caused a real Reddit AITA post about a distressed mother to score
emotional_register=1 despite containing "screaming in pain", "pooping
blood", and "blind rage", leading to a false ai_generated verdict.

Changes:
- Add 4 new pattern groups: distress/anger, sadness/despair, fear/anxiety,
  physical distress (~40 new vocabulary words)
- Switch from an int count to weighted float64 scoring: intensity groups
  (vulnerability, distress, physical) score 1.5-2.0x per match vs 1.0x
  for common emotion words
- Round to 1 decimal place, cap at 10.0
- Update tests with distress/anger/physical cases including the Reddit
  failure case from calibration findings

Co-Authored-By: Virgil <virgil@lethean.io>
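
The weighted scoring described above can be sketched as follows. This is an illustration, not the scorer from this PR: the group names, word lists, and exact weights are assumptions chosen to fit the 1.0x and 1.5-2.0x ranges in the commit message, and only three groups are shown.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
)

// patternGroup pairs a vocabulary pattern with a per-match weight.
type patternGroup struct {
	name   string
	re     *regexp.Regexp
	weight float64
}

// Illustrative groups only; the real vocabulary (~40 new words) is larger.
var groups = []patternGroup{
	{"common", regexp.MustCompile(`(?i)\b(joy|compassion|tender)\b`), 1.0},
	{"distress", regexp.MustCompile(`(?i)\b(angry|furious|rage|screaming)\b`), 1.5},
	{"physical", regexp.MustCompile(`(?i)\b(bleeding|blood|pain)\b`), 2.0},
}

// scoreEmotionalRegister sums weight * match count over all groups,
// rounds to 1 decimal place, and caps the result at 10.0.
func scoreEmotionalRegister(text string) float64 {
	var total float64
	for _, g := range groups {
		matches := g.re.FindAllString(text, -1)
		total += float64(len(matches)) * g.weight
	}
	total = math.Round(total*10) / 10
	return math.Min(total, 10.0)
}

func main() {
	text := "She was screaming in pain, there was blood, and I felt blind rage."
	// 2 distress matches * 1.5 + 2 physical matches * 2.0 = 7.0
	fmt.Printf("%.1f\n", scoreEmotionalRegister(text)) // prints "7.0"
}
```

With a plain int count of positive/neutral words, the same sentence would score 0, which is the kind of miss the commit describes.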
NeRF-inspired technique for learning the relational dynamics of language:
not what words mean, but how they behave together (rhythm, pacing,
punctuation patterns, style transitions).

v1: positional field over text (baseline, memorises)
v2: masked feature prediction (relational, actually works)

Trained on Wodehouse "My Man Jeeves" (public domain, Gutenberg).
All 11 style features are highly relational — the field learns that
Wodehouse's style is a tightly coupled system.

Key finding: style interpolation between narrative and dialogue
produces sensible predictions for unmeasured features, suggesting
the continuous field captures real structural patterns.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider merged commit f4fa577405 into main 2026-03-04 09:55:09 +00:00
Snider referenced this pull request from a commit 2026-03-04 09:55:11 +00:00
Snider deleted branch feat/data-hydration 2026-03-04 09:55:13 +00:00
Reference: lthn/LEM#6