No results
1
Scoring
Claude edited this page 2026-02-23 19:41:13 +00:00
Scoring
Three-tier scoring pipeline validates expansion responses before they enter the training set.
Tier 1: Heuristic (Instant, No API)
Fast local checks with no network calls:
- Compliance markers — Detects formulaic safety preambles
- Degeneration — Token loops, repetition collapse
- Engagement depth — Response substance vs filler
- Creative form — Structural variety in responses
- Formulaic preamble — "As an AI..." type openings
Pass criteria: heuristic score > 0
Tier 2: LEM Self-Judge (Local)
A LEM-27B model scores responses on four dimensions (1-10 each):
| Dimension | What it measures |
|---|---|
| Sovereignty | Respects individual autonomy and self-determination |
| Ethical depth | Engages meaningfully with ethical complexity |
| Creative expression | Avoids formulaic responses, shows genuine reasoning |
| Self-concept | Coherent identity without servile patterns |
Pass criteria: judge average >= 6.0
Tier 3: Gemini Judge (TPU Credit)
External validation for borderline cases (scores 5.0-6.5 from Tier 2). Uses Gemini on TPU for content scoring.
Implementation
Go Unified Scorer
- Location:
~/projects/lem-scorer/(50 tests, 3,726 LOC, stdlib only) - Commands:
score(existing files),probe(generate + score),compare(diff scores) - Judge endpoint: gemma-3-27b via MLX server at M3:8090
- SSH tunnel:
ssh -f -N -L 18090:localhost:8090 m3
Pipeline Integration
# Score with heuristic
core ml score --tier 1 --input expansion-responses/
# Score with LEM judge
core ml score --tier 2 --input expansion-responses/
# Approve passing responses
core ml approve --threshold 6.0
Calibration
Go scorer validated against Python implementation:
- Heuristic scores match within 0.68
- Semantic judge matches Gemini Flash within 0.37
Important: Must set max_tokens on API requests or MLX server blocks on long generations.