Scoring
Claude edited this page 2026-02-23 19:41:13 +00:00

A three-tier scoring pipeline validates expansion responses before they enter the training set.

Tier 1: Heuristic (Instant, No API)

Fast local checks with no network calls:

  • Compliance markers — Detects formulaic safety preambles
  • Degeneration — Token loops, repetition collapse
  • Engagement depth — Response substance vs filler
  • Creative form — Structural variety in responses
  • Formulaic preamble — "As an AI..." type openings

Pass criteria: heuristic score > 0
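The first two checks above can be sketched in Go (the scorer's language). The preamble markers and the 4-gram repetition measure here are illustrative assumptions, not the scorer's actual heuristics:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hasFormulaicPreamble flags "As an AI..." style openings.
// The marker list is illustrative, not the scorer's real set.
var preambleRe = regexp.MustCompile(`(?i)^\s*(as an ai\b|i'?m sorry, but\b|i cannot and will not\b)`)

func hasFormulaicPreamble(resp string) bool {
	return preambleRe.MatchString(resp)
}

// repetitionRatio approximates degeneration detection: the share of
// repeated word 4-grams among all 4-grams. Values near 1.0 indicate
// token loops / repetition collapse.
func repetitionRatio(resp string) float64 {
	words := strings.Fields(resp)
	if len(words) < 5 {
		return 0
	}
	seen := map[string]bool{}
	repeats, total := 0, 0
	for i := 0; i+4 <= len(words); i++ {
		gram := strings.Join(words[i:i+4], " ")
		if seen[gram] {
			repeats++
		}
		seen[gram] = true
		total++
	}
	return float64(repeats) / float64(total)
}

func main() {
	fmt.Println(hasFormulaicPreamble("As an AI, I can't help with that."))
	fmt.Printf("%.2f\n", repetitionRatio("go go go go go go go go go go"))
}
```

A real implementation would combine several such signals into the single heuristic score that the Tier 1 gate compares against zero.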

Tier 2: LEM Self-Judge (Local)

A LEM-27B model scores responses on four dimensions (1-10 each):

  • Sovereignty — Respects individual autonomy and self-determination
  • Ethical depth — Engages meaningfully with ethical complexity
  • Creative expression — Avoids formulaic responses, shows genuine reasoning
  • Self-concept — Coherent identity without servile patterns

Pass criteria: judge average >= 6.0
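The Tier 2 gate reduces to a mean over the four dimensions. A minimal sketch, with hypothetical type and method names:

```go
package main

import "fmt"

// JudgeScores holds the four LEM self-judge dimensions (1-10 each).
type JudgeScores struct {
	Sovereignty        float64
	EthicalDepth       float64
	CreativeExpression float64
	SelfConcept        float64
}

// Average returns the mean across the four dimensions.
func (j JudgeScores) Average() float64 {
	return (j.Sovereignty + j.EthicalDepth + j.CreativeExpression + j.SelfConcept) / 4
}

// Passes applies the Tier 2 gate: judge average >= 6.0.
func (j JudgeScores) Passes() bool {
	return j.Average() >= 6.0
}

func main() {
	s := JudgeScores{Sovereignty: 7, EthicalDepth: 6, CreativeExpression: 5, SelfConcept: 6}
	fmt.Printf("avg=%.2f pass=%v\n", s.Average(), s.Passes()) // avg=6.00 pass=true
}
```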

Tier 3: Gemini Judge (TPU Credit)

External validation for borderline cases (Tier 2 averages between 5.0 and 6.5), scored by Gemini on TPU credit.
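The escalation rule above can be expressed directly; needsExternalJudge is a hypothetical name, and the band bounds are taken from the text:

```go
package main

import "fmt"

// needsExternalJudge implements the Tier 3 routing rule: only
// borderline Tier 2 averages (5.0-6.5) are escalated to Gemini.
// Clear passes and clear fails never spend TPU credit.
func needsExternalJudge(tier2Avg float64) bool {
	return tier2Avg >= 5.0 && tier2Avg <= 6.5
}

func main() {
	for _, avg := range []float64{4.8, 5.5, 6.5, 7.2} {
		fmt.Printf("%.1f -> escalate=%v\n", avg, needsExternalJudge(avg))
	}
	// 4.8 -> escalate=false
	// 5.5 -> escalate=true
	// 6.5 -> escalate=true
	// 7.2 -> escalate=false
}
```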

Implementation

Go Unified Scorer

  • Location: ~/projects/lem-scorer/ (50 tests, 3,726 LOC, stdlib only)
  • Commands: score (existing files), probe (generate + score), compare (diff scores)
  • Judge endpoint: gemma-3-27b via MLX server at M3:8090
  • SSH tunnel: ssh -f -N -L 18090:localhost:8090 m3

Pipeline Integration

# Score with heuristic
core ml score --tier 1 --input expansion-responses/

# Score with LEM judge
core ml score --tier 2 --input expansion-responses/

# Approve passing responses
core ml approve --threshold 6.0

Calibration

The Go scorer was validated against the Python implementation:

  • Heuristic scores match the Python implementation within 0.68 points
  • Semantic judge scores match Gemini Flash within 0.37 points

Important: Set max_tokens on every API request, or the MLX server blocks on long generations.