Scoring
Virgil edited this page 2026-02-19 18:36:27 +00:00
Scoring Engine
Architecture
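The scoring engine evaluates each model response against several scoring suites. An engine is constructed from a judge, a worker count, and the suites to run, and then scores a batch of responses: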
engine := ml.NewEngine(judge, workers, suites)
results := engine.Score(ctx, responses)
Scoring Suites
Heuristic (heuristic.go)
Fast rule-based scoring that requires no LLM:
- Response length and structure
- Repetition detection
- Format compliance
- Keyword presence
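As a rough illustration of the checks listed above, the following minimal sketch computes a word count, a repetition ratio, and a keyword hit count. The `HeuristicScore` type, the `scoreHeuristic` function, and the metric definitions are assumptions made for this example; heuristic.go may define and weight these signals differently.

```go
package main

import (
	"fmt"
	"strings"
)

// HeuristicScore is a hypothetical result type used only in this sketch.
type HeuristicScore struct {
	WordCount     int     // number of whitespace-separated words
	RepeatedRatio float64 // fraction of words that repeat an earlier word
	KeywordsFound int     // how many of the expected keywords appear
}

// scoreHeuristic applies simple rule-based checks without calling any model.
func scoreHeuristic(response string, keywords []string) HeuristicScore {
	lower := strings.ToLower(response)
	words := strings.Fields(lower)

	// Repetition detection: count words already seen earlier in the response.
	seen := make(map[string]bool, len(words))
	repeats := 0
	for _, w := range words {
		if seen[w] {
			repeats++
		}
		seen[w] = true
	}
	var ratio float64
	if len(words) > 0 {
		ratio = float64(repeats) / float64(len(words))
	}

	// Keyword presence: case-insensitive substring match.
	found := 0
	for _, k := range keywords {
		if strings.Contains(lower, strings.ToLower(k)) {
			found++
		}
	}

	return HeuristicScore{WordCount: len(words), RepeatedRatio: ratio, KeywordsFound: found}
}

func main() {
	s := scoreHeuristic("the cat sat on the mat", []string{"cat", "dog"})
	fmt.Printf("%+v\n", s)
}
```

Because every check is a single pass over the text, a suite like this can run on every response before any slower judge call is made.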
Judge (judge.go)
LLM-as-judge evaluation using a secondary model:
- Sends the response and a rubric to the judge model
- Parses a structured score and the judge's reasoning from the reply
- Supports configurable rubrics per task type
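The parsing step can be sketched as follows, assuming the judge is asked to reply with a JSON object containing `score` and `reasoning` fields. That reply format, the `JudgeVerdict` type, and the `parseVerdict` function are assumptions for illustration; judge.go may use a different schema.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// JudgeVerdict is a hypothetical shape for the judge's structured reply.
type JudgeVerdict struct {
	Score     float64 `json:"score"`
	Reasoning string  `json:"reasoning"`
}

// parseVerdict extracts the outermost {...} span from the judge output and
// decodes it, since judge models often wrap JSON in surrounding prose.
func parseVerdict(raw string) (JudgeVerdict, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return JudgeVerdict{}, errors.New("no JSON object in judge output")
	}

	var v JudgeVerdict
	if err := json.Unmarshal([]byte(raw[start:end+1]), &v); err != nil {
		return JudgeVerdict{}, fmt.Errorf("decode verdict: %w", err)
	}
	return v, nil
}

func main() {
	raw := `Here is my assessment: {"score": 7.5, "reasoning": "Correct but terse."}`
	v, err := parseVerdict(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("score=%.1f reasoning=%q\n", v.Score, v.Reasoning)
}
```

Returning an error rather than a zero score lets the caller distinguish a malformed judge reply from a genuinely low score.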
Exact Match (exact.go)
Deterministic scoring for tasks with known answers (e.g., GSM8K):
- Exact string matching
- Normalised comparison (whitespace, case)
- Numeric extraction and comparison
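The three comparison levels above could be layered as in the sketch below. The helper names (`normalise`, `lastNumber`, `exactMatch`) are hypothetical, and taking the last number in the response as the answer is an assumption borrowed from common GSM8K-style evaluation, not something confirmed for exact.go.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// numberRe matches an optionally signed integer or decimal.
var numberRe = regexp.MustCompile(`-?\d+(?:\.\d+)?`)

// normalise lowercases the text and collapses runs of whitespace.
func normalise(s string) string {
	return strings.ToLower(strings.Join(strings.Fields(s), " "))
}

// lastNumber returns the final number in s, ignoring thousands separators.
// Note: removing commas also merges digits in text such as "3,4".
func lastNumber(s string) (float64, bool) {
	s = strings.ReplaceAll(s, ",", "")
	matches := numberRe.FindAllString(s, -1)
	if len(matches) == 0 {
		return 0, false
	}
	n, err := strconv.ParseFloat(matches[len(matches)-1], 64)
	return n, err == nil
}

// exactMatch tries the strictest comparison first, then progressively looser ones.
func exactMatch(response, expected string) bool {
	if response == expected {
		return true
	}
	if normalise(response) == normalise(expected) {
		return true
	}
	got, okGot := lastNumber(response)
	want, okWant := lastNumber(expected)
	return okGot && okWant && got == want
}

func main() {
	fmt.Println(exactMatch("So the total is 1,234.", "1234"))
	fmt.Println(exactMatch("  Paris ", "paris"))
	fmt.Println(exactMatch("London", "Paris"))
}
```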
Probes (probes.go)
Ethics-aware evaluation probes:
- Safety and harm detection
- Bias assessment
- Instruction-following compliance
- Output format validation
Concurrent Execution
The worker pool in worker.go runs scoring suites in parallel across multiple responses. The worker count is configurable, so throughput can be traded against resource usage.
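A fixed-size worker pool of this kind might look like the following sketch. The `Response` and `Suite` types and the `scoreAll` function are hypothetical stand-ins, not the types used in worker.go. This version parallelises across responses and runs the suites for each response sequentially inside one worker, which is only one of several possible designs.

```go
package main

import (
	"fmt"
	"sync"
)

// Response and Suite are hypothetical types for this sketch.
type Response struct {
	ID   string
	Text string
}

type Suite func(Response) float64

// scoreAll fans responses out to a fixed number of workers and collects
// per-suite scores keyed by response ID.
func scoreAll(responses []Response, suites map[string]Suite, workers int) map[string]map[string]float64 {
	if workers < 1 {
		workers = 1
	}

	jobs := make(chan Response)
	results := make(map[string]map[string]float64, len(responses))
	var mu sync.Mutex // guards results
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				scores := make(map[string]float64, len(suites))
				for name, suite := range suites {
					scores[name] = suite(r)
				}
				mu.Lock()
				results[r.ID] = scores
				mu.Unlock()
			}
		}()
	}

	for _, r := range responses {
		jobs <- r
	}
	close(jobs) // lets workers exit their range loops
	wg.Wait()
	return results
}

func main() {
	suites := map[string]Suite{
		"length": func(r Response) float64 { return float64(len(r.Text)) },
	}
	responses := []Response{{ID: "a", Text: "hello"}, {ID: "b", Text: "hi"}}
	fmt.Println(scoreAll(responses, suites, 4))
}
```

A small worker count keeps memory use and judge-model request rates bounded, while a larger one shortens wall-clock time when suites spend most of their time waiting on I/O.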