1 Scoring
Virgil edited this page 2026-02-19 18:36:27 +00:00

Scoring Engine

Architecture

The scoring engine evaluates model responses across multiple suites:

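// judge is the LLM judge, workers sets the worker pool size, suites selects the suites to run.
engine := ml.NewEngine(judge, workers, suites)
// Score evaluates every response against the configured suites.
results := engine.Score(ctx, responses)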

Scoring Suites

Heuristic (heuristic.go)

Fast rule-based scoring that requires no LLM:

  • Response length and structure
  • Repetition detection
  • Format compliance
  • Keyword presence
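
A minimal sketch of how such rule-based signals could be computed, assuming a 3-word window for repetition detection and a case-insensitive keyword check. The names `HeuristicScore`, `repetitionRatio`, and `scoreHeuristic` are hypothetical and are not taken from heuristic.go.

```go
package main

import (
	"fmt"
	"strings"
)

// HeuristicScore holds illustrative signals; the field names are hypothetical.
type HeuristicScore struct {
	WordCount   int
	Repetition  float64 // fraction of n-word windows that repeat an earlier window
	HasKeywords bool    // true when every required keyword appears
}

// repetitionRatio returns the fraction of n-grams that duplicate an earlier n-gram.
func repetitionRatio(words []string, n int) float64 {
	if n < 1 || len(words) < n {
		return 0
	}
	seen := make(map[string]bool)
	total, repeats := 0, 0
	for i := 0; i+n <= len(words); i++ {
		gram := strings.Join(words[i:i+n], " ")
		if seen[gram] {
			repeats++
		}
		seen[gram] = true
		total++
	}
	return float64(repeats) / float64(total)
}

// scoreHeuristic computes length, repetition, and keyword signals for one response.
func scoreHeuristic(response string, keywords []string) HeuristicScore {
	lower := strings.ToLower(response)
	words := strings.Fields(lower)
	hasAll := true
	for _, k := range keywords {
		if !strings.Contains(lower, strings.ToLower(k)) {
			hasAll = false
			break
		}
	}
	return HeuristicScore{
		WordCount:   len(words),
		Repetition:  repetitionRatio(words, 3),
		HasKeywords: hasAll,
	}
}

func main() {
	s := scoreHeuristic("the cat sat the cat sat the cat sat", []string{"cat"})
	fmt.Printf("%+v\n", s)
}
```

Because none of these checks call a model, they are cheap enough to run on every response before the slower suites.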

Judge (judge.go)

LLM-as-judge evaluation using a secondary model:

  • Sends response + rubric to judge model
  • Parses structured score + reasoning
  • Supports configurable rubrics per task type
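
The flow above can be sketched as follows, assuming the judge is asked to reply with a JSON object holding a 0-10 score and a reasoning string. The reply shape, the rubric texts, and the names `JudgeVerdict`, `rubricFor`, `buildJudgePrompt`, and `parseVerdict` are all assumptions for illustration; judge.go may use a different format.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// JudgeVerdict is a hypothetical shape for the judge's structured reply.
type JudgeVerdict struct {
	Score     float64 `json:"score"`
	Reasoning string  `json:"reasoning"`
}

// rubrics maps a task type to its rubric text; keys and wording are illustrative.
var rubrics = map[string]string{
	"math":    "Award 10 for a correct final answer with valid steps.",
	"default": "Award 0-10 for helpfulness and accuracy.",
}

// rubricFor returns the rubric for a task type, falling back to the default.
func rubricFor(taskType string) string {
	if r, ok := rubrics[taskType]; ok {
		return r
	}
	return rubrics["default"]
}

// buildJudgePrompt combines the rubric and the response into one prompt.
func buildJudgePrompt(rubric, response string) string {
	return "Rubric:\n" + rubric + "\n\nResponse:\n" + response +
		"\n\nReply with JSON: {\"score\": <0-10>, \"reasoning\": \"...\"}"
}

// parseVerdict decodes the span from the first '{' to the last '}' in the
// judge's raw output, so surrounding chatter is tolerated.
func parseVerdict(raw string) (JudgeVerdict, error) {
	var v JudgeVerdict
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return v, errors.New("no JSON object in judge output")
	}
	if err := json.Unmarshal([]byte(raw[start:end+1]), &v); err != nil {
		return v, fmt.Errorf("decode judge output: %w", err)
	}
	if v.Score < 0 || v.Score > 10 {
		return v, fmt.Errorf("score %v outside 0-10", v.Score)
	}
	return v, nil
}

func main() {
	fmt.Println(buildJudgePrompt(rubricFor("math"), "2 + 2 = 4"))
	// A real judge call would go here; the reply below is a stand-in.
	v, err := parseVerdict(`Sure. {"score": 7.5, "reasoning": "Mostly correct."}`)
	fmt.Println(v.Score, v.Reasoning, err)
}
```

Validating the score range at parse time keeps malformed judge output from silently skewing aggregate results.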

Exact Match (exact.go)

Deterministic scoring for tasks with known answers (e.g., GSM8K):

  • Exact string matching
  • Normalised comparison (whitespace, case)
  • Numeric extraction and comparison
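
The three strategies can be tried in order, from strictest to most lenient. The sketch below assumes that the last number in a response is its final answer, a common convention for GSM8K-style tasks but not one confirmed by exact.go. The names `normalise`, `lastNumber`, and `exactMatch` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// numberRe matches optionally signed integers and decimals, allowing
// thousands separators such as "1,234".
var numberRe = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)

// normalise lowercases s and collapses all whitespace runs to single spaces.
func normalise(s string) string {
	return strings.ToLower(strings.Join(strings.Fields(s), " "))
}

// lastNumber returns the final number found in s, if any.
func lastNumber(s string) (float64, bool) {
	found := numberRe.FindAllString(s, -1)
	if len(found) == 0 {
		return 0, false
	}
	clean := strings.ReplaceAll(found[len(found)-1], ",", "")
	n, err := strconv.ParseFloat(clean, 64)
	return n, err == nil
}

// exactMatch tries exact, then normalised, then numeric comparison.
func exactMatch(response, expected string) bool {
	if response == expected {
		return true
	}
	if normalise(response) == normalise(expected) {
		return true
	}
	got, okGot := lastNumber(response)
	want, okWant := lastNumber(expected)
	return okGot && okWant && got == want
}

func main() {
	fmt.Println(exactMatch("  Paris ", "paris"))
	fmt.Println(exactMatch("So the total is 1,234 apples.", "1234"))
	fmt.Println(exactMatch("The answer is 18", "72"))
}
```

Numeric comparison parses both sides to floats, so "72" and "72.0" are treated as the same answer.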

Probes (probes.go)

Ethics-aware evaluation probes:

  • Safety and harm detection
  • Bias assessment
  • Instruction-following compliance
  • Output format validation

Concurrent Execution

The worker pool in worker.go runs scoring suites in parallel across multiple responses. The worker count is configurable: more workers raise throughput at the cost of higher resource usage.