Runs the same prompts through baseline and fine-tuned models, scores
both with the heuristic scorer, and outputs a comparison report with
LEK score deltas and improvement/regression counts.
Uses built-in content probes by default, or a custom prompts file.
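
The comparison flow described above can be sketched roughly as follows.
This is an illustration only, not the code in this commit: the names
compare_models and Comparison are hypothetical, the models are plain
callables, and the scorer is passed in as a function, so the sketch
assumes nothing about how the actual heuristic scorer computes a LEK
score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Comparison:
    """Hypothetical report shape: per-prompt deltas plus summary counts."""
    deltas: Dict[str, float]  # tuned score minus baseline score, per prompt
    improved: int
    regressed: int
    unchanged: int


def compare_models(
    prompts: List[str],
    baseline: Callable[[str], str],
    tuned: Callable[[str], str],
    score: Callable[[str], float],
) -> Comparison:
    """Run every prompt through both models and score both responses."""
    deltas: Dict[str, float] = {}
    improved = regressed = unchanged = 0
    for prompt in prompts:
        # The same prompt and the same scorer are used for both models.
        delta = score(tuned(prompt)) - score(baseline(prompt))
        deltas[prompt] = delta
        if delta > 0:
            improved += 1
        elif delta < 0:
            regressed += 1
        else:
            unchanged += 1
    return Comparison(deltas, improved, regressed, unchanged)
```

Passing the models and scorer as callables keeps the sketch runnable
with stubs; a real run would substitute the actual model backends and
the heuristic scorer.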

Co-Authored-By: Virgil <virgil@lethean.io>