- run_benchmarks.sh: wrapper for lm-eval with suite presets (quick, classic, leaderboard-v2, full)
- compare_models.py: compare base vs LEK results with delta table (see the sketch after this list)
- Supports HF transformers, local-chat-completions (MLX/Ollama), and vLLM backends
- Results comparable to HuggingFace Open LLM Leaderboard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
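A minimal sketch of what the base-vs-LEK delta table in compare_models.py might look like. It assumes lm-eval-harness JSON output with a top-level "results" mapping of task name to metric values; the real script's file layout, arguments, and formatting may differ.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: compare two lm-eval results files and print a delta table."""
import json
import sys


def load_results(path: str) -> dict:
    # lm-eval-harness results.json keeps per-task metrics under "results".
    with open(path) as f:
        return json.load(f)["results"]


def main(base_path: str, lek_path: str) -> None:
    base, lek = load_results(base_path), load_results(lek_path)
    print(f"{'task':<30} {'metric':<15} {'base':>8} {'LEK':>8} {'delta':>8}")
    for task in sorted(base.keys() & lek.keys()):
        for metric, base_val in base[task].items():
            lek_val = lek[task].get(metric)
            # Skip non-numeric entries such as version strings or aliases.
            if not isinstance(base_val, (int, float)) or not isinstance(lek_val, (int, float)):
                continue
            print(f"{task:<30} {metric:<15} {base_val:8.4f} {lek_val:8.4f} {lek_val - base_val:+8.4f}")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Invocation would be something like `python compare_models.py <base_results.json> <lek_results.json>`, with the result paths pointing wherever run_benchmarks.sh wrote its output.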
.gitignore:

.DS_Store
.idea/
__pycache__/
*.pyc

# Worker output (generated locally, not committed)
worker/output/

# Parquet exports (generated, sync to HF via scripts/sync_hf.py)
training/parquet/

# lm-eval-harness results (large, stored locally)
benchmarks/lm-eval-results/
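The parquet exports ignored above are pushed to the Hugging Face Hub by scripts/sync_hf.py. A minimal sketch of one way that sync could work, assuming the huggingface_hub upload_folder API and a hypothetical dataset repo id; the real script may behave differently.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a parquet -> Hugging Face Hub sync."""
from huggingface_hub import HfApi

PARQUET_DIR = "training/parquet"   # local exports (gitignored above)
REPO_ID = "your-org/your-dataset"  # hypothetical dataset repo id


def main() -> None:
    api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
    api.upload_folder(
        folder_path=PARQUET_DIR,
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=["*.parquet"],  # only push the parquet exports
        commit_message="Sync parquet exports",
    )


if __name__ == "__main__":
    main()
```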