- `run_benchmarks.sh`: wrapper for lm-eval with suite presets (quick, classic, leaderboard-v2, full)
- `compare_models.py`: compares base vs. LEK results and prints a delta table (a minimal sketch of this comparison follows the file list below)
- Supports HF transformers, local-chat-completions (MLX/Ollama), and vLLM backends
- Results are comparable to the HuggingFace Open LLM Leaderboard
Files in this directory:

- `compare_models.py`
- `compare_v1_v2.py`
- `export_parquet.py`
- `lem_benchmark.py`
- `lem_cross_arch_benchmark.py`
- `lem_cross_arch_train.py`
- `lem_gemini3_generate.py`
- `lem_gemini3flash_generate.py`
- `lem_gemini25flash_generate.py`
- `lem_scale_benchmark.py`
- `lem_scale_scorer.py`
- `lem_scorer.py`
- `lem_self_scorer.py`
- `lem_semantic_scorer.py`
- `lem_standard_benchmark.py`
- `lem_standard_scorer.py`
- `push_all_models.py`
- `run_benchmarks.sh`
- `sync_hf.py`
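
The delta table produced by `compare_models.py` works from two lm-eval result files, one for the base model and one for the LEK model. The snippet below is a minimal sketch of that comparison, not the script itself; the `{"results": {task: {metric: value}}}` JSON layout and the `acc,none` metric key are assumptions about the lm-eval output format, and the file paths are placeholders.

```python
# Hypothetical sketch of the base-vs-LEK comparison in compare_models.py:
# load two lm-eval results files and print per-task score deltas.
import json
import sys


def load_results(path: str) -> dict:
    """Return the per-task metric dict from an lm-eval results JSON file.

    Assumes the file has the shape {"results": {task: {metric: value}}}.
    """
    with open(path) as f:
        return json.load(f)["results"]


def delta_table(base_path: str, lek_path: str, metric: str = "acc,none") -> None:
    """Print base score, LEK score, and delta for every task both runs share."""
    base, lek = load_results(base_path), load_results(lek_path)
    print(f"{'task':<30} {'base':>8} {'LEK':>8} {'delta':>8}")
    for task in sorted(base.keys() & lek.keys()):
        b, l = base[task].get(metric), lek[task].get(metric)
        if b is None or l is None:  # metric not reported for this task
            continue
        print(f"{task:<30} {b:8.4f} {l:8.4f} {l - b:+8.4f}")


if __name__ == "__main__":
    # e.g. python compare_sketch.py base_results.json lek_results.json
    delta_table(sys.argv[1], sys.argv[2])
```

Positive deltas indicate tasks where the LEK model improves on the base model; the real script may add aggregation or formatting beyond this sketch.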