Add standard benchmark suite (lm-evaluation-harness) #3

Open
Charon wants to merge 3 commits from Charon/LEM:feat/standard-benchmarks into main

Summary

  • scripts/run_benchmarks.sh — wrapper for EleutherAI's lm-evaluation-harness
  • scripts/compare_models.py — compare base vs LEK results with delta tables
  • Results are directly comparable to HuggingFace Open LLM Leaderboard

Benchmark Suites

| Suite | Tasks | Purpose |
|-------|-------|---------|
| quick | GSM8K, HellaSwag, TruthfulQA, ARC, Winogrande | Fast sanity check (~10 min) |
| classic | quick + MMLU | Open LLM Leaderboard v1 |
| leaderboard-v2 | IFEval, BBH, GPQA, MuSR, MMLU-PRO | Current standard (harder) |
| full | All of the above | Complete evaluation |
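For orientation, a plausible shape for the suite presets inside run_benchmarks.sh is sketched below. Task names follow lm-evaluation-harness conventions, but the exact names the wrapper uses, and the `$MODEL`/`$RUN_NAME` variables, are assumptions for illustration:

```bash
# Hypothetical suite -> lm-eval task mapping (task names assumed, not verified
# against the actual wrapper):
case "$SUITE" in
  quick)          TASKS="gsm8k,hellaswag,truthfulqa_mc2,arc_challenge,winogrande" ;;
  classic)        TASKS="gsm8k,hellaswag,truthfulqa_mc2,arc_challenge,winogrande,mmlu" ;;
  leaderboard-v2) TASKS="leaderboard_ifeval,leaderboard_bbh,leaderboard_gpqa,leaderboard_musr,leaderboard_mmlu_pro" ;;
  full)           TASKS="gsm8k,hellaswag,truthfulqa_mc2,arc_challenge,winogrande,mmlu,leaderboard_ifeval,leaderboard_bbh,leaderboard_gpqa,leaderboard_musr,leaderboard_mmlu_pro" ;;
  *) echo "unknown suite: $SUITE" >&2; exit 1 ;;
esac

# Hand the resolved task list to the harness.
lm_eval --model "$MODEL" --tasks "$TASKS" --output_path "benchmarks/lm-eval-results/$RUN_NAME"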

Usage

```bash
pipx install lm-eval

# Base model
./scripts/run_benchmarks.sh --model hf --model-id google/gemma-3-12b-it --suite quick

# LEK model via MLX API
./scripts/run_benchmarks.sh --model local-chat-completions --api-url http://localhost:8090/v1 --suite quick

# Compare results
python3 scripts/compare_models.py --base benchmarks/lm-eval-results/base_run --lek benchmarks/lm-eval-results/lek_run
```
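The core of the comparison step is a per-task delta over the two result sets. A minimal sketch of that computation follows, done here with jq rather than the actual compare_models.py; the `results_*.json` file layout and the `"acc,none"` metric key are assumptions about lm-eval's output format, and only accuracy-style metrics are diffed:

```bash
# Hypothetical delta table over two lm-eval runs (file names and metric key
# are assumptions about the harness's output, not taken from compare_models.py).
base=$(ls benchmarks/lm-eval-results/base_run/results_*.json | head -n1)
lek=$(ls benchmarks/lm-eval-results/lek_run/results_*.json | head -n1)

for task in $(jq -r '.results | keys[]' "$base"); do
  b=$(jq -r --arg t "$task" '.results[$t]["acc,none"] // empty' "$base")
  l=$(jq -r --arg t "$task" '.results[$t]["acc,none"] // empty' "$lek")
  if [ -n "$b" ] && [ -n "$l" ]; then
    # Print base score, LEK score, and signed delta per task.
    printf '%-16s base=%.4f lek=%.4f delta=%+.4f\n' \
      "$task" "$b" "$l" "$(echo "$l - $b" | bc -l)"
  fi
done
```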

Backends

  • hf: HuggingFace transformers (local GPU)
  • local-chat-completions: OpenAI-compatible API (MLX server, Ollama, llama.cpp); see the sketch after this list
  • vllm: vLLM inference server
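For the API backend, the wrapper presumably reduces to a direct lm_eval call along these lines. The flags come from the lm-evaluation-harness docs; the model name, the task list, and the need for --apply_chat_template are assumptions:

```bash
# Sketch of a direct harness invocation against an OpenAI-compatible server.
# Note base_url points at the full chat-completions endpoint, per the harness docs.
lm_eval \
  --model local-chat-completions \
  --model_args model=lek-gemma-3-12b,base_url=http://localhost:8090/v1/chat/completions,num_concurrent=1 \
  --tasks gsm8k,hellaswag,truthfulqa_mc2,arc_challenge,winogrande \
  --apply_chat_template \
  --output_path benchmarks/lm-eval-results/lek_run
```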

Generated with Claude Code (claude.ai/code)

Charon added 2 commits 2026-02-15 00:06:08 +00:00
- Add 4 missing model cards: Gemma3-1B-layered (v1+v2), Gemma3-27B, GPT-OSS-20B
- All 9 HF models now have cards in paper/hf-cards/
- sync_hf.py: push cards + benchmarks + training data to HuggingFace
- export_parquet.py: convert JSONL training splits to Parquet (HF dataset format)
- Parquet schema: prompt, response, system, messages (JSON)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- run_benchmarks.sh: wrapper for lm-eval with suite presets (quick, classic, leaderboard-v2, full)
- compare_models.py: compare base vs LEK results with delta table
- Supports HF transformers, local-chat-completions (MLX/Ollama), and vLLM backends
- Results comparable to HuggingFace Open LLM Leaderboard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snider added 1 commit 2026-02-15 00:13:49 +00:00
This pull request has changes conflicting with the target branch.
  • .gitignore
Command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u feat/standard-benchmarks:Charon-feat/standard-benchmarks
git checkout Charon-feat/standard-benchmarks

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

Pick one of the following strategies, then push.

# Merge (create a merge commit)
git checkout main
git merge --no-ff Charon-feat/standard-benchmarks

# Rebase then fast-forward
git checkout Charon-feat/standard-benchmarks
git rebase main
git checkout main
git merge --ff-only Charon-feat/standard-benchmarks

# Rebase then merge
git checkout Charon-feat/standard-benchmarks
git rebase main
git checkout main
git merge --no-ff Charon-feat/standard-benchmarks

# Squash
git checkout main
git merge --squash Charon-feat/standard-benchmarks

# Fast-forward only
git checkout main
git merge --ff-only Charon-feat/standard-benchmarks

# Manual merge
git checkout main
git merge Charon-feat/standard-benchmarks

# Update on Forgejo
git push origin main
Reference: lthn/LEM#3