Add standard benchmark suite (lm-evaluation-harness) #3

Open
Charon wants to merge 3 commits from Charon/LEM:feat/standard-benchmarks into main

Summary

  • scripts/run_benchmarks.sh — wrapper for EleutherAI's lm-evaluation-harness
  • scripts/compare_models.py — compare base vs LEK results with delta tables
  • Results are directly comparable to HuggingFace Open LLM Leaderboard

Benchmark Suites

| Suite | Tasks | Purpose |
|-------|-------|---------|
| quick | GSM8K, HellaSwag, TruthfulQA, ARC, Winogrande | Fast sanity check (~10 min) |
| classic | quick + MMLU | Open LLM Leaderboard v1 |
| leaderboard-v2 | IFEval, BBH, GPQA, MuSR, MMLU-PRO | Current standard (harder) |
| full | All of the above | Complete evaluation |
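For orientation, a plausible shape for the suite presets inside run_benchmarks.sh is sketched below. Task names follow lm-evaluation-harness conventions, but the exact names the wrapper uses, and the `$MODEL`/`$RUN_NAME` variables, are assumptions for illustration:

```bash
# Hypothetical suite -> lm-eval task mapping (task names assumed, not verified
# against the actual wrapper):
case "$SUITE" in
  quick)          TASKS="gsm8k,hellaswag,truthfulqa_mc2,arc_challenge,winogrande" ;;
  classic)        TASKS="gsm8k,hellaswag,truthfulqa_mc2,arc_challenge,winogrande,mmlu" ;;
  leaderboard-v2) TASKS="leaderboard_ifeval,leaderboard_bbh,leaderboard_gpqa,leaderboard_musr,leaderboard_mmlu_pro" ;;
  full)           TASKS="gsm8k,hellaswag,truthfulqa_mc2,arc_challenge,winogrande,mmlu,leaderboard_ifeval,leaderboard_bbh,leaderboard_gpqa,leaderboard_musr,leaderboard_mmlu_pro" ;;
  *) echo "unknown suite: $SUITE" >&2; exit 1 ;;
esac

# Hand the resolved task list to the harness.
lm_eval --model "$MODEL" --tasks "$TASKS" --output_path "benchmarks/lm-eval-results/$RUN_NAME"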

Usage

```bash
pipx install lm-eval

# Base model
./scripts/run_benchmarks.sh --model hf --model-id google/gemma-3-12b-it --suite quick

# LEK model via MLX API
./scripts/run_benchmarks.sh --model local-chat-completions --api-url http://localhost:8090/v1 --suite quick

# Compare results
python3 scripts/compare_models.py --base benchmarks/lm-eval-results/base_run --lek benchmarks/lm-eval-results/lek_run
```
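The core of the comparison step is a per-task delta over the two result sets. A minimal sketch of that computation follows, done here with jq rather than the actual compare_models.py; the `results_*.json` file layout and the `"acc,none"` metric key are assumptions about lm-eval's output format, and only accuracy-style metrics are diffed:

```bash
# Hypothetical delta table over two lm-eval runs (file names and metric key
# are assumptions about the harness's output, not taken from compare_models.py).
base=$(ls benchmarks/lm-eval-results/base_run/results_*.json | head -n1)
lek=$(ls benchmarks/lm-eval-results/lek_run/results_*.json | head -n1)

for task in $(jq -r '.results | keys[]' "$base"); do
  b=$(jq -r --arg t "$task" '.results[$t]["acc,none"] // empty' "$base")
  l=$(jq -r --arg t "$task" '.results[$t]["acc,none"] // empty' "$lek")
  if [ -n "$b" ] && [ -n "$l" ]; then
    # Print base score, LEK score, and signed delta per task.
    printf '%-16s base=%.4f lek=%.4f delta=%+.4f\n' \
      "$task" "$b" "$l" "$(echo "$l - $b" | bc -l)"
  fi
done
```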

Backends

  • hf: HuggingFace transformers (local GPU)
  • local-chat-completions: OpenAI-compatible API (MLX server, Ollama, llama.cpp); see the sketch after this list
  • vllm: vLLM inference server
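For the API backend, the wrapper presumably reduces to a direct lm_eval call along these lines. The flags come from the lm-evaluation-harness docs; the model name, the task list, and the need for --apply_chat_template are assumptions:

```bash
# Sketch of a direct harness invocation against an OpenAI-compatible server.
# Note base_url points at the full chat-completions endpoint, per the harness docs.
lm_eval \
  --model local-chat-completions \
  --model_args model=lek-gemma-3-12b,base_url=http://localhost:8090/v1/chat/completions,num_concurrent=1 \
  --tasks gsm8k,hellaswag,truthfulqa_mc2,arc_challenge,winogrande \
  --apply_chat_template \
  --output_path benchmarks/lm-eval-results/lek_run
```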

Generated with Claude Code (claude.ai/code)

Charon added 2 commits 2026-02-15 00:06:08 +00:00
- Add 4 missing model cards: Gemma3-1B-layered (v1+v2), Gemma3-27B, GPT-OSS-20B
- All 9 HF models now have cards in paper/hf-cards/
- sync_hf.py: push cards + benchmarks + training data to HuggingFace
- export_parquet.py: convert JSONL training splits to Parquet (HF dataset format)
- Parquet schema: prompt, response, system, messages (JSON)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- run_benchmarks.sh: wrapper for lm-eval with suite presets (quick, classic, leaderboard-v2, full)
- compare_models.py: compare base vs LEK results with delta table
- Supports HF transformers, local-chat-completions (MLX/Ollama), and vLLM backends
- Results comparable to HuggingFace Open LLM Leaderboard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snider added 1 commit 2026-02-15 00:13:49 +00:00
This pull request has changes conflicting with the target branch.
  • .gitignore
Command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u feat/standard-benchmarks:Charon-feat/standard-benchmarks
git checkout Charon-feat/standard-benchmarks

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

Pick one of the following strategies, then push.

# Merge (create a merge commit)
git checkout main
git merge --no-ff Charon-feat/standard-benchmarks

# Rebase then fast-forward
git checkout Charon-feat/standard-benchmarks
git rebase main
git checkout main
git merge --ff-only Charon-feat/standard-benchmarks

# Rebase then merge
git checkout Charon-feat/standard-benchmarks
git rebase main
git checkout main
git merge --no-ff Charon-feat/standard-benchmarks

# Squash
git checkout main
git merge --squash Charon-feat/standard-benchmarks

# Fast-forward only
git checkout main
git merge --ff-only Charon-feat/standard-benchmarks

# Manual merge
git checkout main
git merge Charon-feat/standard-benchmarks

# Update on Forgejo
git push origin main
Reference: lthn/LEM#3