Add standard benchmark suite (lm-evaluation-harness) #3

Open
Charon wants to merge 3 commits from Charon/LEM:feat/standard-benchmarks into main

3 commits

SHA1 Message Date
da3db34e45 Merge branch 'main' into feat/standard-benchmarks 2026-02-15 00:13:47 +00:00
abd63d3342
Add standard benchmark suite using EleutherAI lm-evaluation-harness
- run_benchmarks.sh: wrapper for lm-eval with suite presets (quick, classic, leaderboard-v2, full); see the Python sketch after this commit entry
- compare_models.py: compare base vs LEK results with a delta table (second sketch below)
- Supports HF transformers, local-chat-completions (MLX/Ollama), and vLLM backends
- Results comparable to HuggingFace Open LLM Leaderboard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 00:05:48 +00:00
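A minimal sketch of the preset-to-task mapping that run_benchmarks.sh presumably implements, written in Python for clarity. The suite contents, model ids, and endpoint below are assumptions; the `lm_eval` CLI, its `--model`, `--model_args`, `--tasks`, `--batch_size`, and `--output_path` flags, and the `hf`, `vllm`, and `local-chat-completions` model types are real lm-evaluation-harness options.

```python
#!/usr/bin/env python3
"""Sketch of the suite-preset logic run_benchmarks.sh likely wraps."""
import subprocess

# Hypothetical preset contents; the actual task lists live in run_benchmarks.sh.
SUITES = {
    "quick": "arc_easy,hellaswag",
    "classic": "arc_challenge,hellaswag,mmlu,winogrande,gsm8k,truthfulqa_mc2",
    "leaderboard-v2": "leaderboard",  # lm-eval's Open LLM Leaderboard v2 task group
    # "full" omitted here; its task list is whatever run_benchmarks.sh defines.
}

def run_suite(suite: str, model_args: str, backend: str = "hf",
              output_dir: str = "results") -> None:
    """Invoke lm-eval for one suite preset and write JSON results."""
    cmd = [
        "lm_eval",
        "--model", backend,          # hf | vllm | local-chat-completions
        "--model_args", model_args,
        "--tasks", SUITES[suite],
        "--batch_size", "auto",
        "--output_path", output_dir,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # HF transformers backend (model id is an example):
    run_suite("quick", "pretrained=google/gemma-3-1b-it")
    # OpenAI-compatible local server (MLX/Ollama); endpoint is an assumption:
    # run_suite("quick",
    #           "model=lek,base_url=http://localhost:11434/v1/chat/completions",
    #           backend="local-chat-completions")
```

The same `backend` switch covers all three runtimes listed above: `hf` for transformers, `local-chat-completions` for an OpenAI-compatible MLX or Ollama server, and `vllm` for vLLM.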
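And a minimal sketch of the delta table compare_models.py likely prints, assuming both inputs are lm-eval output files (a top-level "results" dict of per-task metric dicts; exact metric keys such as "acc,none" vary by task):

```python
#!/usr/bin/env python3
"""Sketch of a compare_models.py-style delta table over lm-eval JSON output."""
import json
import sys

def load_results(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["results"]

def main(base_path: str, lek_path: str) -> None:
    base, lek = load_results(base_path), load_results(lek_path)
    print(f"{'task':<30} {'metric':<20} {'base':>8} {'LEK':>8} {'delta':>8}")
    # Only compare tasks present in both result files.
    for task in sorted(base.keys() & lek.keys()):
        for metric, b in base[task].items():
            l = lek[task].get(metric)
            # Skip non-numeric entries such as the "alias" field.
            if not isinstance(b, (int, float)) or not isinstance(l, (int, float)):
                continue
            print(f"{task:<30} {metric:<20} {b:>8.4f} {l:>8.4f} {l - b:>+8.4f}")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```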
b8f9191b05
Add missing HF model cards, sync script, and Parquet export
- Add 4 missing model cards: Gemma3-1B-layered (v1+v2), Gemma3-27B, GPT-OSS-20B
- All 9 HF models now have cards in paper/hf-cards/
- sync_hf.py: push cards + benchmarks + training data to HuggingFace (sketch after this entry)
- export_parquet.py: convert JSONL training splits to Parquet (HF dataset format); see the export sketch below
- Parquet schema: prompt, response, system, messages (JSON)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 23:50:18 +00:00
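A minimal sketch of the HuggingFace push in sync_hf.py, assuming hypothetical repo ids, card filenames, and directory layout; `HfApi.upload_file` and `HfApi.upload_folder` are real huggingface_hub calls.

```python
#!/usr/bin/env python3
"""Sketch of a sync_hf.py-style push to HuggingFace."""
from huggingface_hub import HfApi

api = HfApi()  # reads the token from `huggingface-cli login` / HF_TOKEN

# Hypothetical card-to-repo mapping; the real list lives in sync_hf.py.
CARDS = {
    "paper/hf-cards/Gemma3-27B.md": "Charon/LEK-Gemma3-27B",
}

for card_path, repo_id in CARDS.items():
    # Each model card becomes the README.md of its model repo.
    api.upload_file(
        path_or_fileobj=card_path,
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="model",
    )

# Benchmarks and training data go to a dataset repo (repo id is an assumption).
api.upload_folder(
    folder_path="results",
    repo_id="Charon/LEK-benchmarks",
    repo_type="dataset",
)
```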
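And a minimal sketch of the JSONL-to-Parquet conversion in export_parquet.py. The four column names come from the commit message; serializing `messages` as a JSON string rather than a nested Arrow type is an assumption about how the chat turns are flattened.

```python
#!/usr/bin/env python3
"""Sketch of export_parquet.py's JSONL-to-Parquet conversion."""
import json
import sys

import pyarrow as pa
import pyarrow.parquet as pq

def convert(jsonl_path: str, parquet_path: str) -> None:
    # Column layout per the commit message: prompt, response, system, messages.
    cols = {"prompt": [], "response": [], "system": [], "messages": []}
    with open(jsonl_path) as f:
        for line in f:
            row = json.loads(line)
            cols["prompt"].append(row.get("prompt", ""))
            cols["response"].append(row.get("response", ""))
            cols["system"].append(row.get("system", ""))
            # Nested chat turns kept as a JSON string column.
            cols["messages"].append(json.dumps(row.get("messages", [])))
    pq.write_table(pa.table(cols), parquet_path)

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])
```

HF datasets load Parquet natively, so the exported splits can be dropped straight into a dataset repo.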