Frozen evals · Open-weight LLMs

Reproducible evaluations of open-weight language models. Every score is one click from its samples.

Updated: 1 week ago · 28 models · 13 benchmarks

Featured benchmarks

Tasks we've run most thoroughly across open-weight models. Click a tile for the scoreboard; click any bar segment for that model's samples.

View leaderboard →

MMLU

MMLU (Massive Multitask Language Understanding) is a broad-coverage knowledge and reasoning benchmark spanning 57 subjects across STEM, the humanities, the social sciences, law, medicine, and other professional domains. It contains 14,079 four-choice multiple-choice test items plus a 1,540-item dev/validation split used for few-shot example selection, with difficulty calibrated from elementary up to advanced-professional level. Released in 2020, it became the de facto industry standard for measuring an LLM's breadth of world knowledge and remains one of the most widely cited LLM benchmarks despite its age.

28 models · 2026-05-26T00:39:08.035851Z

rank 1 → 5

GSM8K

GSM8K (Grade School Math 8K) is a dataset of about 8.5K linguistically diverse English-language grade-school math word problems, split into 7,473 training and 1,319 test items (8,792 in total), each requiring between two and eight elementary arithmetic steps to solve. The problems were authored by professional human problem-writers, with reference solutions that include natural-language reasoning chains in addition to the final numeric answer. GSM8K became the standard probe for whether a language model can chain together multiple arithmetic steps under natural-language framing, and it remains a near-mandatory entry on foundation-model evaluation cards.

28 models · 2026-05-19T03:25:18.539293Z

rank 1 → 5

IFEval

IFEval (Instruction-Following Evaluation) is a 541-prompt benchmark designed to measure how well a model obeys explicit, programmatically verifiable instructions in its output. Each prompt embeds one or more constraints drawn from a taxonomy of 25 instruction types including length constraints (word/sentence/paragraph counts), keyword inclusion or exclusion, formatting requirements (bullet points, JSON, title case), language-of-output constraints, and structural rules. Compliance is checked by deterministic regex and counting rules with no LLM judge or human annotator in the loop, which makes IFEval cheap and fully reproducible at the cost of checking only what those rules can express.
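
As a rough illustration of what such deterministic checks look like, the sketch below tests two constraints of the kind IFEval uses: a cap on the number of all-capital words, and a requirement that the whole response be wrapped in double quotation marks. The function names and the exact regex are assumptions made for this example; they are not IFEval's actual checker code.

```python
import re

# Hypothetical helpers for illustration; not the official IFEval checkers.


def count_all_caps_words(text: str) -> int:
    """Count words made only of capital letters (two or more letters)."""
    return len(re.findall(r"\b[A-Z]{2,}\b", text))


def capital_words_below(text: str, limit: int) -> bool:
    """True if all-capital words appear fewer than `limit` times."""
    return count_all_caps_words(text) < limit


def wrapped_in_double_quotes(text: str) -> bool:
    """True if the stripped response starts and ends with a double quote."""
    stripped = text.strip()
    return len(stripped) >= 2 and stripped[0] == '"' and stripped[-1] == '"'


def check_response(text: str) -> dict[str, bool]:
    """Run both rule-based checks and report each result separately."""
    return {
        "capital_words_below_10": capital_words_below(text, 10),
        "wrapped_in_double_quotes": wrapped_in_double_quotes(text),
    }
```

Because each check is a pure function of the response text, rerunning it on the same sample always gives the same verdict. The trade-off is that a rule like this says nothing about whether the content is any good.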

28 models · 2026-05-25T11:25:15.921931Z

rank 1 → 5

GPQA Diamond

GPQA Diamond is the 198-question hardest tier of GPQA, selected by two filters applied jointly: both expert annotators must agree on the correct answer, and at most one of three skilled non-expert validators with unrestricted web access could solve the question. The result is the highest-confidence "Google-proof" subset of the parent benchmark and the variant most commonly reported on frontier-model leaderboards and in research papers. It is the canonical headline number when GPQA is cited.
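
The two filters described above can be sketched as a single predicate. The record type and field names here are hypothetical and chosen only for this example; they do not mirror the GPQA dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class QuestionValidation:
    """Hypothetical record of one question's validation outcome."""

    experts_agree: bool       # both expert annotators agree on the answer
    non_expert_correct: int   # how many of the three non-experts solved it


def is_diamond(q: QuestionValidation) -> bool:
    """Apply both filters jointly: expert agreement AND at most one non-expert solve."""
    return q.experts_agree and q.non_expert_correct <= 1
```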

28 models · 2026-05-25T10:54:02.797737Z

rank 1 → 5

Recent runs

The latest eval runs, newest first.

View leaderboard →

MiniMax/MiniMax-M2.1-AWQ on MMLU-Pro · 2026-05-26T10:16:31.063838Z · 75.6% ± 0.4%
MiniMax/MiniMax-M2.1-AWQ on MMLU · 2026-05-26T00:39:08.035851Z · 25.5% ± 0.4%
MiniMax/MiniMax-M2.1-AWQ on MGSM · 2026-05-26T00:15:46.790352Z · 48.2% ± 0.8%
MiniMax/MiniMax-M2.1-AWQ on MBPP (Instruct) · 2026-05-25T23:33:04.620228Z · 0.0% ± 0.0%
MiniMax/MiniMax-M2.1-AWQ on MBPP · 2026-05-25T23:13:06.441864Z · 60.4% ± 2.2%
MiniMax/MiniMax-M2.1-AWQ on IFEval · 2026-05-25T11:25:15.921931Z · 35.3% ± 2.1%
MiniMax/MiniMax-M2.1-AWQ on GPQA Extended · 2026-05-25T10:54:02.797737Z · 23.4% ± 1.8%
MiniMax/MiniMax-M2.1-AWQ on GPQA Main · 2026-05-25T10:54:02.797737Z · 24.3% ± 2.0%
MiniMax/MiniMax-M2.1-AWQ on GPQA Diamond · 2026-05-25T10:54:02.797737Z · 28.8% ± 3.2%
MiniMax/MiniMax-M2.1-AWQ on EQ-Bench · 2026-05-25T04:17:35.623656Z · 52.5 ± 3.1
Qwen/Qwen3-235B-A22B-Thinking-AWQ-2507 on MBPP (Instruct) · 2026-05-25T04:07:27.878042Z · 0.0% ± 0.0%
MiniMax/MiniMax-M2.1-AWQ on GSM8K · 2026-05-19T03:25:18.539293Z · 70.4% ± 1.3%
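
The listing does not say how the ± values are computed. One common choice for accuracy metrics is the binomial standard error, sqrt(p(1 - p) / n). The sketch below assumes that formula, and assumes the GSM8K run scored all 1,319 test items; both are assumptions, not statements about how these runs were scored. Under those assumptions the result rounds to 1.3%, in line with the value shown for GSM8K.

```python
import math


def binomial_standard_error(accuracy: float, n_items: int) -> float:
    """Standard error of an accuracy estimate over n independent items.

    Assumes each item is an independent pass/fail trial; this is an
    illustrative formula, not necessarily the one used for the scores above.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


# GSM8K row: 70.4% accuracy, assumed 1,319 scored items.
se = binomial_standard_error(0.704, 1319)
print(f"{100 * se:.1f}%")
```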

Spotlight samples

Three hand-picked samples where open-weight models diverge on the same input.

correct · zai-org/GLM-4.5V-FP8

on mmlu_formal_logic · q.12

“Which of the following propositions is an immediate (one-step) consequence in PL of the given premises? E ⊃ ~F; ~F ⊃ G; ~G”

GLM-4.5V gets this modus tollens step right while missing harder truth-table questions, which suggests its formal-logic performance is brittle on multi-step derivations.

incorrect · Qwen/Qwen3-32B

on GSM8K · q.12

“Carlos is planting a lemon tree. The tree will cost $90 to plant. Each year it will grow 7 lemons, which he can sell for $1.5 each. It costs $3 a year to water and feed the tree. How many years will it take before he starts earning money on the lemon tree?”

Qwen3-32B hits finish_reason=length on this lemon-tree problem despite 94% overall accuracy: its chain of thought runs long on the break-even calculation and is cut off at the token limit.
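
For reference, the break-even arithmetic in this problem is short. The sketch below reads "starts earning money" as the first year in which the cumulative balance is strictly positive; that reading, and the function name, are assumptions made for this example.

```python
def first_profitable_year(
    planting_cost: float,
    lemons_per_year: int,
    price_per_lemon: float,
    upkeep_per_year: float,
) -> int:
    """Return the first year in which the cumulative balance is above zero."""
    net_per_year = lemons_per_year * price_per_lemon - upkeep_per_year
    if net_per_year <= 0:
        raise ValueError("the tree never pays for itself")

    balance = -planting_cost
    year = 0
    # Add one year of net income at a time until the balance turns positive.
    while balance <= 0:
        year += 1
        balance += net_per_year
    return year


# Net income is 7 * 1.5 - 3 = 7.5 per year, so the $90 cost is
# recovered exactly at the end of year 12 and profit begins in year 13.
print(first_profitable_year(90, 7, 1.5, 3))  # prints 13
```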

unknown · google/gemma-3-27b-it

on IFEval · q.7

“Write me a resume for Matthias Algiers. Use words with all capital letters to highlight key abilities, but make sure that words with all capital letters appear less than 10 times. Wrap the entire response with double quotation marks.”

Gemma-3-27B fails the 'wrap in double quotes' constraint while getting the content right, a sign that instruction-following breaks down on formatting meta-rules.

How FrozeBench works

Every run records seed, temperature, and config. Any sample link leads to a reproducible evaluation.
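
A minimal sketch of what such a per-run record might look like is shown below. The class, its field names, and the example values are all hypothetical; the actual record format is not specified here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of the settings needed to rerun one evaluation."""

    model: str
    benchmark: str
    seed: int
    temperature: float
    config: dict = field(default_factory=dict)  # any remaining run settings

    def to_json(self) -> str:
        """Serialize with sorted keys so identical runs give identical text."""
        return json.dumps(asdict(self), sort_keys=True)


# Example values are made up for illustration.
record = RunRecord(
    model="Qwen/Qwen3-32B",
    benchmark="GSM8K",
    seed=0,
    temperature=0.0,
    config={"max_tokens": 1024},
)
print(record.to_json())
```

Storing the record as sorted-key JSON makes two runs easy to compare: if the serialized text matches, the recorded settings match.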

Read our methodology →