MMLU-Pro

mmlu_pro

Knowledge, Reasoning

MMLU-Pro is a harder successor to MMLU containing 12,032 questions across 14 disciplines (Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Other). It expands the choice set from four to ten options to suppress guessing, removes trivial and noisy items from the original MMLU pool, and curates new questions designed to require multi-step chain-of-thought reasoning rather than pure recall. Sources include MMLU itself, STEM exam banks, and TheoremQA-style problems, with expert review applied throughout.
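The effect of widening the choice set on blind guessing can be made concrete with a short calculation. This is a minimal sketch that assumes every item has exactly the stated number of options and that a guesser picks uniformly at random; the helper name `chance_accuracy` is illustrative and not part of any benchmark tooling.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a single item."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return Fraction(1, num_options)


# Assumption: four options per item for MMLU, ten per item for MMLU-Pro.
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU chance accuracy:     {float(mmlu_chance):.0%}")      # 25%
print(f"MMLU-Pro chance accuracy: {float(mmlu_pro_chance):.0%}")  # 10%
```

Under these assumptions the random-guess floor falls from 25% to 10%, so a given raw score reflects less luck on MMLU-Pro than the same score would on MMLU.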

Latest run: 2026-05-26

Benchmark results

#	Model	exact_match
1	google/gemma-4-31B-it	84.9% ± 0.3%
2	google/Gemma-4-31B-IT-NVFP4	84.3% ± 0.3%
3	google/gemma-4-26B-A4B-it	82.3% ± 0.3%
4	Qwen/Qwen3-Next-80B-A3B-Instruct	80.8% ± 0.3%
5	nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4	80.5% ± 0.3%
6	MiniMax/MiniMax-M2.1-AWQ	75.6% ± 0.4%
7	microsoft/phi-4	71.8% ± 0.4%
8	Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8	70.3% ± 0.4%
9	Qwen/Qwen3-32B	70.0% ± 0.4%
10	google/gemma-3-27b-it	66.7% ± 0.4%
11	Qwen/Qwen3-14B	65.3% ± 0.4%
12	Qwen/Qwen3.5-35B-A3B	63.6% ± 0.4%
13	openai/gpt-oss-120b	62.2% ± 0.4%
14	Qwen/Qwen3.5-122B-A10B-NVFP4	60.4% ± 0.4%
15	google/gemma-3-12b-it	59.7% ± 0.4%
16	MiniMax/MiniMax-M2-AWQ	59.3% ± 0.4%
17	Qwen/Qwen3-8B	57.7% ± 0.4%
18	zai-org/GLM-4.5-Air-FP8	56.8% ± 0.4%
19	Qwen/Qwen3-4B-AWQ	55.7% ± 0.4%
20	Qwen/Qwen3-4B	54.3% ± 0.4%
21	microsoft/phi-4-mini-instruct	51.2% ± 0.4%
22	Qwen/Qwen3-235B-A22B-Thinking-AWQ-2507	44.7% ± 0.4%
23	openai/gpt-oss-20b	32.6% ± 0.4%
24	microsoft/phi-4-reasoning-plus	29.4% ± 0.4%
25	Qwen/Qwen3.6-27B	26.2% ± 0.3%
26	microsoft/phi-4-mini-reasoning	7.3% ± 0.2%
27	Qwen/Qwen3.6-35B-A3B	7.2% ± 0.2%
28	zai-org/GLM-4.5V-FP8	6.0% ± 0.2%

28 models
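When reading adjacent ranks, it helps to check whether the gap between two scores is large relative to their reported uncertainty. The sketch below assumes the ± figure is one standard error (the page does not say how it is computed) and treats the two runs as independent, which is a simplification because every model answers the same questions. The function name `z_for_difference` is illustrative only.

```python
import math


def z_for_difference(score_a: float, se_a: float,
                     score_b: float, se_b: float) -> float:
    """z statistic for score_a - score_b, treating the two runs as independent."""
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return (score_a - score_b) / combined_se


# Scores and ± values copied from the table, in percentage points.
# Assumption: each ± is one standard error.
z_top_two = z_for_difference(84.9, 0.3, 84.3, 0.3)    # rank 1 vs rank 2
z_one_three = z_for_difference(84.9, 0.3, 82.3, 0.3)  # rank 1 vs rank 3

print(f"rank 1 vs rank 2: z = {z_top_two:.2f}")    # 1.41
print(f"rank 1 vs rank 3: z = {z_one_three:.2f}")  # 6.13
```

Under these assumptions the top two entries are separated by about 1.4 combined standard errors, below the conventional 1.96 threshold, while the gap between ranks 1 and 3 is far larger. A paired test on per-question results would be the more appropriate tool where that data is available.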

Caveats

MMLU-Pro is explicitly chain-of-thought-dependent: scores drop sharply without CoT prompting and are not directly comparable to vanilla few-shot MMLU numbers, so any cross-benchmark comparison must control for the prompting protocol. Because the question pool was assembled by filtering and augmenting several pre-existing datasets and then applying expert curation, provenance and difficulty vary across disciplines, and per-discipline results should not be assumed to be calibrated against each other. Although the benchmark was less saturated than MMLU at release, frontier models already exceed 70-75% (the top five entries in the table above score over 80%), which narrows the discriminative window at the top of the leaderboard. Expanding from four to ten options reduces but does not eliminate exploitation of the multiple-choice format: position bias and elimination strategies still apply. The benchmark remains English-only.
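One simple way to look for position bias is to compare how often a model selects each option letter with how often that letter is actually correct. The sketch below uses hypothetical toy data, not real benchmark output, and assumes options are labelled A through J; `letter_shares` is an illustrative helper, not part of any evaluation harness.

```python
from collections import Counter

LETTERS = "ABCDEFGHIJ"  # assumption: ten options labelled A-J


def letter_shares(answers):
    """Fraction of answers that fall on each option letter."""
    counts = Counter(answers)
    total = len(answers)
    return {letter: counts.get(letter, 0) / total for letter in LETTERS}


# Hypothetical toy data: 20 gold labels spread evenly, 20 model predictions.
gold = list("ABCDEFGHIJABCDEFGHIJ")
predicted = list("AACDAFGAIJABADEAGHAJ")

gold_shares = letter_shares(gold)
pred_shares = letter_shares(predicted)

# Letter the model over-selects most, relative to how often it is correct.
most_skewed = max(LETTERS, key=lambda letter: pred_shares[letter] - gold_shares[letter])
print(most_skewed, pred_shares[most_skewed], gold_shares[most_skewed])
```

In this toy example the model picks "A" far more often than "A" is the correct answer. A large gap of this kind on real outputs would suggest that position, rather than content, is driving some of the predictions.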

How to cite

Citation

FrozeBench. "MMLU-Pro." https://frozebench.com/benchmarks/mmlu-pro. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_mmlu_pro,
  title = {MMLU-Pro},
  howpublished = {\url{https://frozebench.com/benchmarks/mmlu-pro}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/mmlu-pro