MMLU

Knowledge, Reasoning

MMLU (Massive Multitask Language Understanding) is a broad-coverage knowledge and reasoning benchmark spanning 57 subjects across STEM, the humanities, the social sciences, law, medicine, and other professional domains. It contains 14,079 four-option multiple-choice test items, a 1,540-item validation split, and a small development split of five questions per subject that supplies the few-shot examples. Difficulty ranges from elementary to advanced professional level. Released in 2020, it became the de facto industry standard for measuring an LLM's breadth of world knowledge and remains one of the most widely cited LLM benchmarks despite its age.
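To make the item format concrete, here is a minimal sketch of how a four-option question and a few-shot prompt might be assembled. The `Item` class, the `build_prompt` helper, and the sample questions are hypothetical illustrations. The header wording follows the commonly used MMLU prompt style, but it is an assumption that this leaderboard's harness uses the same template.

```python
from dataclasses import dataclass

LETTERS = "ABCD"  # MMLU items always have exactly four options


@dataclass
class Item:
    """Hypothetical container for one MMLU-style question."""
    question: str
    choices: list[str]  # four answer options, in order
    answer: int         # index of the correct option, 0..3


def format_item(item: Item, with_answer: bool) -> str:
    """Render one question; few-shot examples include the answer letter."""
    lines = [item.question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item.choices)]
    lines.append("Answer:" + (f" {LETTERS[item.answer]}" if with_answer else ""))
    return "\n".join(lines)


def build_prompt(subject: str, shots: list[Item], target: Item) -> str:
    """Concatenate solved few-shot examples followed by the unsolved target."""
    # Assumed header wording; a given evaluation harness may differ.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_item(s, True) for s in shots] + [format_item(target, False)]
    return header + "\n\n" + "\n\n".join(blocks)


# Made-up items, for illustration only.
shot = Item("What is 2 + 2?", ["3", "4", "5", "6"], 1)
target = Item("What is 3 + 3?", ["5", "6", "7", "8"], 1)
print(build_prompt("elementary mathematics", [shot], target))
```

The model is then scored on whether the letter it produces after the final `Answer:` matches the key.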

Latest run: 2026-05-26

Benchmark results

#	Model	Accuracy
1	Qwen/Qwen3-235B-A22B-Thinking-AWQ-2507	84.7%±0.3%
2	Qwen/Qwen3.6-27B	84.5%±0.3%
3	google/gemma-4-31B-it	82.7%±0.3%
4	google/Gemma-4-31B-IT-NVFP4	82.1%±0.3%
5	MiniMax/MiniMax-M2-AWQ	81.6%±0.3%
6	zai-org/GLM-4.5V-FP8	81.0%±0.3%
7	Qwen/Qwen3-32B	80.8%±0.3%
8	microsoft/phi-4-reasoning-plus	77.6%±0.3%
9	Qwen/Qwen3-14B	77.2%±0.3%
10	microsoft/phi-4	76.3%±0.3%
11	google/gemma-3-27b-it	74.1%±0.4%
12	Qwen/Qwen3-8B	73.0%±0.4%
13	google/gemma-3-12b-it	70.7%±0.4%
14	Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8	68.9%±0.4%
15	Qwen/Qwen3-4B	68.4%±0.4%
16	Qwen/Qwen3-4B-AWQ	67.2%±0.4%
17	microsoft/phi-4-mini-instruct	66.3%±0.4%
18	Qwen/Qwen3-Next-80B-A3B-Instruct	64.0%±0.4%
19	zai-org/GLM-4.5-Air-FP8	57.5%±0.4%
20	microsoft/phi-4-mini-reasoning	57.3%±0.4%
21	openai/gpt-oss-20b	56.4%±0.4%
22	google/gemma-4-26B-A4B-it	47.9%±0.4%
23	Qwen/Qwen3.6-35B-A3B	37.9%±0.4%
24	Qwen/Qwen3.5-35B-A3B	37.9%±0.4%
25	nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4	36.2%±0.4%
26	Qwen/Qwen3.5-122B-A10B-NVFP4	27.1%±0.4%
27	MiniMax/MiniMax-M2.1-AWQ	25.5%±0.4%
28	openai/gpt-oss-120b	24.7%±0.4%

28 models
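As a reading aid for the ± values and the near-25% scores in the table, the sketch below computes a simple binomial standard error and a z-score against the 25% uniform-guessing baseline implied by four answer options. This is an approximation under stated assumptions: it treats all 14,079 test items as independent and pooled. The page does not say how its ± values are computed, so this formula lands in the same 0.3 to 0.4 point range but should not be read as the site's method.

```python
import math

N_TEST = 14_079  # test-set size stated in the benchmark overview
CHANCE = 0.25    # expected accuracy when guessing uniformly among four options


def binomial_se(acc: float, n: int = N_TEST) -> float:
    """Standard error of an accuracy estimate over n independent items."""
    return math.sqrt(acc * (1.0 - acc) / n)


def z_vs_chance(acc: float, n: int = N_TEST) -> float:
    """z-score of an observed accuracy under the uniform-guessing null."""
    return (acc - CHANCE) / binomial_se(CHANCE, n)


# Accuracies taken from the first and last rows of the table.
for label, acc in [("rank 1", 0.847), ("rank 28", 0.247)]:
    print(f"{label}: acc={acc:.1%}  se={binomial_se(acc):.2%}  z={z_vs_chance(acc):+.1f}")
```

Under this approximation, a score within about two standard errors of 25% is statistically indistinguishable from uniform guessing, which is worth keeping in mind when reading the bottom rows of the ranking.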

Caveats

Label noise is a meaningful and uneven concern. Gema et al. (2024, "Are We Done with MMLU?") manually re-annotated 5,700 questions and flagged roughly 6.5% as erroneous (wrong answer keys, ambiguous options, or duplicated stems), with the error rate varying sharply by subject: about 57% of the reviewed Virology questions were flagged. Fine-grained subject-level comparisons across models are therefore unreliable unless they account for per-subject label quality. The benchmark is also effectively saturated: frontier models now score above 90%, leaving little discriminative headroom near the top. Because the test set has been public since 2020, it is likely present in many modern training corpora, so high scores may partly reflect memorization rather than capability. The four-option multiple-choice format can be gamed through option-position bias and process-of-elimination heuristics that do not require real understanding, and the benchmark is English-only. Treat MMLU as a coarse breadth indicator rather than a discriminative measure of frontier capability.
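One way to account for per-subject label quality is to rescore a run on only the items that were not flagged as erroneous. The sketch below is a minimal illustration of that idea, not a procedure this site is known to use. The record layout (`subject`, `correct`, `flagged`) and the toy data are hypothetical, and in practice the flags would have to come from a re-annotation effort such as the one cited above.

```python
from collections import defaultdict


def accuracy_by_subject(records, exclude_flagged=False):
    """Per-subject accuracy, optionally skipping items flagged as erroneous.

    Each record is a dict with keys 'subject' (str), 'correct' (bool),
    and 'flagged' (bool). This layout is an assumption for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if exclude_flagged and r["flagged"]:
            continue  # drop items whose answer key or wording is suspect
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["correct"])
    return {subject: hits[subject] / totals[subject] for subject in totals}


# Made-up records: half of the "virology" items are flagged.
records = [
    {"subject": "virology", "correct": False, "flagged": True},
    {"subject": "virology", "correct": True, "flagged": False},
    {"subject": "virology", "correct": False, "flagged": True},
    {"subject": "virology", "correct": True, "flagged": False},
    {"subject": "astronomy", "correct": True, "flagged": False},
    {"subject": "astronomy", "correct": False, "flagged": False},
]

raw = accuracy_by_subject(records)
clean = accuracy_by_subject(records, exclude_flagged=True)
for subject in raw:
    print(f"{subject}: raw={raw[subject]:.0%} clean={clean[subject]:.0%}")
```

A large gap between the raw and clean score for a subject suggests that the raw number is dominated by label quality rather than by model ability.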

How to cite

Citation

FrozeBench. "MMLU." https://frozebench.com/benchmarks/mmlu. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_mmlu,
  title = {MMLU},
  howpublished = {\url{https://frozebench.com/benchmarks/mmlu}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/mmlu