GPQA Diamond

gpqa_diamond

Knowledge, Science, Reasoning

GPQA Diamond is the 198-question hardest tier of GPQA, selected by two filters applied jointly: both expert annotators must agree on the correct answer, and at most one of the three skilled non-expert validators, working with unrestricted web access, may have answered the question correctly. The result is the highest-confidence "Google-proof" subset of the parent benchmark and the variant most commonly reported on frontier-model leaderboards and in research papers, which makes it the canonical headline number when GPQA is cited.
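The two-part selection rule can be sketched in a few lines. This is only an illustration of the filter as described above, not the benchmark authors' actual pipeline: the `QuestionRecord` type and its field names are hypothetical, and "experts agree on the correct answer" is modelled here as both experts being marked correct.

```python
from dataclasses import dataclass


@dataclass
class QuestionRecord:
    # Hypothetical record layout, introduced for illustration only.
    expert_correct: tuple[bool, bool]            # one flag per expert annotator
    non_expert_correct: tuple[bool, bool, bool]  # one flag per non-expert validator


def is_diamond(q: QuestionRecord) -> bool:
    """Return True when a question passes both filters described above."""
    both_experts_agree = all(q.expert_correct)
    # "At most one of three" non-experts may have answered correctly.
    hard_for_non_experts = sum(q.non_expert_correct) <= 1
    return both_experts_agree and hard_for_non_experts


if __name__ == "__main__":
    kept = QuestionRecord((True, True), (True, False, False))
    dropped = QuestionRecord((True, True), (True, True, False))
    print(is_diamond(kept), is_diamond(dropped))
```

Both conditions must hold at once, which is why the Diamond tier is a strict subset of the parent benchmark.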

Latest run: 2026-05-25

Benchmark results

#	Model	acc
1	google/Gemma-4-31B-IT-NVFP4	52.5%±3.6%
2	google/gemma-4-31B-it	52.0%±3.6%
3	google/gemma-4-26B-A4B-it	41.4%±3.5%
4	google/gemma-3-27b-it	39.4%±3.5%
5	Qwen/Qwen3-Next-80B-A3B-Instruct	38.9%±3.5%
6	microsoft/phi-4	35.9%±3.4%
7	google/gemma-3-12b-it	35.9%±3.4%
8	microsoft/phi-4-mini-instruct	33.8%±3.4%
9	Qwen/Qwen3-4B-AWQ	32.3%±3.3%
10	Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8	30.8%±3.3%
11	nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4	29.8%±3.3%
12	Qwen/Qwen3.5-35B-A3B	29.8%±3.3%
13	Qwen/Qwen3-32B	29.3%±3.2%
14	Qwen/Qwen3-14B	29.3%±3.2%
15	openai/gpt-oss-120b	28.8%±3.2%
16	Qwen/Qwen3.6-27B	28.8%±3.2%
17	MiniMax/MiniMax-M2.1-AWQ	28.8%±3.2%
18	Qwen/Qwen3.5-122B-A10B-NVFP4	28.3%±3.2%
19	zai-org/GLM-4.5V-FP8	28.3%±3.2%
20	Qwen/Qwen3-8B	27.8%±3.2%
21	microsoft/phi-4-reasoning-plus	27.8%±3.2%
22	Qwen/Qwen3-4B	26.8%±3.2%
23	openai/gpt-oss-20b	26.8%±3.2%
24	Qwen/Qwen3.6-35B-A3B	26.3%±3.1%
25	MiniMax/MiniMax-M2-AWQ	25.8%±3.1%
26	microsoft/phi-4-mini-reasoning	25.8%±3.1%
27	Qwen/Qwen3-235B-A22B-Thinking-AWQ-2507	25.3%±3.1%
28	zai-org/GLM-4.5-Air-FP8	24.2%±3.1%

28 models

Caveats

At N=198, sampling variance is large. One additional correct answer moves the score by about 0.5 percentage points (1/198), and the ± reported alongside each score above is 3.1pp to 3.6pp, so gaps under roughly 3pp should be treated as within sampling noise unless they are backed by confidence intervals or multi-seed runs. Many published comparisons provide neither, so cross-paper rankings on GPQA Diamond should be read with caution. The expert-agreement filter that produces the Diamond tier may also introduce a selection effect: by requiring two experts to agree, the subset may favor questions with relatively unambiguous textbook answers and exclude genuinely frontier-research edge cases where experts legitimately disagree. Contamination risk is the same as for GPQA Main, since the Diamond split was published simultaneously. Top reasoning models now exceed 85%, surpassing the ~74% adjusted expert ceiling and signalling saturation.
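The noise argument can be checked with a short calculation. The sketch below is a rough illustration, not this site's scoring code. It assumes the ± is a binomial standard error with an n-1 denominator, a choice that happens to reproduce the published ± values but is not confirmed by the source. The correct-answer counts (104, 82, 78) are inferred from the rounded percentages, and the two scores are treated as independent samples, which ignores that both models answer the same 198 questions.

```python
import math

N_QUESTIONS = 198  # size of the Diamond split


def accuracy_stderr(correct: int, n: int = N_QUESTIONS) -> float:
    """Standard error of an accuracy estimate (assumed n-1 denominator)."""
    p = correct / n
    return math.sqrt(p * (1 - p) / (n - 1))


def gap_z_score(correct_a: int, correct_b: int, n: int = N_QUESTIONS) -> float:
    """z-score of the accuracy gap, treating the two runs as independent."""
    diff = (correct_a - correct_b) / n
    combined = math.sqrt(
        accuracy_stderr(correct_a, n) ** 2 + accuracy_stderr(correct_b, n) ** 2
    )
    return diff / combined


# Effect of a single question, in percentage points.
one_question_pp = 100 / N_QUESTIONS

if __name__ == "__main__":
    print(f"one question = {one_question_pp:.2f}pp")
    print(f"stderr at 104/198 = {100 * accuracy_stderr(104):.1f}pp")
    # Inferred counts: 82/198 is about 41.4%, 78/198 is about 39.4%.
    print(f"z for 41.4% vs 39.4% = {gap_z_score(82, 78):.2f}")
    # Inferred counts: 104/198 is about 52.5%.
    print(f"z for 52.5% vs 41.4% = {gap_z_score(104, 82):.2f}")
```

Under these assumptions a 2pp gap falls well short of the conventional 1.96 threshold, while the 11pp gap between the top score and the third-ranked score clears it. With per-question results available, a paired test such as McNemar's would be the more appropriate comparison.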

How to cite

Citation

FrozeBench. "GPQA Diamond." https://frozebench.com/benchmarks/gpqa-diamond. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_gpqa_diamond,
  title = {GPQA Diamond},
  howpublished = {\url{https://frozebench.com/benchmarks/gpqa-diamond}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/gpqa-diamond