GPQA Diamond
GPQA Diamond is the 198-question hardest tier of GPQA, selected by two filters applied jointly: both expert annotators must agree on the correct answer, and at most one of three skilled non-expert validators with unrestricted web access may have answered the question correctly. The result is the highest-confidence "Google-proof" subset of the parent benchmark and the variant most commonly reported on frontier-model leaderboards and in research papers. Its score is the canonical headline number when GPQA is cited.
Source paper
Latest run: 2026-05-25
Benchmark results
Caveats
At N=198 the sampling variance is substantial. One additional correct answer moves the score by approximately 0.5 percentage points, so reported gaps under roughly 3pp should be treated as within sampling noise unless they are accompanied by confidence intervals or multi-seed runs. Many published comparisons provide neither, so cross-paper rankings on GPQA Diamond should be read with caution. The expert-agreement filter that produces the Diamond tier may also introduce a selection effect: by requiring two experts to agree, the subset may favor questions with relatively unambiguous textbook answers and exclude genuinely frontier-research edge cases where experts legitimately disagree. Contamination risk is the same as for GPQA Main, since the Diamond split was published simultaneously. Top reasoning models now exceed 85%, above the ~74% adjusted expert ceiling, which suggests the benchmark is at or near saturation.
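The sampling-noise argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each question is an independent Bernoulli trial, and the Wilson score interval is one reasonable choice of confidence interval, not a method prescribed by the benchmark. The function names are hypothetical.

```python
import math

N = 198  # number of questions in GPQA Diamond


def per_question_step_pp(n: int = N) -> float:
    """Change in accuracy, in percentage points, from one extra correct answer."""
    return 100.0 / n


def binomial_se_pp(accuracy: float, n: int = N) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


def wilson_interval(correct: int, n: int = N, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p = correct / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    print(f"one question = {per_question_step_pp():.2f} pp")
    print(f"standard error at 85% accuracy = {binomial_se_pp(0.85):.2f} pp")
    low, high = wilson_interval(correct=168)  # 168/198 is about 84.8%
    print(f"95% Wilson interval for 168/198: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions a single run at about 85% accuracy has a standard error of roughly 2.5 percentage points, which is why gaps of a few points between two single-run scores are hard to distinguish from noise.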
How to cite
Citation
FrozeBench. "GPQA Diamond." https://frozebench.com/benchmarks/gpqa-diamond. Retrieved 2026-06-04.
BibTeX
@misc{frozebench_gpqa_diamond,
title = {GPQA Diamond},
howpublished = {\url{https://frozebench.com/benchmarks/gpqa-diamond}},
year = {2026},
note = {FrozeBench. Retrieved 2026-06-04.}
}
URL
https://frozebench.com/benchmarks/gpqa-diamond