FrozeBench

GPQA Diamond

gpqa_diamond
Knowledge · Science · Reasoning

GPQA Diamond is the 198-question hardest tier of GPQA, selected by two filters applied jointly: both expert annotators must agree on the correct answer, and at most one of three skilled non-expert validators with unrestricted web access could solve the question. The result is the highest-confidence "Google-proof" subset of the parent benchmark and the variant most commonly reported on frontier-model leaderboards and in research papers. It is the canonical headline number when GPQA is cited.

Source paper · Latest run: 2026-05-25

Benchmark results


28 models

Caveats

At N=198, sampling variance is substantial. One additional correct answer moves the score by approximately 0.5 percentage points, so reported gaps under roughly 3pp (about one binomial standard error at this sample size) should be treated as within sampling noise unless they come with confidence intervals or multi-seed runs. Many published comparisons provide neither, so cross-paper rankings on GPQA Diamond should be read with caution. The expert-agreement filter that produces the Diamond tier may also introduce a selection effect: by requiring two experts to agree, the subset may favor questions with relatively unambiguous textbook answers and exclude genuinely frontier-research edge cases where experts legitimately disagree. Contamination risk is the same as for GPQA Main, since the Diamond split was published simultaneously. Top reasoning models now exceed 85%, above the ~74% adjusted expert ceiling, which suggests the benchmark is at or near saturation.
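The sampling-noise arithmetic in this caveat can be reproduced with a few lines of Python. This is a minimal sketch, not FrozeBench's scoring code: the function names are hypothetical, it treats each question as an independent Bernoulli trial, and it uses the standard Wilson score interval as one reasonable choice of confidence interval.

```python
import math

N = 198  # GPQA Diamond question count, taken from the description above


def swing_pp(n=N):
    """Percentage-point change caused by one additional correct answer."""
    return 100.0 / n


def standard_error_pp(accuracy, n=N):
    """Binomial standard error of an accuracy estimate, in percentage points.

    Assumes questions are independent trials (an idealisation).
    """
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


def wilson_interval(correct, n=N, z=1.96):
    """Wilson score interval for a proportion; returns (low, high) as fractions.

    z=1.96 corresponds to an approximate 95% interval.
    """
    p = correct / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    print(f"one question = {swing_pp():.2f} pp")
    print(f"standard error at 85% = {standard_error_pp(0.85):.2f} pp")
    low, high = wilson_interval(168)  # 168/198 is an illustrative score
    print(f"95% Wilson interval for 168/198: {100 * low:.1f}% to {100 * high:.1f}%")
```

At 85% accuracy the standard error is about 2.5pp, and the 95% interval for a single run spans roughly ten percentage points, which is why single-run gaps of a few points between models are hard to interpret.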

How to cite

Citation

FrozeBench. "GPQA Diamond." https://frozebench.com/benchmarks/gpqa-diamond. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_gpqa_diamond,
  title = {GPQA Diamond},
  howpublished = {\url{https://frozebench.com/benchmarks/gpqa-diamond}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/gpqa-diamond