BBH

Reasoning

BBH (BIG-Bench Hard) is a curated suite of 23 challenging tasks extracted from the larger BIG-Bench collection, selected because pre-2022 LLMs underperformed the average human rater on them. The tasks span algorithmic reasoning (boolean expressions, Dyck-language tracking, word sorting), language understanding (causal judgement, disambiguation, hyperbaton), date and temporal reasoning, logical deduction, and multi-step inference, totalling roughly 6,500 examples. The original paper introduced BBH and showed that chain-of-thought (CoT) prompting yields substantial gains on these tasks relative to standard few-shot prompting, which makes BBH historically important as evidence for the CoT phenomenon itself.

Latest run: 2026-05-17

Benchmark results

#	Model	exact_match
1	google/Gemma-4-31B-IT-NVFP4	79.7% ± 0.4%
2	google/gemma-4-31B-it	78.9% ± 0.3%
3	Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8	75.8% ± 0.4%
4	Qwen/Qwen3-4B-AWQ	72.7% ± 0.5%
5	google/gemma-3-27b-it	62.5% ± 0.4%
6	google/gemma-3-12b-it	62.3% ± 0.4%
7	Qwen/Qwen3.5-122B-A10B-NVFP4	49.7% ± 0.5%
8	Qwen/Qwen3.5-35B-A3B	49.6% ± 0.5%
9	google/gemma-4-26B-A4B-it	48.1% ± 0.4%
10	Qwen/Qwen3-32B	47.0% ± 0.4%
11	Qwen/Qwen3-14B	41.8% ± 0.5%
12	microsoft/phi-4-mini-instruct	40.0% ± 0.5%
13	Qwen/Qwen3.6-27B	31.2% ± 0.5%
14	Qwen/Qwen3-235B-A22B-Thinking-AWQ-2507	27.8% ± 0.4%
15	Qwen/Qwen3.6-35B-A3B	24.1% ± 0.4%
16	nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4	18.1% ± 0.4%
17	Qwen/Qwen3-8B	13.2% ± 0.3%
18	microsoft/phi-4-reasoning-plus	3.3% ± 0.2%
19	microsoft/phi-4-mini-reasoning	2.9% ± 0.2%
20	zai-org/GLM-4.5V-FP8	2.7% ± 0.2%
21	Qwen/Qwen3-Next-80B-A3B-Instruct	2.3% ± 0.2%
22	openai/gpt-oss-20b	1.5% ± 0.1%
23	microsoft/phi-4	0.4% ± 0.1%
24	MiniMax/MiniMax-M2-AWQ	0.1% ± 0.0%
25	MiniMax/MiniMax-M2.1-AWQ	0.1% ± 0.0%
26	openai/gpt-oss-120b	0.0% ± 0.0%
27	zai-org/GLM-4.5-Air-FP8	0.0% ± 0.0%
28	Qwen/Qwen3-4B	0.0% ± 0.0%
28	Qwen/Qwen3-4B	0.0%±0.0%	View run →

28 models

Caveats

BBH is now effectively saturated: frontier models score above 90% on most of the 23 subtasks, which prompted the release of BIG-Bench Extra Hard (BBEH) in 2025 as a successor. The composite BBH score aggregates 23 disparate tasks scored under different metric definitions (exact match, symbolic match, multiple-choice scoring), so a single headline number obscures task-level weaknesses and cannot be read as a uniform difficulty scale. The task selection itself is a snapshot of what was hard for 2022-era models. Several tasks that were challenging then are now trivially solved by current models, which biases the difficulty mix toward historical rather than contemporary failure modes. The set is also static with no held-out variants, so contamination risk grows over time as the data circulates through training corpora. BBH is best treated as a legacy reasoning indicator and a decomposition tool across its few-shot, CoT, and zero-shot variants rather than as a discriminative frontier benchmark.
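
To make the point about a single headline number concrete, the following is a minimal Python sketch, not the scoring code used for the table on this page. The per-task counts are invented for illustration, and the helper names (`macro_average`, `micro_average`, `weakest_task`) are hypothetical. It shows that averaging over tasks and pooling over examples can give different composite scores, and that either composite can hide a weak task.

```python
from statistics import mean

# Hypothetical per-task results as (correct, total).
# These counts are illustrative only; they are NOT real BBH scores.
task_results = {
    "boolean_expressions": (240, 250),
    "word_sorting": (100, 250),
    "causal_judgement": (60, 187),
}


def task_accuracy(results, task):
    """Accuracy on one task."""
    correct, total = results[task]
    return correct / total


def macro_average(results):
    """Unweighted mean of per-task accuracies (each task counts equally)."""
    return mean(task_accuracy(results, t) for t in results)


def micro_average(results):
    """Accuracy pooled over all examples (larger tasks weigh more)."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return correct / total


def weakest_task(results):
    """Name of the task with the lowest accuracy."""
    return min(results, key=lambda t: task_accuracy(results, t))


if __name__ == "__main__":
    print(f"macro average: {macro_average(task_results):.3f}")
    print(f"micro average: {micro_average(task_results):.3f}")
    print(f"weakest task:  {weakest_task(task_results)}")
    for task in task_results:
        print(f"  {task}: {task_accuracy(task_results, task):.3f}")
```

Reporting the per-task breakdown next to whichever composite is used keeps a near-ceiling task from masking a task on which the model still fails.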

How to cite

Citation

FrozeBench. "BBH." https://frozebench.com/benchmarks/bbh. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_bbh,
  title = {BBH},
  howpublished = {\url{https://frozebench.com/benchmarks/bbh}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/bbh