FrozeBench

BBH

Reasoning

BBH (BIG-Bench Hard) is a curated suite of 23 challenging tasks drawn from the larger BIG-Bench collection, selected because pre-2022 LLMs scored below the average human rater on them. The tasks span algorithmic reasoning (boolean expressions, Dyck languages, word sorting), language understanding (causal judgement, disambiguation, hyperbaton), date and temporal reasoning, logical deduction, and multi-step inference, totalling roughly 6,500 examples. The original paper introduced BBH together with evidence that chain-of-thought (CoT) prompting yields substantial gains on these tasks over standard few-shot prompting, which makes BBH historically important as evidence for the CoT effect itself.

Source paper
Latest run: 2026-05-17

Benchmark results


28 models

Caveats

BBH is now effectively saturated: frontier models score above 90% on most of the 23 subtasks, which prompted the release of BIG-Bench Extra Hard (BBEH) in 2025 as a successor. The composite BBH score aggregates 23 disparate tasks scored with different metrics (exact match, symbolic match, multiple-choice scoring), so a single headline number obscures task-level weaknesses and cannot be read as a position on a uniform difficulty scale. The task selection is also a snapshot of what was hard for 2022-era models: several tasks that were challenging then are now trivially solved, so the difficulty mix reflects historical rather than contemporary failure modes. The set is static with no held-out variants, so contamination risk grows as the data circulates through training corpora. BBH is best treated as a legacy reasoning indicator, and as a way to compare few-shot, CoT, and zero-shot prompting on the same tasks, rather than as a discriminative frontier benchmark.
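As a rough illustration of why a single composite number can hide task-level weaknesses, the sketch below computes a macro average (unweighted mean of per-task accuracies) and a micro average (pooled over all examples), then lists the weakest task. The `TaskResult` type, the helper names, and every count are assumptions made up for this example. They are not FrozeBench's scoring code or real results, and the page does not state which averaging scheme its composite uses.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Per-task tally. Hypothetical structure, not a FrozeBench type."""
    name: str
    correct: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def macro_average(results: list[TaskResult]) -> float:
    """Unweighted mean of per-task accuracies: every task counts equally."""
    return sum(r.accuracy for r in results) / len(results)


def micro_average(results: list[TaskResult]) -> float:
    """Accuracy pooled over all examples: larger tasks weigh more."""
    return sum(r.correct for r in results) / sum(r.total for r in results)


def weakest_tasks(results: list[TaskResult], k: int = 1) -> list[TaskResult]:
    """Return the k tasks with the lowest accuracy."""
    return sorted(results, key=lambda r: r.accuracy)[:k]


# Invented counts, for illustration only.
results = [
    TaskResult("boolean_expressions", correct=98, total=100),
    TaskResult("word_sorting", correct=96, total=100),
    TaskResult("causal_judgement", correct=30, total=50),
]

print(f"macro: {macro_average(results):.3f}")
print(f"micro: {micro_average(results):.3f}")
for task in weakest_tasks(results):
    print(f"weakest: {task.name} ({task.accuracy:.2f})")
```

With these invented counts both averages sit well above the weakest task's accuracy, and they differ from each other because the tasks are not the same size. Reporting the per-task breakdown next to any composite avoids both problems.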

How to cite

Citation

FrozeBench. "BBH." https://frozebench.com/benchmarks/bbh. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_bbh,
  title = {BBH},
  howpublished = {\url{https://frozebench.com/benchmarks/bbh}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/bbh