FrozeBench

GSM8K

gsm8k
MathReasoning

GSM8K (Grade School Math 8K) is a dataset of roughly 8,500 linguistically diverse English-language grade-school math word problems. The released splits contain 7,473 training and 1,319 test items (8,792 in total), each requiring between two and eight elementary arithmetic steps to solve. The problems were authored by professional human problem writers, and each reference solution includes a natural-language reasoning chain in addition to the final numeric answer. GSM8K became the standard probe for whether a language model can chain together multiple arithmetic steps under natural-language framing, and it remains a near-mandatory entry on foundation-model evaluation cards.
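Because each reference solution ends in a final numeric answer, scoring usually reduces to an exact match on that number. The following is a minimal sketch, assuming the publicly released format in which the final answer follows a `####` marker at the end of the solution text. The helper names `extract_final_answer` and `exact_match_accuracy` are hypothetical and do not come from any official harness, and the sketch assumes model outputs are formatted the same way as the references.

```python
import re
from typing import Optional

# Assumption: a solution ends with a line such as "#### 72".
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after the last '####' marker, with commas stripped."""
    matches = FINAL_ANSWER.findall(solution)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of items whose extracted final answers are numerically equal."""
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_final_answer(pred), extract_final_answer(ref)
        # Compare as floats so "72" and "72.0" count as the same answer.
        if p is not None and r is not None and float(p) == float(r):
            correct += 1
    return correct / len(references) if references else 0.0
```

Comparing as floats rather than strings is a design choice made here so that formatting differences such as thousands separators or a trailing `.0` do not count as errors. Real evaluation harnesses differ in how strictly they parse model output, which is one reason reported scores are not always directly comparable.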

Source paper
Latest run: 2026-05-19

Benchmark results


28 models

Caveats

Frontier models now score above 95% on GSM8K, so the benchmark no longer differentiates leading systems and serves mostly as a sanity check that a model can do basic chained arithmetic at all.

Beyond saturation, multiple lines of evidence suggest that reported scores overstate genuine reasoning ability. Mirzadeh et al. (2024, GSM-Symbolic) showed that changing the names or, more strongly, the numeric values in templated GSM8K problems lowers accuracy and produces noticeable variance across models, which is consistent with pattern matching rather than reasoning. Zhang et al. (2024, GSM1k) constructed a contamination-free parallel set in the same style and observed accuracy drops of up to ~8pp for some model families, indicating systematic overfitting to the public test split in those families.

The scope is also narrow. GSM8K is arithmetic-only and does not probe symbolic, algebraic, or proof-style mathematical reasoning. It is English-only, with no multilingual variant inside the dataset itself (MGSM exists for that purpose). High GSM8K scores should not be read as evidence of mathematical capability beyond chained arithmetic.
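The perturbation idea behind GSM-Symbolic can be illustrated with a small template. This is a sketch only and not the authors' code: the template, the name pool, and the value ranges are invented for illustration, and `make_variant` and `drop_in_points` are hypothetical helpers. It also shows how a gap between two accuracies is expressed in percentage points (pp).

```python
import random

# Hypothetical template: names and numbers are placeholders,
# and the answer is recomputed from the sampled values.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{name} then gives away {c} apples. How many apples are left?"
)
NAMES = ["Ava", "Noor", "Mateo", "Yuki"]  # illustrative name pool


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Sample one surface variant of the problem and its ground-truth answer."""
    a, b = rng.randint(5, 40), rng.randint(5, 40)
    c = rng.randint(1, a + b)  # keeps the answer non-negative
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b, c=c)
    return question, a + b - c


def drop_in_points(original_acc: float, variant_acc: float) -> float:
    """Accuracy gap in percentage points, given accuracies in [0, 1]."""
    return round((original_acc - variant_acc) * 100, 2)


rng = random.Random(0)  # fixed seed so the sampled variants are reproducible
variants = [make_variant(rng) for _ in range(3)]
```

Every variant requires the same two arithmetic steps, so a model that genuinely reasons through the problem should score about the same on all of them. A large spread across variants, or a gap between the original and the variants, is the kind of signal the caveats above describe.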

How to cite

Citation

FrozeBench. "GSM8K." https://frozebench.com/benchmarks/gsm8k. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_gsm8k,
  title = {GSM8K},
  howpublished = {\url{https://frozebench.com/benchmarks/gsm8k}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/gsm8k