LongBench

Long Context Understanding

LongBench is the first major bilingual long-context benchmark for LLMs, comprising 21 datasets organized into six task families: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks (e.g. passage retrieval and counting), and code completion. The English splits average roughly 6,700 words per input and the Chinese splits roughly 13,400 characters, with individual examples reaching tens of thousands of tokens. LongBench was designed to evaluate whether models can actually use long contexts rather than merely accept them as input, and it became one of the most widely cited long-context benchmarks following its 2023 release.

Source paper

Latest run: 2026-05-18

Benchmark results

#	Model	Score	Metric label
1	Qwen/Qwen3-Next-80B-A3B-Instruct	49.2% ± 0.5%	score
2	google/gemma-3-12b-it	46.2% ± 0.5%	aggregate
3	google/Gemma-4-31B-IT-NVFP4	42.5% ± 0.5%	score
4	microsoft/phi-4-mini-instruct	40.9% ± 0.5%	aggregate
5	google/gemma-4-26B-A4B-it	40.2% ± 0.5%	score
6	google/gemma-3-27b-it	37.0% ± 0.5%	aggregate
7	google/gemma-4-31B-it	37.0% ± 0.4%	score
8	Qwen/Qwen3.6-27B	36.6% ± 0.4%	score
9	Qwen/Qwen3.5-122B-A10B-NVFP4	35.5% ± 0.4%	score
10	nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4	35.2% ± 0.4%	score
11	Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8	32.4% ± 0.4%	score
12	Qwen/Qwen3-8B	32.1% ± 0.4%	aggregate
13	Qwen/Qwen3-4B-AWQ	31.9% ± 0.4%	aggregate
14	Qwen/Qwen3.5-35B-A3B	31.9% ± 0.4%	score
15	microsoft/phi-4	31.7% ± 0.4%	aggregate
16	Qwen/Qwen3.6-35B-A3B	31.5% ± 0.4%	score
17	zai-org/GLM-4.5V-FP8	27.3% ± 0.4%	aggregate
18	MiniMax/MiniMax-M2-AWQ	19.0% ± 0.3%	aggregate
19	Qwen/Qwen3-235B-A22B-Thinking-AWQ-2507	17.7% ± 0.3%	aggregate
20	microsoft/phi-4-mini-reasoning	14.8% ± 0.3%	aggregate
21	Qwen/Qwen3-4B	14.6% ± 0.2%	aggregate
22	openai/gpt-oss-120b	14.3% ± 0.3%	aggregate
23	openai/gpt-oss-20b	13.5% ± 0.3%	aggregate
24	microsoft/phi-4-reasoning-plus	10.2% ± 0.2%	aggregate
25	zai-org/GLM-4.5-Air-FP8	8.3% ± 0.1%	aggregate
26	Qwen/Qwen3-14B	8.1% ± 0.1%	aggregate
27	Qwen/Qwen3-32B	7.6% ± 0.1%	aggregate

27 models
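
Several adjacent ranks in the table differ by less than their reported margins, so the ordering between them may not be meaningful. The page does not say what the ± value represents. The sketch below assumes it is a symmetric half-width around the score (an assumption, not something the source confirms) and flags neighbouring rows whose ranges intersect. The `Entry` type and both helper functions are hypothetical names introduced for illustration; the four rows are copied from ranks 6 to 9 of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float   # reported score, in percent
    margin: float  # reported ± value, in percent (ASSUMED to be a symmetric half-width)


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True when [score - margin, score + margin] of a and b intersect."""
    return abs(a.score - b.score) <= a.margin + b.margin


def adjacent_ties(rows: list[Entry]) -> list[tuple[str, str]]:
    """List neighbouring leaderboard rows whose ranges intersect."""
    return [(a.model, b.model) for a, b in zip(rows, rows[1:]) if intervals_overlap(a, b)]


# Ranks 6 to 9 from the leaderboard table.
entries = [
    Entry("google/gemma-3-27b-it", 37.0, 0.5),
    Entry("google/gemma-4-31B-it", 37.0, 0.4),
    Entry("Qwen/Qwen3.6-27B", 36.6, 0.4),
    Entry("Qwen/Qwen3.5-122B-A10B-NVFP4", 35.5, 0.4),
]

for pair in adjacent_ties(entries):
    print("not clearly separated:", pair)
```

Under that assumption, ranks 6, 7, and 8 are not clearly separated from their neighbours, while rank 9 sits outside the range of rank 8. An overlap check like this is a coarse screen, not a significance test.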

Caveats

LongBench's subtasks emit heterogeneous primary metrics (qa_f1_score, rouge_score, code_sim_score, classification_score, and others), so the benchmark-level primary_metric resolves cleanly only against the QA-style subtask leaves. Runs that target the full LongBench root may surface BENCHMARK_PRIMARY_METRIC_MISSING in this system until the benchmark's aggregate policy is decided. Any LongBench composite score reported in papers averages across metric scales that are not directly comparable (F1, ROUGE, exact match, code similarity), so a strong result on a single task family can inflate the headline number. Per-subtask reporting is preferable to a single LongBench score whenever discriminating between models matters.

The "long-context" framing is also less demanding than it once was. Context lengths in LongBench, as published, peak at around 30k tokens, well within the context windows of current frontier models, so high LongBench scores no longer indicate strong long-context behavior at the lengths that matter for contemporary retrieval-augmented or agentic workloads. Many subtasks are also retrieval-style, with answers concentrated near the start or end of the context, which means positional biases can confound claims of genuine long-range understanding.

Coverage is limited to two languages (English and Chinese), and per-subtask sample sizes are sometimes only a few hundred examples, which amplifies variance. Finally, the few-shot learning subtasks are arguably out of scope for a long-context benchmark: they probe in-context learning more than long-range reading comprehension, so LongBench mixes long-context and in-context-learning signal under one composite.
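
The composite-score caveat can be made concrete with a small sketch. All numbers below are invented for illustration and come from no real run. The model names, the family labels, and the `exact_match` metric identifier are hypothetical; only the other metric names appear in the text above. The sketch takes an unweighted mean over six task families, which is one possible aggregate policy and not necessarily the one any paper or this system uses.

```python
from statistics import mean

# HYPOTHETICAL per-family results: (task family, metric name, score on a 0-100 scale).
results = {
    "model_a": [
        ("single_doc_qa", "qa_f1_score", 30.0),
        ("multi_doc_qa", "qa_f1_score", 28.0),
        ("summarization", "rouge_score", 22.0),
        ("few_shot", "classification_score", 60.0),
        ("synthetic", "exact_match", 95.0),  # one very strong family
        ("code", "code_sim_score", 50.0),
    ],
    "model_b": [
        ("single_doc_qa", "qa_f1_score", 36.0),
        ("multi_doc_qa", "qa_f1_score", 34.0),
        ("summarization", "rouge_score", 26.0),
        ("few_shot", "classification_score", 66.0),
        ("synthetic", "exact_match", 20.0),
        ("code", "code_sim_score", 56.0),
    ],
}


def composite(rows):
    """Unweighted mean across families, mixing metric scales (the practice the caveat warns about)."""
    return mean(score for _, _, score in rows)


def family_wins(rows, other_rows):
    """Families where the first model scores strictly higher than the second."""
    other = {family: score for family, _, score in other_rows}
    return sorted(family for family, _, score in rows if score > other[family])


for name, rows in results.items():
    print(name, "composite:", round(composite(rows), 1))
print("model_a wins:", family_wins(results["model_a"], results["model_b"]))
print("model_b wins:", family_wins(results["model_b"], results["model_a"]))
```

In this made-up case model_a has the higher composite (47.5 against roughly 39.7) even though model_b is ahead in five of the six families, which is why a per-family breakdown is the safer basis for comparison.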

How to cite

Citation

FrozeBench. "LongBench." https://frozebench.com/benchmarks/longbench. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_longbench,
  title = {LongBench},
  howpublished = {\url{https://frozebench.com/benchmarks/longbench}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/longbench