FrozeBench

LongBench

Long Context Understanding

LongBench is the first major bilingual long-context benchmark for LLMs, comprising 21 datasets organized into six task families: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks (e.g. passage retrieval and counting), and code completion. The English splits average roughly 6,700 words per input and the Chinese splits roughly 13,400 characters, with individual examples reaching tens of thousands of tokens. LongBench was designed to evaluate whether models can actually use long contexts rather than merely accept them as input, and it became one of the most widely cited long-context benchmarks following its 2023 release.

Source paper | Latest run: 2026-05-18

Benchmark results


27 models

Caveats

LongBench's subtasks emit heterogeneous primary metrics (qa_f1_score, rouge_score, code_sim_score, classification_score, and others), so the benchmark-level primary_metric only resolves cleanly against the QA-style subtask leaves. Runs that target the full LongBench root may surface BENCHMARK_PRIMARY_METRIC_MISSING in this system until the benchmark's aggregate policy is decided. Any LongBench composite score reported in papers averages across incomparable metric scales (F1, ROUGE, exact match, code similarity), so strong performance on a single family can inflate the headline number. Per-subtask reporting is preferable to a single LongBench score whenever discriminating between models matters.

The "long-context" framing is also less demanding than it once was. Context lengths in LongBench peaked at around 30k tokens at the time of publication, well within the context windows of current frontier models, so high LongBench scores no longer indicate strong long-context behavior at the lengths that matter for contemporary retrieval-augmented or agentic workloads. Many subtasks are also retrieval-style, with answers concentrated near the start or end of the context, which means positional biases can confound claims of genuine long-range understanding.

Coverage is limited to two languages (English and Chinese), and some subtasks have only a few hundred examples, which amplifies variance in per-subtask scores.

Finally, the few-shot learning subtasks are arguably out of scope for a long-context benchmark: they probe in-context learning more than long-range reading comprehension, so LongBench mixes long-context and in-context-learning signal under one composite.
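The caveat about composite scores can be made concrete with a small sketch. The example below is illustrative only: the two models, the family names, and every score are made up, and the macro-average in `composite` is an assumed aggregation policy, not the one used by LongBench or by this system. It shows two hypothetical models that tie on the composite while differing substantially per family.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (task_family, metric_name, score on a 0-100 scale).
# Metric names follow the ones listed in the caveats; the scores are
# invented for illustration and are NOT real benchmark results.
MODEL_A = [
    ("single_doc_qa", "qa_f1_score", 40.0),
    ("summarization", "rouge_score", 20.0),
    ("code_completion", "code_sim_score", 90.0),
]
MODEL_B = [
    ("single_doc_qa", "qa_f1_score", 50.0),
    ("summarization", "rouge_score", 25.0),
    ("code_completion", "code_sim_score", 75.0),
]


def family_means(rows):
    """Average the scores within each task family."""
    grouped = defaultdict(list)
    for family, _metric, score in rows:
        grouped[family].append(score)
    return {family: mean(scores) for family, scores in grouped.items()}


def composite(rows):
    """Assumed policy: unweighted mean of the per-family means.

    This mixes F1, ROUGE, and code similarity as if they shared a scale,
    which is exactly the problem the caveat describes.
    """
    return mean(list(family_means(rows).values()))


if __name__ == "__main__":
    for name, rows in (("model_a", MODEL_A), ("model_b", MODEL_B)):
        print(name, "composite:", composite(rows), "per family:", family_means(rows))
```

Both hypothetical models land on the same composite, yet model_a trails on QA and summarization and leads only on code completion. Reporting the per-family dictionary alongside any composite keeps that difference visible.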

How to cite

Citation

FrozeBench. "LongBench." https://frozebench.com/benchmarks/longbench. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_longbench,
  title = {LongBench},
  howpublished = {\url{https://frozebench.com/benchmarks/longbench}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/longbench