FrozeBench

IFEval

Instruction Following

IFEval (Instruction-Following Evaluation) is a 541-prompt benchmark designed to measure how well a model obeys explicit, programmatically verifiable instructions in its output. Each prompt embeds one or more constraints drawn from a taxonomy of 25 instruction types, including length constraints (word, sentence, and paragraph counts), keyword inclusion or exclusion, formatting requirements (bullet points, JSON, title case), language-of-output constraints, and structural rules. Compliance is checked by deterministic regex and counting rules, with no LLM judge or human annotator in the loop. This makes IFEval cheap and fully reproducible, at the cost of checking only what those rules can express.
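To make "deterministic regex and counting rules" concrete, here is a minimal sketch of what such checkers might look like. The function names, signatures, and exact matching rules below are illustrative assumptions, not the benchmark's actual implementation.

```python
import json
import re
from typing import List


def check_min_words(response: str, min_words: int) -> bool:
    """Length constraint: response has at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def check_keywords_included(response: str, keywords: List[str]) -> bool:
    """Keyword constraint: every keyword appears as a whole word (case-insensitive)."""
    lowered = response.lower()
    return all(
        re.search(r"\b" + re.escape(keyword.lower()) + r"\b", lowered) is not None
        for keyword in keywords
    )


def check_bullet_count(response: str, expected: int) -> bool:
    """Formatting constraint: exactly `expected` lines start with a '*' or '-' bullet."""
    bullets = re.findall(r"^[ \t]*[*-][ \t]+\S", response, flags=re.MULTILINE)
    return len(bullets) == expected


def check_is_json(response: str) -> bool:
    """Formatting constraint: the whole response parses as JSON."""
    try:
        json.loads(response.strip())
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False
    return True
```

Each check returns a plain boolean and depends only on the response text, which is why the same output always receives the same verdict. None of these functions looks at whether the content is correct.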

Source paper | Latest run: 2026-05-25

Benchmark results


28 models

Caveats

IFEval measures format compliance, not content quality. A model can score perfectly on IFEval while producing factually wrong, incoherent, or low-quality output that happens to satisfy the requested format constraints, so a high IFEval score is necessary but not sufficient evidence of useful instruction-following. Treating it as a general-purpose follow-the-instructions metric overstates what it actually measures.

The benchmark reports four metrics: prompt-level strict accuracy, prompt-level loose accuracy, instruction-level strict accuracy, and instruction-level loose accuracy. Prompt-level accuracy counts a prompt as passed only if every instruction in it is satisfied, while instruction-level accuracy scores each instruction separately. The four metrics often diverge by several points, and papers report inconsistently across them, which makes cross-paper IFEval comparisons ambiguous unless the specific metric is named. The loose variants re-check lightly transformed versions of the response (for example, with markdown emphasis markers or the first or last line removed) and pass if any version complies. These heuristics can grant false positives by accepting outputs that do not actually satisfy the stated constraint.

Coverage is narrow within the instruction-following space: the 25 instruction types do not include persona adherence, multi-turn instruction maintenance, conditional logic ("do X only if Y"), or many real-world multi-constraint scenarios, and the benchmark is English-only.
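The gap between the four metrics is easiest to see on a toy example. The sketch below assumes each prompt's per-instruction verdicts are already available as booleans. The `PromptResult` type and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptResult:
    """Per-instruction pass/fail verdicts for one prompt (hypothetical container)."""
    strict: List[bool]  # verdicts on the raw response
    loose: List[bool]   # verdicts under the more lenient loose check


def prompt_level_accuracy(results: List[PromptResult], loose: bool = False) -> float:
    """Fraction of prompts in which every instruction passed."""
    passed = sum(all(r.loose if loose else r.strict) for r in results)
    return passed / len(results)


def instruction_level_accuracy(results: List[PromptResult], loose: bool = False) -> float:
    """Fraction of individual instructions that passed, pooled over all prompts."""
    verdicts = [v for r in results for v in (r.loose if loose else r.strict)]
    return sum(verdicts) / len(verdicts)


# Three toy prompts carrying 2, 2, and 1 instructions respectively.
results = [
    PromptResult(strict=[True, True], loose=[True, True]),
    PromptResult(strict=[True, False], loose=[True, True]),
    PromptResult(strict=[False], loose=[False]),
]

print(prompt_level_accuracy(results))                   # 1 of 3 prompts fully pass
print(prompt_level_accuracy(results, loose=True))       # 2 of 3 prompts fully pass
print(instruction_level_accuracy(results))              # 3 of 5 instructions pass
print(instruction_level_accuracy(results, loose=True))  # 4 of 5 instructions pass
```

On this toy data the same model outputs score anywhere from about 0.33 to 0.80 depending on which metric is chosen, which is why a reported IFEval number is hard to interpret without the metric name.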

How to cite

Citation

FrozeBench. "IFEval." https://frozebench.com/benchmarks/ifeval. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_ifeval,
  title = {IFEval},
  howpublished = {\url{https://frozebench.com/benchmarks/ifeval}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/ifeval