IFEval

Instruction Following

IFEval (Instruction-Following Evaluation) is a 541-prompt benchmark that measures how well a model obeys explicit, programmatically verifiable instructions in its output. Each prompt embeds one or more constraints drawn from a taxonomy of 25 instruction types, including length constraints (word, sentence, and paragraph counts), keyword inclusion or exclusion, formatting requirements (bullet points, JSON, title case), language-of-output constraints, and structural rules. Compliance is checked by deterministic regex and counting rules, with no LLM judge or human annotator in the loop. This makes IFEval cheap and fully reproducible, at the cost of checking only what those rules can express.
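To make "deterministic regex and counting rules" concrete, here is a minimal sketch of three checkers in that spirit. The function names, thresholds, and the sample response are hypothetical illustrations, not the benchmark's actual checker code.

```python
import re


def check_max_words(response: str, max_words: int) -> bool:
    """Length constraint: at most `max_words` whitespace-separated tokens."""
    return len(response.split()) <= max_words


def check_keywords_present(response: str, keywords: list[str]) -> bool:
    """Keyword inclusion: every keyword appears as a whole word, ignoring case."""
    return all(
        re.search(rf"\b{re.escape(kw)}\b", response, flags=re.IGNORECASE)
        for kw in keywords
    )


def check_bullet_count(response: str, n: int) -> bool:
    """Formatting: exactly `n` lines that start with a '* ' or '- ' bullet."""
    bullets = re.findall(r"^[ \t]*[*-] .*$", response, flags=re.MULTILINE)
    return len(bullets) == n


# Hypothetical model output checked against three constraints.
response = "* Paris is the capital.\n* It sits on the Seine."
results = [
    check_max_words(response, 20),
    check_keywords_present(response, ["paris", "seine"]),
    check_bullet_count(response, 2),
]
print(results)
```

Each checker is a pure function of the response text, which is what makes the scoring reproducible. None of them looks at whether the content is true or useful.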

Latest run: 2026-05-25

Benchmark results

#	Model	prompt_level_strict_acc
1	google/gemma-4-31B-it	91.3% ± 1.2%
2	google/Gemma-4-31B-IT-NVFP4	90.8% ± 1.2%
3	Qwen/Qwen3.6-35B-A3B	89.6% ± 1.3%
4	google/gemma-4-26B-A4B-it	89.5% ± 1.3%
5	Qwen/Qwen3.6-27B	88.9% ± 1.4%
6	nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4	88.4% ± 1.4%
7	Qwen/Qwen3-Next-80B-A3B-Instruct	87.1% ± 1.4%
8	Qwen/Qwen3.5-122B-A10B-NVFP4	85.4% ± 1.5%
9	google/gemma-3-27b-it	82.3% ± 1.6%
10	Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8	82.1% ± 1.7%
11	Qwen/Qwen3-8B	81.7% ± 1.7%
12	google/gemma-3-12b-it	81.0% ± 1.7%
13	Qwen/Qwen3-14B	80.2% ± 1.7%
14	Qwen/Qwen3-32B	79.5% ± 1.7%
15	Qwen/Qwen3-4B	79.3% ± 1.7%
16	openai/gpt-oss-120b	78.2% ± 1.8%
17	Qwen/Qwen3-235B-A22B-Thinking-AWQ-2507	77.6% ± 1.8%
18	Qwen/Qwen3.5-35B-A3B	77.6% ± 1.8%
19	zai-org/GLM-4.5-Air-FP8	76.7% ± 1.8%
20	Qwen/Qwen3-4B-AWQ	76.5% ± 1.8%
21	zai-org/GLM-4.5V-FP8	69.9% ± 2.0%
22	microsoft/phi-4-mini-instruct	69.3% ± 2.0%
23	microsoft/phi-4	62.7% ± 2.1%
24	openai/gpt-oss-20b	60.4% ± 2.1%
25	microsoft/phi-4-mini-reasoning	41.8% ± 2.1%
26	MiniMax/MiniMax-M2-AWQ	40.3% ± 2.1%
27	MiniMax/MiniMax-M2.1-AWQ	35.3% ± 2.1%
28	microsoft/phi-4-reasoning-plus	23.3% ± 1.8%
28	microsoft/phi-4-reasoning-plus	23.3%±1.8%	View run →

28 models
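The page does not say how the ± values are computed. They look consistent with the standard error of a proportion over the 541 prompts, so the sketch below assumes that reading. The function name and the choice of the n - 1 (sample) denominator are assumptions made for illustration.

```python
import math

N_PROMPTS = 541  # benchmark size stated in the description


def proportion_standard_error(accuracy: float, n: int = N_PROMPTS) -> float:
    """Standard error of a pass rate, treating each prompt as a 0/1 outcome.

    Uses the sample (n - 1) form. This is an assumption; at n = 541 the
    difference from dividing by n is far below the displayed precision
    for most rows.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


# Two rows from the leaderboard, using the rounded accuracies shown there.
rows = [
    ("google/gemma-4-31B-it", 0.913),
    ("microsoft/phi-4-reasoning-plus", 0.233),
]
for name, acc in rows:
    se = proportion_standard_error(acc)
    print(f"{name}: {acc:.1%} ± {se:.1%}")
```

Under this reading, two models whose scores differ by less than roughly their combined ± (for example ranks 3 and 4) should not be treated as meaningfully separated.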

Caveats

IFEval measures format compliance, not content quality. A model can score perfectly on IFEval while producing factually wrong, incoherent, or low-quality output that happens to satisfy the requested format constraints, so a high IFEval score is necessary but not sufficient evidence of useful instruction-following. Treating it as a general-purpose follow-the-instructions metric overstates what it actually measures.

The benchmark reports four metrics (prompt-level strict accuracy, prompt-level loose accuracy, instruction-level strict accuracy, and instruction-level loose accuracy) that often diverge by several points. Papers report inconsistently across them, so cross-paper IFEval comparisons are ambiguous unless the specific metric is named. The loose variants also use regex heuristics that can grant false positives by accepting outputs that do not actually satisfy the stated constraint.

Coverage is narrow within the instruction-following space: the 25 instruction types do not include persona adherence, multi-turn instruction maintenance, conditional logic ("do X only if Y"), or many real-world multi-constraint scenarios, and the benchmark is English-only.
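The gap between the four metrics is easier to see in code. The sketch below shows how prompt-level and instruction-level accuracy can diverge on the same results, and how a loose check can produce a false positive. The specific relaxations used here (dropping the first or last line, stripping asterisks) are a simplified assumption about what "loose" means, and all names and data are hypothetical.

```python
from typing import Callable


def prompt_level_acc(results: list[list[bool]]) -> float:
    """Share of prompts where every instruction was followed."""
    return sum(all(r) for r in results) / len(results)


def instruction_level_acc(results: list[list[bool]]) -> float:
    """Share of individual instructions followed, pooled across prompts."""
    total = sum(len(r) for r in results)
    return sum(sum(r) for r in results) / total


def loose_variants(response: str) -> list[str]:
    """Relaxed copies of a response (assumed set of relaxations)."""
    lines = response.split("\n")
    base = [
        response,
        "\n".join(lines[1:]).strip(),    # drop first line
        "\n".join(lines[:-1]).strip(),   # drop last line
        "\n".join(lines[1:-1]).strip(),  # drop both
    ]
    return base + [v.replace("*", "") for v in base]


def follows_loose(response: str, check: Callable[[str], bool]) -> bool:
    """Loose pass: the check succeeds on at least one relaxed variant."""
    return any(check(v) for v in loose_variants(response))


# results[i][j]: was instruction j of prompt i followed? (made-up data)
strict_results = [[True], [True, False], [True, True, True]]
print(prompt_level_acc(strict_results))       # 2 of 3 prompts fully pass
print(instruction_level_acc(strict_results))  # 5 of 6 instructions pass


def no_sorry(text: str) -> bool:
    """Keyword exclusion: the word 'sorry' must not appear."""
    return "sorry" not in text.lower()


response = "Sorry, here is the list.\nApples and pears."
print(no_sorry(response), follows_loose(response, no_sorry))
```

In the last line the strict check fails but the loose check passes, because dropping the first line removes the forbidden word. This is the kind of false positive described above, and it is one reason to name the exact metric when quoting a score.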

How to cite

Citation

FrozeBench. "IFEval." https://frozebench.com/benchmarks/ifeval. Retrieved 2026-06-04.

BibTeX

@misc{frozebench_ifeval,
  title = {IFEval},
  howpublished = {\url{https://frozebench.com/benchmarks/ifeval}},
  year = {2026},
  note = {FrozeBench. Retrieved 2026-06-04.}
}

URL

https://frozebench.com/benchmarks/ifeval