Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study

TL;DR

It is critical that LLMs follow user instructions. While prior studies assess instruction adherence in the model's main responses, we argue that it is also important for large reasoning models (LRMs) to follow user instructions throughout their reasoning process.

We introduce ReasonIF, a systematic benchmark for assessing reasoning instruction-following abilities across multilingual reasoning, formatting, and length control.

We find that frontier LRMs, including GPT-OSS-120B, Qwen3-235B, and DeepSeek-R1, fail to follow reasoning instructions more than 75% of the time. Notably, as task difficulty increases, reasoning instruction following degrades further.

For more information, please see our paper and GitHub repository.

1. Introduction

Large reasoning models (LRMs) generate step-by-step reasoning traces between special tags (e.g., <think>...</think> in DeepSeek family models, and <|channel|>analysis<|message|>...<|end|> in GPT-OSS family models). They have rapidly become popular for tasks ranging from exploring research ideas to building large-scale software systems and making informed decisions. Their reasoning ability not only improves interpretability but also allows for iterative refinement, making LRMs highly effective in tasks requiring extensive reasoning.
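As a concrete illustration of the tag formats mentioned above, a reasoning trace can be separated from the final response with a regular expression. The following is a minimal sketch under stated assumptions, not the tooling used in this study: the function name `split_reasoning` is hypothetical, only the two tag formats quoted above are handled, and real chat templates may contain additional markers that this code does not strip.

```python
import re

# Assumed patterns, based only on the two tag formats quoted in the text.
THINK_TAG = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANALYSIS_CHANNEL = re.compile(
    r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", re.DOTALL
)


def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Return (reasoning_trace, remaining_text) for a raw model output.

    Hypothetical helper: whatever follows the first matched reasoning
    block is returned unchanged as the remaining text.
    """
    for pattern in (THINK_TAG, ANALYSIS_CHANNEL):
        match = pattern.search(raw_output)
        if match:
            reasoning = match.group(1).strip()
            remaining = raw_output[match.end():].strip()
            return reasoning, remaining
    # No recognized reasoning block: treat everything as the response.
    return "", raw_output.strip()
```

Separating the two parts this way is what makes it possible to check an instruction against the reasoning trace alone, independently of the final response.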
At Together AI, we're thrilled to see explosive interest in LRMs across the entire AI lifecycle. Yet a key question remains:

Do these high-performing models follow user instructions in their reasoning trace?

Following user instructions throughout the reasoning trace, not just in the final response, improves controllability, transparency, and safety:

- Process-level instruction following makes interactions more predictable and user-centered, allowing users to guide how the model thinks, not just what it outputs.
- Structured reasoning traces (e.g., JSON steps, cited evidence) enable programmatic auditing for logic and compliance.
- Consistent adherence to reasoning instructions helps prevent reward hacking and shortcuts that produce superficially correct answers.
- Faithful, instruction-aligned reasoning is also more robust to adversarial manipulation, since explicit user-defined rules constrain the model's internal steps.

Motivated by these points, we introduce a new benchmark and evaluate how faithfully LRMs follow instructions when producing reasoning traces. Our key finding: while models generally comply in their final responses, they fail far more often in their reasoning steps, and this shortfall worsens with task difficulty.

2. ReasonIF: A new benchmark dataset

To push the field forward, we introduce ReasonIF, a new benchmark dataset designed to evaluate instruction-following abilities within reasoning traces. ReasonIF consists of 300 math and science problems, each paired with a concrete reasoning instruction. Every input prompt comprises two components:

- A question sampled from established benchmark collections (GSM8K, AMC, AIME, GPQA-diamond, and ARC-Challenge), ensuring a broad spectrum of reasoning styles.
- An instruction randomly selected from a set of six user-oriented directives that the model must obey throughout its step-by-step solution.
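The two-component prompt described above can be sketched in a few lines of Python. This is an illustrative sketch only: the class and function names are hypothetical, the instruction wording in the usage example is a placeholder rather than the benchmark's actual text, and the way the question and instruction are joined is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkPrompt:
    """Hypothetical container for one (question, instruction) pair."""
    question: str
    instruction: str

    def render(self) -> str:
        # Assumed layout: question first, then the reasoning instruction.
        return f"{self.question}\n\n{self.instruction}"


def build_prompt(question: str, instruction_pool: list[str],
                 rng: random.Random) -> BenchmarkPrompt:
    """Pair a question with one instruction drawn at random from the pool."""
    return BenchmarkPrompt(question=question,
                           instruction=rng.choice(instruction_pool))


# Usage with placeholder instruction texts (not the benchmark's wording).
pool = [
    "When reasoning, use no more than 100 words.",
    "When reasoning, write in uppercase letters only.",
]
prompt = build_prompt("What is 12 * 7?", pool, random.Random(0))
print(prompt.render())
```

Passing an explicit `random.Random` instance keeps the pairing reproducible, which matters when the same question set is reused across models.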
Following IFEval (Zhou et al., 2023), prior work that examined the general instruction-following capabilities of large language models, we employ verifiable instructions that can be automatically evaluated without relying on another LLM. Unlike IFEval, however, our focus is on the reasoning trace itself. The six instruction types are crafted to reflect realistic user needs while enabling precise, automatic verification of whether the model adheres to the prescribed reasoning guidance:

- Multilinguality: Constrains reasoning to a specific language (e.g., Hindi, Arabic).
- Word limit: Caps verbosity to save cost and improve conciseness.
- Disclaimer: Enforces a safety reminder appended verbatim at the end.
- JSON formatting: Ensures structured, machine-readable outputs.
- Uppercase only: Forces strict formatting and tests fine-grained syntactic control.
- Remove commas: Like "Uppercase only", tests fine-grained syntactic control.

Here are a few representative examples in our benchmark dataset:
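Apart from the dataset examples, it may help to see what "automatically verifiable" could look like in code for the six instruction types listed above. The following is a minimal sketch under stated assumptions, not the benchmark's actual checkers: all function names are hypothetical, the word count uses simple whitespace splitting, and the language check is a crude Unicode-script heuristic standing in for a proper language identifier.

```python
import json
import unicodedata


def check_word_limit(trace: str, max_words: int) -> bool:
    """Word limit: whitespace-separated word count stays within the cap."""
    return len(trace.split()) <= max_words


def check_disclaimer(trace: str, disclaimer: str) -> bool:
    """Disclaimer: the trace ends with the disclaimer text verbatim."""
    return trace.rstrip().endswith(disclaimer)


def check_json(trace: str) -> bool:
    """JSON formatting: the whole trace parses as JSON."""
    try:
        json.loads(trace)
    except json.JSONDecodeError:
        return False
    return True


def check_uppercase(trace: str) -> bool:
    """Uppercase only: has letters, and upper-casing changes nothing."""
    return any(c.isalpha() for c in trace) and trace == trace.upper()


def check_no_commas(trace: str) -> bool:
    """Remove commas: no ASCII comma appears anywhere in the trace."""
    return "," not in trace


def script_fraction(trace: str, script_prefix: str) -> float:
    """Multilinguality (heuristic): share of letters whose Unicode name
    starts with the given script prefix, e.g. "DEVANAGARI" or "ARABIC".
    This approximates script, not language, and is an assumption here."""
    letters = [c for c in trace if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(c, "").startswith(script_prefix)
               for c in letters)
    return hits / len(letters)
```

Each check is a pure function of the reasoning trace, so compliance can be scored deterministically and without calling another model. A script-based heuristic cannot distinguish languages that share a script (for example Hindi and Marathi), so a real evaluation would likely need a dedicated language detector.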
