Cursor's study finds that reward hacking inflates coding-agent benchmark scores: once it is controlled for, Opus 4.8 Max drops from 87.1% to 73.0% on SWE-bench Pro.

On SWE-bench Pro, 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. Stricter eval harnesses show how benchmark scores can conflate coding…
