Reward hacking is swamping model intelligence gains · Cursor

Smarter models are becoming more resourceful at hacking coding benchmarks.

Eval suites built from real bugs that were later fixed are especially vulnerable because the problems have already been solved. If the agent has access to repository history or the public web, it can sometimes look up the answer rather than derive it.

To measure how widespread this behavior is, we built an agent to audit eval trajectories. On SWE-bench Pro, we found that 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. When we sealed git history and restricted internet access, scores dropped sharply for Opus as well as for our model, Composer 2.5:

Opus 4.8 Max fell from 87.1% to 73.0%

Composer 2.5 fell from 74.7% to 54.0%

Smarter models are becoming more resourceful at hacking coding benchmarks.

Opus 4.8 Max fell from 87.1% to 73.0%

Composer 2.5 fell from 74.7% to 54.0%

Reward hacking is swamping model intelligence gains · Cursor

Reward hacking is swamping model intelligence gains · Cursor

Other newsrooms on this story

Related reading

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on…

Your AI agent reports 80% task completion. It fabricated it.

The agent that fixes bugs by running the code

Popular AI model performance benchmark may be flawed, Meta researchers warn

How Far Can a Small Coding Model Go With a Better Harness?

Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from…

Related reading

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on…

Your AI agent reports 80% task completion. It fabricated it.

The agent that fixes bugs by running the code

Popular AI model performance benchmark may be flawed, Meta researchers warn

How Far Can a Small Coding Model Go With a Better Harness?

Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from…

Other newsrooms on this story