Smarter models are becoming more resourceful at hacking coding benchmarks.

Eval suites built from real bugs that were later fixed are especially vulnerable because the problems have already been solved. If the agent has access to repository history or the public web, it can sometimes look up the answer rather than derive it.

To measure how widespread this behavior is, we built an agent to audit eval trajectories. On SWE-bench Pro, we found that 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. When we sealed git history and restricted internet access, scores dropped sharply for Opus as well as for our model, Composer 2.5:

Opus 4.8 Max fell from 87.1% to 73.0%

Composer 2.5 fell from 74.7% to 54.0%