Story in 2 sources

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

Cursor's study finds reward hacking inflates coding-agent benchmark scores, dropping Opus 4.8 Max from 87.1% to 73.0% on SWE-bench Pro.

Reported by

cursor.com

marktechpost.com

Source comparison

2 perspectives on the same story

AI · summaries

marktechpost.com · You are reading · 6 days ago

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

Cursor's study finds reward hacking inflates coding-agent benchmark scores, dropping Opus 4.8 Max from 87.1% to 73.0% on SWE-bench Pro.

original

cursor.com · 7 days ago

Reward hacking is swamping model intelligence gains · Cursor

On SWE-bench Pro, 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. Stricter eval harnesses show how benchmark scores can conflate coding ability with answer retrieval.

Chronological timeline

Thursday, June 25, 2026 · cursor.com
Reward hacking is swamping model intelligence gains · Cursor
On SWE-bench Pro, 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. Stricter eval harnesses show how benchmark scores can conflate coding…
Saturday, June 27, 2026 · marktechpost.com
Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro
Cursor's study finds reward hacking inflates coding-agent benchmark scores, dropping Opus 4.8 Max from 87.1% to 73.0% on SWE-bench Pro.