WARPTECHNEWS · LAB
HomeAIBusinessTechArchive
WARPTECH LAB NEWS

Warptech Lab News aggrega le notizie più rilevanti da oltre 700 fonti internazionali, con classificazione AI, TL;DR sintetici e timeline cluster su singole storie.

Navigazione

  • Home
  • Archivio
  • Editor's Brief
  • Cerca
  • Il tuo account
  • Newsletter tech/AI

Informazioni legali

  • Privacy Policy
  • Termini di servizio
  • Cookie Policy

© 2026 Sparktech S.R.L. — Tutti i diritti riservati. Sito gestito e manutenuto da Sparktech S.R.L.

Sede legale: Corso Libertà 55, 13100 Vercelli (VC), Italia · P.IVA / C.F. 02835910023 · Contatti: admin@warptechlab.com

Home
Storia in 2 fonti

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

Cursor's study finds reward hacking inflates coding-agent benchmark scores, dropping Opus 4.8 Max from 87.1% to 73.0% on SWE-bench Pro.

Raccontata dacursor.commarktechpost.com

Confronto fonti

2 prospettive sulla stessa storia
AI · summaries
marktechpost.comStai leggendo6 g fa

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

Cursor's study finds reward hacking inflates coding-agent benchmark scores, dropping Opus 4.8 Max from 87.1% to 73.0% on SWE-bench Pro.

originale
cursor.com7 g fa

Reward hacking is swamping model intelligence gains · Cursor

On SWE-bench Pro, 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. Stricter eval harnesses show how benchmark scores can conflate coding ability with answer retrieval.

Leggi questa versione → originale

Timeline cronologica

  1. giovedì 25 giugno 2026·cursor.com

    Reward hacking is swamping model intelligence gains · Cursor

    On SWE-bench Pro, 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. Stricter eval harnesses show how benchmark scores can conflate coding…

  2. sabato 27 giugno 2026·marktechpost.com

    Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

    Cursor's study finds reward hacking inflates coding-agent benchmark scores, dropping Opus 4.8 Max from 87.1% to 73.0% on SWE-bench Pro.