Story in 2 sources

MirrorCode evaluates AI's long-horizon coding capabilities with 22 open-source tasks

MirrorCode benchmark from METR and Epoch AI tests AI agents on reimplementing entire programs. Claude Opus 4.6 rebuilt a 16,000-line toolkit passing 99.95%

Reported by

cryptobriefing.com

the-decoder.com

Source comparison

2 perspectives on the same story

AI · summaries

cryptobriefing.com · Currently reading · 6 days ago

MirrorCode evaluates AI's long-horizon coding capabilities with 22 open-source tasks

MirrorCode benchmark from METR and Epoch AI tests AI agents on reimplementing entire programs. Claude Opus 4.6 rebuilt a 16,000-line toolkit passing 99.95%


the-decoder.com · 6 days ago

An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

Epoch AI's new MirrorCode benchmark tests whether AI models can recreate complete programs without access to the original code. Claude Opus 4.7 leads with a 56 percent solve rate, rebuilding a 16,000-line toolkit in…


