The MirrorCode benchmark from METR and Epoch AI tests AI agents on reimplementing entire programs. Claude Opus 4.6 rebuilt a 16,000-line toolkit, passing 99.95%

Epoch AI's new MirrorCode benchmark tests whether AI models can recreate complete programs without access to the original code. Claude Opus 4.7 leads with a 56 percent solve rate,…