Epoch AI's new MirrorCode benchmark tests whether AI models can recreate complete programs without access to the original code. Claude Opus 4.7 leads with a 56 percent solve rate, rebuilding a 16,000-line toolkit in just 14 hours. But every model tested still fails on the most complex tasks.

The MirrorCode benchmark, from METR and Epoch AI, tests AI agents on reimplementing entire programs. Claude Opus 4.6 rebuilt a 16,000-line toolkit passing 99.95%
