Epoch AI's new MirrorCode benchmark tests whether AI models can recreate entire programs on their own. Claude Opus 4.7 leads with 56 percent, but every model still fails on the most complex tasks.
In the new MirrorCode coding benchmark from Epoch AI and METR, AI models have to reimplement complete programs from scratch without access to the original source code.
The 25 target programs cover Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. Each AI-generated solution must reproduce the original program's output exactly, including on hidden end-to-end tests the model never sees during development.
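The exact-match requirement can be pictured as a differential test: run the original and the reimplementation on the same inputs and compare the results. The following Python sketch is an illustration only, not MirrorCode's actual harness. The names `Case`, `run`, and `pass_rate`, the toy upper-casing programs, and the choice to compare exit code plus standard output are all assumptions made for the example.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Case:
    """One end-to-end test: command-line arguments plus stdin text (hypothetical format)."""
    args: list[str]
    stdin: str = ""


def run(cmd: list[str], case: Case, timeout: float = 10.0) -> tuple[int, str]:
    """Run a command on one case and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference: list[str], candidate: list[str], cases: list[Case]) -> float:
    """Fraction of cases where the candidate matches the reference exactly."""
    if not cases:
        return 0.0
    passed = sum(run(reference, c) == run(candidate, c) for c in cases)
    return passed / len(cases)


# Toy stand-ins (assumptions): a "reference" filter that upper-cases stdin,
# one faithful reimplementation, and one that wrongly strips whitespace.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
GOOD = [sys.executable, "-c",
        "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]
BAD = [sys.executable, "-c",
       "import sys; sys.stdout.write(sys.stdin.read().upper().strip())"]

# Inputs the candidate's author never saw while developing.
HIDDEN_CASES = [Case([], "hello\n"), Case([], "a\nb\n"), Case([], "")]

if __name__ == "__main__":
    print("good:", pass_rate(REFERENCE, GOOD, HIDDEN_CASES))
    print("bad:", pass_rate(REFERENCE, BAD, HIDDEN_CASES))
```

In this toy setup the faithful reimplementation matches on every case, while the whitespace-stripping variant matches only on the empty input. Even a one-character deviation in output counts as a failure under an exact-match rule, which is what makes reimplementing a large program from behavior alone so demanding.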
Another difference from many other benchmarks is the inference budget. Existing software engineering benchmarks often cap costs at $1 to $10 per task, even when a human would need weeks to finish the same work, the developers write.
According to Epoch AI, one of the largest tasks in MirrorCode cost $2,600 for a single run. The AI worked continuously for 19 days with no human involvement at all.









