Here’s something that should make every software engineer pause mid-coffee-sip: an AI model just reimplemented a 16,000-line bioinformatics toolkit. Not refactored it. Not debugged it. Rebuilt the whole thing from scratch, in a different programming language, passing 99.95% of over 2,000 tests.

MirrorCode, a new benchmark co-developed by AI evaluation organizations METR and Epoch AI, is designed to measure something that most existing coding benchmarks don’t even attempt. Instead of asking AI to solve neat little algorithmic puzzles, it asks a much bigger question: can an AI agent autonomously reimplement an entire real-world software program without ever seeing the source code?

How MirrorCode actually works

The benchmark selects real command-line interface programs, gives the AI agent access only to the program’s behavior (inputs and outputs, no source code), and asks it to build a functional replica.

The preliminary results, published on April 10, 2026, cover more than 20 target programs across a wide range of domains: Unix utilities, bioinformatics tools, interpreters, static analysis software, cryptography implementations, and compression algorithms. Each reimplementation is evaluated through hundreds to thousands of end-to-end tests requiring exact output matching. No partial credit. No “close enough.”
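
To make "exact output matching" concrete, here is a minimal Python sketch of how such a check could be scored. It is an illustration, not MirrorCode's actual harness, which this article does not describe. The names `Case`, `run`, and `pass_rate` are hypothetical, and comparing only the exit code and standard output is an assumption.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Case:
    """One end-to-end test: command-line arguments plus data fed on stdin."""
    args: list[str] = field(default_factory=list)
    stdin: bytes = b""


def run(cmd: list[str], case: Case) -> tuple[int, bytes]:
    """Run a command on one case and return (exit code, raw stdout bytes)."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=30,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference: list[str], candidate: list[str], cases: list[Case]) -> float:
    """Fraction of cases where the candidate matches the reference byte for byte."""
    # A case passes only on an exact match: no partial credit for near misses.
    passed = sum(run(reference, c) == run(candidate, c) for c in cases)
    return passed / len(cases)


# Toy stand-ins for an original program and its reimplementation (assumptions):
# the "original" upper-cases stdin, the "replica" swaps case instead.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
CANDIDATE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().swapcase())"]

CASES = [Case(stdin=b"hello\n"), Case(stdin=b"abc\n"),
         Case(stdin=b""), Case(stdin=b"Hello\n")]

if __name__ == "__main__":
    # The replica agrees on all-lowercase and empty input but fails on "Hello".
    print(f"pass rate: {pass_rate(REFERENCE, CANDIDATE, CASES):.2%}")
```

The toy replica looks right on most inputs and still loses a full test on the one input where its output differs by a single character. That strictness is what gives a score like 99.95% its weight.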