MirrorCode evaluates AI's long-horizon coding capabilities with 22 open-source tasks

Here’s something that should make every software engineer pause mid-coffee-sip: an AI model just reimplemented a 16,000-line bioinformatics toolkit. Not refactored it. Not debugged it. Rebuilt the whole thing from scratch, in a different programming language, passing 99.95% of over 2,000 tests.

MirrorCode, a new benchmark co-developed by AI evaluation organizations METR and Epoch AI, is designed to measure something that most existing coding benchmarks don’t even attempt. Instead of asking AI to solve neat little algorithmic puzzles, it asks a much bigger question: can an AI agent autonomously reimplement an entire real-world software program without ever seeing the source code?

How MirrorCode actually works

The benchmark selects real command-line interface programs, gives the AI agent access only to the program’s behavior (inputs and outputs, no source code), and asks it to build a functional replica.

The preliminary results, published on April 10, 2026, cover more than 20 target programs spanning a wide range of domains. Unix utilities, bioinformatics tools, interpreters, static analysis software, cryptography implementations, and compression algorithms all made the cut. Each reimplementation is evaluated through hundreds to thousands of end-to-end tests requiring exact output matching. No partial credit. No “close enough.”
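To make "exact output matching" concrete, here is a minimal sketch of what such an end-to-end check could look like. It is not MirrorCode's actual harness: the names `Case`, `observe`, and `pass_rate` are hypothetical, and comparing only the exit code and standard output is an assumption made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One end-to-end test: command-line arguments plus bytes fed to stdin."""
    args: tuple = ()
    stdin: bytes = b""


def observe(command, case, timeout=10.0):
    """Run a program and return the behavior we compare: (exit code, stdout).

    Assumption: stderr is ignored. A real harness may well compare it too.
    """
    proc = subprocess.run(
        list(command) + list(case.args),
        input=case.stdin,
        capture_output=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference, candidate, cases):
    """Fraction of cases where the candidate matches the reference byte for byte."""
    if not cases:
        raise ValueError("need at least one test case")
    # A case either matches exactly or fails; there is no partial credit.
    passed = sum(observe(reference, c) == observe(candidate, c) for c in cases)
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-in programs: tiny Python one-liners, not real benchmark targets.
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    title = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().title())"]
    cases = [Case(stdin=b"A"), Case(stdin=b"hello")]
    print(pass_rate(upper, upper, cases))  # identical behavior on every case
    print(pass_rate(upper, title, cases))  # differs on one of the two cases
```

The point of the sketch is the comparison itself: the candidate never sees the reference's source, only what it prints, and a single differing byte fails the case.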
