What I actually found when I set out to test heterogeneous AI code review.
For the last couple of months, I've been running a two-agent code review workflow in my terminal. Left window: Claude Code doing implementation. Right window: a second Claude Code instance prompted to be adversarial, specifically tasked with finding problems in whatever the left window produced. It worked surprisingly well.
A few weeks ago, someone sent me a Lenny's Podcast episode featuring Dan Shipper. One of the things he talked about was using competing frontier models for coding and reviewing, the idea being that different model lineages have different blind spots, and a reviewer trained differently than the author catches things the author misses. That felt like an important gap in the workflow. I set out to fill it.
I went in expecting to write a post about model diversity as a reliability strategy. That's not what happened.
The plan that didn't survive contact with the code







