there's a reflex in AI tooling that says: when in doubt, reach for the biggest model. bigger model, better review, fewer escaped bugs. it feels obviously true. but if you actually write down the probabilities, the reflex falls apart for a large class of problems. three smaller, cheaper reviews — read together correctly — can beat one expensive one, and not by a little.

this isn't a vibes argument. it's the same math that makes a RAID array more reliable than a single expensive disk, and that lets ensemble classifiers beat single models in practice. let me show the numbers, then the catch, then how to actually wire it up.

the single-reviewer ceiling

say your best, most expensive model catches a real bug 80% of the time on a given diff. that's genuinely good. it also means it misses one in five. run it again on the same diff and you don't get to 96%, which is the 1 - 0.2 × 0.2 you'd expect if the two passes missed independently. you get back to 80%, because the second pass has the same blind spots as the first. a model's errors aren't random noise you can average away by re-rolling. they're systematic. the bug it can't see, it can't see twice.

so the ceiling on a single reviewer isn't set by how many times you ask. it's set by the model's correlation with itself, which is 1. you are stuck at 80%.
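
to make the arithmetic concrete, here's a minimal sketch of the two extremes: passes that miss independently, and passes that share every blind spot. the function names are made up for illustration, and the 80% figure is the example number from above, not a measurement. real reruns probably sit somewhere between the two cases; this only shows the bounds.

```python
def catch_rate_independent(p: float, passes: int) -> float:
    """Chance that at least one of `passes` reviews catches the bug,
    assuming each pass misses independently of the others."""
    if passes < 1:
        raise ValueError("need at least one pass")
    return 1.0 - (1.0 - p) ** passes


def catch_rate_fully_correlated(p: float, passes: int) -> float:
    """Chance of a catch when every pass shares the same blind spots
    (correlation 1): a rerun misses exactly what the first pass missed,
    so extra passes add nothing."""
    if passes < 1:
        raise ValueError("need at least one pass")
    return p


if __name__ == "__main__":
    p = 0.80  # assumed single-pass catch rate, taken from the example above
    for n in (1, 2, 3):
        independent = catch_rate_independent(p, n)
        correlated = catch_rate_fully_correlated(p, n)
        print(f"passes={n}  independent={independent:.3f}  correlated={correlated:.3f}")
```

the independent column climbs from 0.800 to 0.960 to 0.992 as passes go from one to three. the correlated column stays at 0.800 no matter how many times you ask, which is the ceiling described above.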