The math of multi-model consensus: when 3 cheap reviews beat 1 expensive one

there's a reflex in AI tooling that says: when in doubt, reach for the biggest model. bigger model, better review, fewer escaped bugs. it feels obviously true. but if you actually write down the probabilities, the reflex falls apart for a large class of problems. three smaller, cheaper reviews — read together correctly — can beat one expensive one, and not by a little.

this isn't a vibes argument. it's the same math that makes RAID arrays more reliable than a single expensive disk, and ensemble classifiers beat single models in practice. let me show the numbers, then the catch, then how to actually wire it up.

the single-reviewer ceiling

say your best, most expensive model catches a real bug 80% of the time on a given diff. that's genuinely good. it also means it misses one in five. run it again on the same diff and you don't get to 96%, which is the 1 − 0.2 × 0.2 you'd expect if the two misses were independent. you get back to 80%, because the second pass has the same blind spots as the first. a model's errors aren't random noise you can average away by re-rolling. they're systematic. the bug it can't see, it can't see twice.

so the ceiling on a single reviewer isn't set by how many times you ask. it's set by the model's correlation with itself, which is 1. you are stuck at 80%.
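to make the ceiling concrete, here's a minimal sketch of the arithmetic. it assumes a deliberately simple model that isn't from any library: with probability rho every pass shares one outcome (the blind spot is common to all of them), and otherwise each pass is an independent draw. under that assumption rho is exactly the pairwise correlation between passes. the function name and parameters are made up for illustration.

```python
def catch_probability(p: float, passes: int, rho: float) -> float:
    """Chance that at least one of `passes` reviews catches the bug.

    Assumed toy model (not a measured property of any real model):
      - with probability `rho`, all passes share a single outcome,
        so repeating the review adds nothing;
      - with probability `1 - rho`, passes are independent draws.
    `p` is the per-pass catch rate, e.g. 0.8 for the 80% reviewer.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    if passes < 1:
        raise ValueError("passes must be >= 1")

    # Independent case: the bug escapes only if every pass misses it.
    independent = 1.0 - (1.0 - p) ** passes
    # Fully correlated case: all passes succeed or fail together.
    correlated = p
    return rho * correlated + (1.0 - rho) * independent


if __name__ == "__main__":
    # Two passes of the 80% reviewer at different self-correlations.
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho:.1f}: {catch_probability(0.8, 2, rho):.2f}")
```

at rho = 0 the second pass lifts you to 0.96. at rho = 1, the same-model case described above, you stay at 0.80 no matter how large `passes` gets.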
