This is the third in a series of eight posts on the false assumptions teams make when building with generative AI. Fallacy #1 covered why faster generation doesn't mean faster engineering. Fallacy #2 covered why plausible isn't correct. This post covers why using one AI to check another doesn't solve the problem — it doubles it.
The Fallacy
"If the AI makes mistakes, use another AI to check its work."
Huang et al. (ICLR 2024) showed that LLMs cannot reliably self-correct their reasoning without external feedback, and that in some cases self-correction makes the output worse. LLM-as-judge is a special case of this: the same class of system evaluates its own output using the same kind of reasoning that produced the errors. Formal verifiers, schema validators, and dissimilar reasoning engines, by contrast, provide the external feedback the paper says is required.
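To make "external feedback" concrete, here is a minimal sketch of a schema validator in Python. It is illustrative only: the refund payload, the field names, and the `validate_output` helper are all assumptions invented for this example, not part of any particular system. The point is that the check is deterministic and shares none of the model's reasoning, so it cannot be talked into agreeing with a plausible but wrong answer.

```python
import json

# Hypothetical schema for illustration; these field names are assumptions.
REQUIRED_FIELDS = {"order_id": str, "refund_amount": float, "currency": str}
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


def validate_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = []
    # Structural checks: every required field is present with the right type.
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"{field} must be {expected.__name__}")

    # Value checks: rules that need no judgment from a model.
    amount = data.get("refund_amount")
    if isinstance(amount, float) and amount < 0:
        errors.append("refund_amount must be non-negative")
    currency = data.get("currency")
    if isinstance(currency, str) and currency not in ALLOWED_CURRENCIES:
        errors.append(f"unsupported currency: {currency}")
    return errors


if __name__ == "__main__":
    model_output = '{"order_id": "A1", "refund_amount": -3.0, "currency": "XYZ"}'
    for problem in validate_output(model_output):
        print(problem)
```

A validator like this only catches the errors it has rules for, so it complements rather than replaces the other external checks named above. What it contributes is a verdict that does not depend on how convincing the output sounds.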
Why it's tempting









