TL;DR: Smarter models are better judges, unless they're judging their own output. Then they defend wrong answers 86% of the time. Capability makes the bias worse, not better. The only structural fix: take the generator and the judge from different model families.
Part 1: Your judge is biased. 17 out of 20 models showed it. True negative rate: 42.5%. You read that and did the rational thing.
Of course you upgraded.
Old model biased. New model smarter. Smarter means better. Better means fixed.
# The "fix" everyone tries first








