Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.


Thursday, June 4, 2026

959 words · ~4 min read

TL;DR: Smarter models are better judges — unless they're judging their own output. Then they defend wrong answers 86% of the time. Capability makes the bias worse, not better. The only structural fix: generator and judge from different model families.
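A minimal sketch of what that structural fix could look like as a pipeline guard, assuming you can map each model name to a family. The prefixes, family labels, and function names below are hypothetical placeholders, not real model identifiers or any library's API.

```python
# Hypothetical prefix-to-family table. Replace with whatever naming
# scheme your own pipeline uses; these entries are illustrative only.
FAMILY_BY_PREFIX = {
    "alpha-": "family-a",
    "beta-": "family-b",
}


def model_family(model_name: str) -> str:
    """Return the family label for a model name, or raise if it is unknown."""
    for prefix, family in FAMILY_BY_PREFIX.items():
        if model_name.startswith(prefix):
            return family
    raise ValueError(f"unknown model family for {model_name!r}")


def check_judge_independence(generator: str, judge: str) -> None:
    """Refuse to run an eval where generator and judge share a family."""
    gen_family = model_family(generator)
    judge_family = model_family(judge)
    if gen_family == judge_family:
        raise ValueError(
            f"generator {generator!r} and judge {judge!r} are both "
            f"{gen_family}; pick a judge from a different family"
        )


# A cross-family pairing passes silently.
check_judge_independence("alpha-small", "beta-large")
```

Note that upgrading to a larger model in the same family would still fail this check, which is the point: the guard tests where the judge comes from, not how capable it is.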

Part 1: Your judge is biased. 17 out of 20 models. True negative rate: 42.5%. You read that and did the rational thing.

Of course you upgraded.

Old model biased. New model smarter. Smarter means better. Better means fixed.

# The "fix" everyone tries first

Related reading

Part 6 of 6: How to Build Pipelines That Don't Gaslight Themselves.

Part 1 of 6: Your Pipeline Has a Judge. The Judge Is Cooked.

Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

A Better LLM Judge? The Rubric Made My Small Model Worse

Part 3 of 6: Every Agent Passed. The System Failed.
