Part 1 of 6: Your Pipeline Has a Judge. The Judge Is Cooked.


Thursday, June 4, 2026

827 words, ~4 min read

TL;DR: Researchers tested 20 AI models as judges. 17 of the 20 were statistically biased. True negative rate: 42.5%, which means your judge misses bad output more than half the time. If you have an LLM checking another LLM's work, this is your problem.

You probably have this in production right now.

response = await generator.chat(user_query)
review = await evaluator.chat(f"Rate this response 1-10: {response}")
if review.score >= 7:
    ...
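One way to see what a gate like this lets through is to measure the judge's true negative rate on a small human-labeled set: of the responses a human marked as bad, what share did the judge score below the threshold? The sketch below is a minimal illustration, not the researchers' method. The `Labeled` record, the `true_negative_rate` helper, and the six sample rows are all hypothetical and made up for the example; only the 1-10 scale and the threshold of 7 come from the snippet above.

```python
from dataclasses import dataclass


@dataclass
class Labeled:
    """One evaluated response (hypothetical record, for illustration only)."""
    judge_score: int  # score the LLM judge gave, on the 1-10 scale
    human_ok: bool    # human verdict: is the response actually acceptable?


def true_negative_rate(samples, threshold=7):
    """Share of human-labeled bad responses that the judge rejects.

    A response is "rejected" when its score falls below the threshold,
    mirroring the `review.score >= 7` gate.
    """
    bad = [s for s in samples if not s.human_ok]
    if not bad:
        raise ValueError("need at least one human-labeled bad response")
    caught = sum(1 for s in bad if s.judge_score < threshold)
    return caught / len(bad)


# Made-up data: four bad responses, two of which still clear the gate.
samples = [
    Labeled(judge_score=9, human_ok=True),
    Labeled(judge_score=8, human_ok=False),
    Labeled(judge_score=7, human_ok=False),
    Labeled(judge_score=4, human_ok=False),
    Labeled(judge_score=3, human_ok=False),
    Labeled(judge_score=8, human_ok=True),
]

tnr = true_negative_rate(samples)
print(f"true negative rate: {tnr:.2f}")  # prints "true negative rate: 0.50"
```

A rate of 0.50 on this toy set means half of the bad responses passed the gate. The same calculation on real labeled data is what tells you whether the threshold is doing any work.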


Related reading

How to Evaluate AI Agents: LLM-as-Judge Tutorial

Building Evals That Don't Lie: How to Make AI Evaluation Reliable in Production

Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.

LLM-as-a-Judge: The Complete Guide to Automated Evaluation at Scale with Azure…

Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

Reliable, and still wrong
