Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

Everybody's eval stack has the same load-bearing assumption nobody audits: that the model-as-judge is telling the truth.

You wrote deterministic checks for the easy stuff â€” schema valid, no PII, latency under budget. Then you hit the subjective stuff â€” "is this answer actually helpful," "did the agent follow the user's intent," "is this summary faithful to the source" â€” and you reached for an LLM judge, because what else are you going to do. Now a model grades your model. And here's the part that should keep you up at night: you never validated the grader. You're shipping or blocking releases based on a 0â€“10 score from a prompt you wrote in twenty minutes, and you have no idea if that score correlates with anything a human would agree with.

I've watched teams trust a green judge dashboard for months, then discover the judge was handing out 8s to answers users hated. The judge wasn't broken in an obvious way. It was just uncalibrated, and uncalibrated graders fail silently â€” which is the worst way to fail.

The judge is a model in production, so treat it like one

Say it plainly: your LLM judge is a non-deterministic model making consequential decisions in your release pipeline. That is the exact thing you spent the last year learning to distrust. Somehow when it's wearing a lab coat and called an "evaluator," people grant it authority they'd never give the agent itself.

Everybody's eval stack has the same load-bearing assumption nobody audits: that the model-as-judge is telling the truth.

The judge is a model in production, so treat it like one

Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

Related reading

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

A RAG evaluator that admits what it can't judge

Evaluating Agents With an LLM-as-Judge Harness (Without Kidding Yourself About…

Exploring LLM-as-a-Judge

An open source LLM eval tool with two independent quality signals

Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

Related reading

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

A RAG evaluator that admits what it can't judge

Evaluating Agents With an LLM-as-Judge Harness (Without Kidding Yourself About…

Exploring LLM-as-a-Judge

An open source LLM eval tool with two independent quality signals

Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce