Evaluating Agents With an LLM-as-Judge Harness (Without Kidding Yourself About It)

Key Takeaways

You can't unit-test a coach agent the way you test a pure function — the output is non-deterministic and "good" is a judgment call, not an assertion.

An LLM-as-judge harness lets you grade a whole test set automatically against a rubric, which is the only way evaluation stays sustainable when you are working solo.

But the judge is itself a fallible model. If you don't design around its known biases — position, verbosity, self-preference, and quiet drift when the judge model updates — you build a green dashboard that means nothing.

The mitigations that actually work are mechanical, not prompt-magic: shuffle order on every pairwise call, pin the judge version, keep a small human-labelled anchor set, and re-check the judge against it.
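The mechanical mitigations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Judge` callable signature, the `"first"`/`"second"` verdict convention, the `JUDGE_MODEL` string, and the anchor tuple layout are all hypothetical names chosen for the example. A real harness would wrap an actual model call behind the `Judge` interface.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical pinned identifier (an assumption for this sketch). Use a dated
# snapshot name rather than a floating alias so the judge cannot drift silently.
JUDGE_MODEL = "example-judge-2024-01-01"

# Assumed interface: a judge sees (prompt, first_response, second_response)
# and answers "first" or "second".
Judge = Callable[[str, str, str], str]

# Assumed anchor layout: (prompt, response_a, response_b, human_label),
# where human_label is "a" or "b".
Anchor = Tuple[str, str, str, str]


def judge_pair(judge: Judge, prompt: str, a: str, b: str, rng: random.Random) -> str:
    """Return "a" or "b", showing the candidates to the judge in random order."""
    swapped = rng.random() < 0.5
    first, second = (b, a) if swapped else (a, b)
    verdict = judge(prompt, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"


def check_judge(judge: Judge, anchors: Sequence[Anchor], rng: random.Random) -> dict:
    """Score the judge against the human-labelled anchor set."""
    if not anchors:
        raise ValueError("anchor set is empty")
    hits = sum(
        judge_pair(judge, prompt, a, b, rng) == human
        for prompt, a, b, human in anchors
    )
    # Record the pinned judge version next to the score so runs stay comparable.
    return {"judge_model": JUDGE_MODEL, "agreement": hits / len(anchors)}


# Two stub judges standing in for a real model call.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_longer(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"  # pure verbosity bias
```

With shuffling in place, a judge that only ever picks the first slot lands near chance on the anchor set instead of looking consistently right or wrong, which is what makes position bias visible. Shuffling does nothing for the verbosity-biased stub, since it follows the longer answer wherever it sits; that kind of bias only shows up as disagreement with the human labels. Re-running `check_judge` whenever `JUDGE_MODEL` changes is one way to catch drift before it reaches the dashboard.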
