Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

We spent two years teaching everyone that agents are non-deterministic. Same prompt, different output, every run. Fine. We internalized it. We stopped asserting equality, we built model-as-judge evals, we put them in CI.

And then we quietly assumed the evals were deterministic. They are not.

Your eval suite is a non-deterministic system grading another non-deterministic system. If you haven't measured how much your own grader wobbles, you don't have a quality gate. You have a coin flip wearing a lab coat.

The bug that taught me this

We had a model-as-judge check on a support agent: "Does the response correctly resolve the customer's stated issue? Return PASS or FAIL." Green for weeks. Then a release went out, the dashboard stayed green, and complaints spiked anyway.

And then we quietly assumed the evals were deterministic. They are not.

The bug that taught me this

Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

Related reading

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.

Why your agentic system doesn't survive its own success — and what a…

Part 3 of 6: Every Agent Passed. The System Failed.

Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk

Put Your Agent Evals in CI or Stop Calling Them Evals

Related reading

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.

Why your agentic system doesn't survive its own success — and what a…

Part 3 of 6: Every Agent Passed. The System Failed.

Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk

Put Your Agent Evals in CI or Stop Calling Them Evals