We spent two years teaching everyone that agents are non-deterministic. Same prompt, different output, every run. Fine. We internalized it. We stopped asserting equality, we built model-as-judge evals, we put them in CI.

And then we quietly assumed the evals were deterministic. They are not.

Your eval suite is a non-deterministic system grading another non-deterministic system. If you haven't measured how much your own grader wobbles, you don't have a quality gate. You have a coin flip wearing a lab coat.
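
Measuring the wobble can be as simple as this minimal sketch: run the judge many times on one fixed response and count how often it disagrees with its own majority verdict. The names `flip_rate` and `make_noisy_judge` are hypothetical, and the noisy judge is a seeded random stand-in for a real model call, not any particular vendor's API.

```python
import random
from collections import Counter
from typing import Callable

Judge = Callable[[str], str]  # takes a response, returns "PASS" or "FAIL"


def flip_rate(judge: Judge, response: str, runs: int = 20) -> float:
    """Fraction of runs that disagree with the majority verdict on one fixed input."""
    verdicts = Counter(judge(response) for _ in range(runs))
    majority_count = verdicts.most_common(1)[0][1]
    return 1 - majority_count / runs


def make_noisy_judge(p_pass: float, seed: int = 0) -> Judge:
    """Hypothetical stand-in for a model-as-judge call: returns PASS with probability p_pass."""
    rng = random.Random(seed)

    def judge(response: str) -> str:
        return "PASS" if rng.random() < p_pass else "FAIL"

    return judge


if __name__ == "__main__":
    # Assumption: in a real suite, `judge` would wrap your actual grader call.
    judge = make_noisy_judge(p_pass=0.8)
    print(f"flip rate: {flip_rate(judge, 'Your refund has been issued.'):.2f}")
```

A grader that never changes its mind scores 0.0. One that splits evenly between PASS and FAIL scores 0.5, which is the coin flip.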

The bug that taught me this

We had a model-as-judge check on a support agent: "Does the response correctly resolve the customer's stated issue? Return PASS or FAIL." Green for weeks. Then a release went out, the dashboard stayed green, and complaints spiked anyway.