Stop Asserting Equality: How to Test Agents When Every Run Is Different

Here is the test that quietly destroys most agent codebases: expect(await agent.run("summarize...

venerdì 12 giugno 2026 New tab

TL;DRAI

Agent nondeterminism breaks exact assertions; test invariants (IDs present, no hallucination) and semantic similarity via embedding cosine distance instead. For production, measure pass rate across N runs (target 90%+), not single assertions—quantifying reliability instead of string-equality tests.

1,202 words~5 min read

Here is the test that quietly destroys most agent codebases:

expect(await agent.run("summarize this ticket")).toBe(EXPECTED_SUMMARY);

Enter fullscreen mode

Exit fullscreen mode

It passes on Tuesday. It fails on Wednesday because the model reworded one sentence. So someone adds a .trim(), then a lowercase, then a regex, and three weeks later the assertion is a 40-line normalization function that still flakes twice a week. Eventually the team does the only rational thing left: they delete the test. Now the agent has no tests at all, and everyone agrees "agents are just hard to test."

Stop Asserting Equality: How to Test Agents When Every Run Is Different

Stop Asserting Equality: How to Test Agents When Every Run Is Different

Related reading

5 ways your CLAUDE.md rules quietly fail

Before And After

Why Your AI Agent Works in Dev and Breaks in Prod

Your Agent Failed in Prod. Good Luck Reproducing It.

Why your agentic system doesn't survive its own success — and what a…

Part 3 of 6: Every Agent Passed. The System Failed.

Related reading

5 ways your CLAUDE.md rules quietly fail

Before And After

Why Your AI Agent Works in Dev and Breaks in Prod

Your Agent Failed in Prod. Good Luck Reproducing It.

Why your agentic system doesn't survive its own success — and what a…

Part 3 of 6: Every Agent Passed. The System Failed.