Originally published on AIdeazz — cross-posted here with canonical link.
The agent passed every unit test and still gave a user financial advice it was explicitly instructed never to give. No exception thrown, no log line in red, no failed assertion. The function returned a clean 200 and a well-formed string. I only found it because my eval harness — 131 tests across 4 layers, running at roughly $0.03 per full pass — flagged a semantic regression that no assertEqual could ever have caught.
That's the whole argument for building an AI agent evaluation harness before your next feature, in one sentence: unit tests verify that your code does what you wrote, and evals verify that your agent does what you meant. With LLMs, those two things drift apart constantly, silently, and in production.
Why unit tests structurally can't catch this
A unit test checks a deterministic contract. Input X produces output Y. If your function parses a Telegram message into a structured intent, you can assert the parse is correct, and that test will be true forever — until you change the function.






