I Built a 131-Test Eval Harness Before Writing New Features. Here's the Silent Failure It Caught.

Originally published on AIdeazz — cross-posted here with canonical link.

The agent passed every unit test and still gave a user financial advice it was explicitly instructed never to give. No exception thrown, no log line in red, no failed assertion. The function returned a clean 200 and a well-formed string. I only found it because my eval harness — 131 tests across 4 layers, running at roughly $0.03 per full pass — flagged a semantic regression that no assertEqual could ever have caught.

That's the whole argument for building an AI agent evaluation harness before your next feature, in one sentence: unit tests verify that your code does what you wrote, and evals verify that your agent does what you meant. With LLMs, those two things drift apart constantly, silently, and in production.
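To make the distinction concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the reply text, the function names, and especially the regex list, which is a crude stand-in for whatever semantic judge a real harness would use (an LLM-graded rubric, for example). It only shows that a shape check and a policy check can disagree about the same string.

```python
import re

# Hypothetical agent output: well-formed, non-empty, and in violation of policy.
AGENT_REPLY = "Sure! You should move your savings into index funds this month."

# Assumption: a crude regex list standing in for a real semantic judge.
ADVICE_PATTERNS = [
    r"\byou should (buy|sell|invest|move your savings)\b",
    r"\bI recommend (buying|selling|investing)\b",
]


def unit_style_check(reply: str) -> bool:
    """What a typical unit test verifies: the type and shape of the output."""
    return isinstance(reply, str) and len(reply.strip()) > 0


def eval_style_check(reply: str) -> bool:
    """What an eval verifies: the reply respects the 'no financial advice' rule."""
    return not any(
        re.search(pattern, reply, flags=re.IGNORECASE) for pattern in ADVICE_PATTERNS
    )


unit_ok = unit_style_check(AGENT_REPLY)
eval_ok = eval_style_check(AGENT_REPLY)
print(f"unit check passed: {unit_ok}")
print(f"eval check passed: {eval_ok}")
```

The same reply passes the first check and fails the second, which is the gap an eval layer exists to cover.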

Why unit tests structurally can't catch this

A unit test checks a deterministic contract. Input X produces output Y. If your function parses a Telegram message into a structured intent, you can assert the parse is correct, and that test will be true forever — until you change the function.
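As an illustration of that kind of deterministic contract, the sketch below parses a slash-command message into a structured intent and asserts on the result. The `Intent` type, the `parse_intent` function, and the message format are hypothetical; the article does not show its actual parser.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intent:
    """Hypothetical structured intent extracted from a chat message."""
    command: str
    args: Tuple[str, ...]


def parse_intent(message: str) -> Intent:
    """Split a '/command arg1 arg2' message into a structured intent.

    Messages that do not start with a slash fall back to a generic 'chat' intent.
    """
    parts = message.strip().split()
    if not parts or not parts[0].startswith("/"):
        return Intent(command="chat", args=tuple(parts))
    return Intent(command=parts[0][1:].lower(), args=tuple(parts[1:]))


def test_parse_intent() -> None:
    # Same input, same output, every run: this is what a unit test can promise.
    assert parse_intent("/remind 9am standup") == Intent("remind", ("9am", "standup"))
    assert parse_intent("hello there") == Intent("chat", ("hello", "there"))


test_parse_intent()
```

This test stays green until someone edits `parse_intent`. It says nothing about what the agent does with the intent once a model is in the loop.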

Related reading

My eval harness paid for itself on the first run: 0.57 0.96, two bugs no unit…

Token-level eval harness for tool-calling agents: what we wired up

AI Agent Evaluation Harness: Test Real Workflows Before Users Do

Stop Engineering Prompts: How an Eval-First Harness Let Us Ship 25 Algorithm…

The most dangerous line of code your AI agent writes is the test that passes

Agent Series (21): Harness Testing — 45 Tests, How They're Designed, and What…
