Here is the test that quietly destroys most agent codebases:

```typescript
expect(await agent.run("summarize this ticket")).toBe(EXPECTED_SUMMARY);
```

It passes on Tuesday. It fails on Wednesday because the model reworded one sentence. So someone adds a .trim(), then a lowercase, then a regex, and three weeks later the assertion is a 40-line normalization function that still flakes twice a week. Eventually the team does the only rational thing left: they delete the test. Now the agent has no tests at all, and everyone agrees "agents are just hard to test."
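
To make the failure mode concrete, here is a minimal sketch of why the normalization treadmill never ends. The `normalize` helper and the three sample summaries are invented for illustration and are not taken from any real agent or codebase. The point is only that string cleanup can absorb formatting drift but not a reworded sentence.

```typescript
// Hypothetical normalizer of the kind that accumulates in such tests:
// a trim, then a lowercase, then a couple of regexes.
function normalize(text: string): string {
  return text
    .trim()
    .toLowerCase()
    .replace(/[^\w\s]/g, "") // drop punctuation
    .replace(/\s+/g, " "); // collapse runs of whitespace
}

// Made-up model outputs for the same ticket on different days.
const tuesday = "The customer cannot log in after resetting their password.";
const wednesdayFormatting = "  The customer cannot log in after resetting their password  ";
const wednesdayReworded = "After a password reset, the customer is unable to log in.";

// Formatting drift is absorbed by the normalizer...
const survivesFormatting = normalize(tuesday) === normalize(wednesdayFormatting);

// ...but a reworded sentence with the same meaning still fails the comparison.
const survivesRewording = normalize(tuesday) === normalize(wednesdayReworded);

console.log(survivesFormatting); // true
console.log(survivesRewording); // false
```

No amount of extra string cleanup turns the second comparison into a pass, because the two summaries differ in wording, not in formatting.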