TL;DR

I built agent-eval, a framework that runs real agentic loops with tool calls against live LLM backends, then evaluates outputs through a three-tier assertion pyramid. I threw 10 adversarial scenarios at 5 models. The best scored 62.5%. The worst scored 34%.

Every model failed the same three tests. That's the interesting part.

The Problem With LLM Evals

Most LLM evaluations test the wrong thing. They check whether a model can answer trivia, write code snippets, or follow formatting instructions. That's like testing a car's paint job instead of its brakes.