Here's a number worth sitting with. In LangChain's 2026 State of Agent Engineering report, which surveyed more than 1,300 practitioners, 89% of teams running agents in production have implemented observability — but only 52% have implemented evaluations. That 37-point gap is where most agent quality quietly dies.
If you've shipped an LLM agent, you already feel this gap even if you've never named it. You have traces. You have dashboards. You can replay any session and watch the agent reason, call tools, and respond. And yet, when someone asks "is it actually getting better or worse this week?", the honest answer is a shrug. You can see everything that happened and still have no idea whether any of it was good.
That's the difference between observability and evaluation, and conflating the two is the most expensive mistake in agent engineering right now.
Observability tells you what happened. Evals tell you whether it was right.
Observability is a microscope. It shows you the trajectory: the agent received a query, retrieved three documents, called the search_orders tool with these arguments, got this response, and produced this answer. Invaluable for debugging. Completely silent on the question that matters to your users — was the answer correct, helpful, and safe?







