Function-calling eval was a 2024 problem. Tool-using agents are the 2026 one.

Tool-selection accuracy logs 1.0 while the task fails. Here's why one number hides three different failures, and the four-layer eval stack I rebuilt to find the actual root cause.

Wednesday, June 3, 2026

TL;DR

Tool-selection accuracy (did the agent call the right function?) misses production failures in argument validation, result usage, and error recovery; each needs its own measurement. End-to-end success is multiplicative: an agent with 95% per-step accuracy over 8 steps completes only about 66% of tasks (0.95^8 ≈ 0.66, assuming steps fail independently). Teams that omit trajectory-level evaluation therefore ship agents that pass their evals and still fail catastrophically in production.
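The 66% figure is easy to check by hand. Here is a minimal sketch, assuming every step succeeds independently with the same probability; that is a simplification, and the function name `end_to_end_success` is mine, not part of any library.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all steps succeed, ASSUMING steps are independent
    and share the same per-step accuracy (a simplifying assumption)."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # 95% per-step accuracy over 8 steps, the figures quoted in the TL;DR
    print(f"{end_to_end_success(0.95, 8):.2%}")  # prints 66.34%
```

Real trajectories rarely have independent steps, since one bad argument tends to poison everything after it, so treat the product as a rough estimate rather than a measurement.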


Here's a trace that reset how I think about evaluating tool-calling agents.

An agent tries to book a flight. It calls search_flights with departure_date="next Friday". The endpoint expected an ISO date, so it returns a 400. The agent retries the same string four times, then apologizes to the user and gives up.
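To make that trace concrete, here is a rough sketch of scoring it on more than one axis. Everything in it is an assumption for illustration: the `ToolCall` record, the `score_trace` helper, and the idea that the trace is stored as a list of calls with HTTP status codes. Only `search_flights`, `departure_date`, the 400, and the four identical retries come from the trace itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ToolCall:
    # Hypothetical trace record; the field names are illustrative.
    name: str
    args: dict
    status: int  # HTTP status returned by the tool endpoint


def is_iso_date(value: str) -> bool:
    """True if the string parses as an ISO calendar date such as 2026-06-05."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def score_trace(calls: list[ToolCall], expected_tool: str) -> dict:
    """Score the first call's tool choice and argument, and count retries
    that repeat a failed call with exactly the same name and arguments."""
    first = calls[0]
    identical_retries = sum(
        1
        for prev, cur in zip(calls, calls[1:])
        if prev.status >= 400 and cur.name == prev.name and cur.args == prev.args
    )
    return {
        "tool_selected": first.name == expected_tool,
        "args_valid": is_iso_date(first.args.get("departure_date", "")),
        "identical_retries_after_error": identical_retries,
    }


# Reconstruction of the trace above: one call plus four identical retries.
trace = [
    ToolCall("search_flights", {"departure_date": "next Friday"}, 400)
    for _ in range(5)
]
report = score_trace(trace, "search_flights")
print(report)
```

On this trace the report shows the right tool, an invalid argument, and four identical retries after an error. A lone tool-selection score would show only the first of those.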

Now the part that actually bothered me. Tool selection was correct. The model picked the right function out of a registry of 28. My tool-selection accuracy logged a clean 1.0. The aggregate task-completion logged a 0. And neither number told me which of three things broke:

the argument was wrong,

the model never read the 400 body, or
