Here's a trace that reset how I think about evaluating tool-calling agents.
An agent tries to book a flight. It calls search_flights with departure_date="next Friday". The endpoint expected an ISO date, so it returns a 400. The agent retries the same string four times, then apologizes to the user and gives up.
Now the part that actually bothered me. Tool selection was correct. The model picked the right function out of a registry of 28. My tool-selection accuracy logged a clean 1.0. The aggregate task-completion logged a 0. And neither number told me which of three things broke:
the argument was wrong,
the model never read the 400 body, or






