When I started learning about AI agent evaluation, I thought evals were mostly about checking the final answer.

But agents are not just final-answer machines.

They are systems made of smaller parts:

router

tools