Teams building AI agents typically evaluate them the way they evaluate any other software: by checking whether the output matches expectations. But agents that autonomously choose tools and sequence operations across multiple sources produce behavior that output-level testing cannot fully characterize.

An agent might deliver a well-structured, actionable response while hallucinating, fabricating facts because its tools returned empty results. It might also reach the correct conclusion while skipping the verification steps that a reliable process requires. Because these failures sit below the surface of the final response, catching them requires evaluation that traces the agent’s full execution path: which tools the agent called, what data those tools returned, and whether the response faithfully reflects that data.

Closing this gap requires infrastructure that most agent teams are not staffed to build from scratch. You need test cases with ground truth outcomes, observability instrumentation for capturing tool calls and intermediate state, and metrics that assess faithfulness and tool usage alongside surface accuracy.

Agent-EvalKit is an open-source toolkit (Apache 2.0) that makes this evaluation infrastructure available by integrating with AI coding assistants, including Claude Code, Kiro CLI, and Kilo Code. It brings the entire workflow into your development environment instead of treating evaluation as a separate post-deployment effort. You describe your evaluation goals in natural language, and the toolkit handles each phase, from reading your agent’s source code and generating targeted test cases through running evaluations and producing a report with improvement recommendations that reference specific locations in your code base. The sections that follow walk through how Agent-EvalKit works across its six evaluation phases, using a travel research agent built with the Strands Agents SDK and Amazon Bedrock as a running example.