Evaluate AI agent quality with LLM-as-Judge and trajectory analysis. Catch silent failures, wasted tokens, and hallucinations before production. Python tutorial with code.
Your AI agent just returned "BA117 at 7PM ($450)" - correct answer, 5-star rating. What you didn't see: it made 3 unnecessary API calls and hallucinated a price check. Traditional pass/fail metrics rated this "perfect."
This is the silent failure problem. AI agents return plausible answers while making unnecessary API calls, hallucinating facts, or following unsafe reasoning paths. Binary metrics catch none of this.
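To make the gap concrete, here is a minimal, framework-agnostic sketch of the opening example. The `ToolCall` record, the tool names (`search_flights`, `check_price`), and the argument values are all hypothetical, invented for illustration; none of them are Strands types or real captured data. It only shows that a check on the final answer can pass while a check on the path fails.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in a hypothetical agent trajectory (illustrative, not a Strands type)."""
    tool: str
    args: tuple  # (key, value) pairs, so identical calls compare and hash equal


def binary_pass(answer: str, expected: str) -> bool:
    """Traditional pass/fail metric: inspects only the final answer."""
    return expected in answer


def redundant_calls(trajectory: list[ToolCall]) -> int:
    """Count calls that exactly repeat an earlier call (same tool, same args)."""
    counts = Counter(trajectory)
    return sum(n - 1 for n in counts.values())


def ungrounded_tools(claimed: set[str], trajectory: list[ToolCall]) -> set[str]:
    """Tools the agent says it used that never appear in the trajectory."""
    return claimed - {call.tool for call in trajectory}


# Made-up trajectory: the same search issued four times, and no price check at all.
search = ToolCall("search_flights", (("dest", "LHR"),))
trajectory = [search, search, search, search]
answer = "BA117 at 7PM ($450)"
claimed = {"search_flights", "check_price"}

print("binary metric passes:", binary_pass(answer, "BA117"))
print("redundant calls:", redundant_calls(trajectory))
print("claimed but never called:", ungrounded_tools(claimed, trajectory))
```

Exact-match duplicate counting is the simplest possible redundancy signal; a real evaluator would also need to judge calls that differ in arguments but add no new information.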
This post covers the two foundational evaluation techniques that every agent needs: LLM-as-Judge for output quality and Trajectory Evaluation for process quality, where the trajectory is the step-by-step path an agent takes. These form the base for hallucination detection, tool-use evaluation, safety alignment, and cost optimization, all covered in later posts in this series.
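As a rough preview of the first technique, the sketch below shows the general shape of LLM-as-Judge: a rubric prompt goes to a model, and the reply is parsed into a score. The rubric wording, the `judge_output` helper, and the `call_model` parameter are assumptions made for illustration, not the strands-agents-evals API. A stub stands in for the model so the example runs offline.

```python
import json
from typing import Callable

# Hypothetical rubric; real rubrics are usually more specific to the task.
RUBRIC = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (poor) to 5 (excellent) for correctness and helpfulness.
Reply with JSON only: {{"score": <int>, "reason": "<one sentence>"}}"""


def judge_output(question: str, answer: str, call_model: Callable[[str], str]) -> dict:
    """Ask a judge model to score an answer; return a validated verdict dict."""
    prompt = RUBRIC.format(question=question, answer=answer)
    raw = call_model(prompt)
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Judges can reply with malformed output; treat that as "no score".
        return {"score": None, "reason": "unparseable judge reply"}
    if not 1 <= score <= 5:
        return {"score": None, "reason": "score out of range"}
    return {"score": score, "reason": str(verdict.get("reason", ""))}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without network access."""
    return '{"score": 4, "reason": "Correct flight, price not verified."}'


verdict = judge_output("Find an evening flight to London.", "BA117 at 7PM ($450)", fake_judge)
print(verdict)
```

Passing the model call in as a plain function keeps the sketch independent of any provider, and validating the reply matters because a judge is itself a model that can return malformed or out-of-range output.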
Why Strands Agents? Strands Agents provides automatic trajectory capture via hooks and a dedicated evaluation SDK (strands-agents-evals), making it straightforward to demonstrate these patterns. The evaluation techniques shown here apply to any agent framework, including LangGraph, AutoGen, or custom implementations.







