Evaluating Deep Agents using LangSmith on AWS | Amazon Web Services

This post combines lessons from LangChain’s work on evaluating deep agents and Anthropic’s guide to demystifying evals for AI agents into a single practical guide. You will learn how to: 1) apply five evaluation patterns for deep agents, 2) build offline evaluations using pytest and LangSmith, and 3) configure online monitoring for production. The walkthrough uses a text-to-SQL deep agent with Amazon Bedrock and covers the full development-to-production lifecycle.

Thursday, May 28, 2026

This post was co-authored with Karan Singh, Head of Partnerships at LangChain.

Validating AI agent behavior before production is one of the hardest problems in applied AI. Agents are non-deterministic and multi-step, so an error in an early step can affect everything downstream: a single bad tool call can cascade through an entire workflow. LangSmith on AWS gives you the evaluation framework to catch these issues early, track them in production, and continuously improve your agent’s reliability throughout its lifecycle.

Amazon Nova 2 Lite is a fast, cost-effective reasoning model available in Amazon Bedrock. It supports extended thinking with configurable budget levels (low, medium, high) and accepts text, image, video, and document inputs with a 1 million-token context window. Nova 2 Lite handles instruction following, function calling, and code generation well, which makes it a good fit for agentic workloads like the text-to-SQL agent in this post.
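To make the idea of an offline check for a text-to-SQL agent concrete, here is a minimal sketch of an execution-based comparison written as a pytest-style test. It is not the code from this post: `fake_agent`, `build_fixture_db`, `results_match`, and the `orders` table are hypothetical names invented for illustration, and the agent is stubbed with a canned query so the example makes no Amazon Bedrock or LangSmith calls. It assumes that two queries count as equivalent when they return the same rows, regardless of order.

```python
import sqlite3


def build_fixture_db() -> sqlite3.Connection:
    """Create a tiny in-memory database to run queries against (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders (customer, total) VALUES
            ('alice', 30.0), ('bob', 12.5), ('alice', 7.5);
        """
    )
    return conn


def fake_agent(question: str) -> str:
    """Hypothetical stand-in for the real agent; returns a canned SQL string."""
    return "SELECT customer, SUM(total) FROM orders GROUP BY customer"


def results_match(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    """Compare result sets ignoring row order; a query that fails to run is a mismatch."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False
    expected = conn.execute(reference_sql).fetchall()
    return sorted(got) == sorted(expected)


def test_total_per_customer():
    conn = build_fixture_db()
    generated = fake_agent("What is the total order value per customer?")
    reference = (
        "SELECT customer, SUM(total) AS total FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    assert results_match(conn, generated, reference)
```

Comparing executed results rather than SQL strings tolerates harmless differences such as aliases or an added ORDER BY, which matters because a non-deterministic agent rarely produces the same query text twice.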


