Special thanks to everyone who contributed to this launch: Stephanie Yuan, Lefan Zhang, Ritvika Pillai, Irene Wang, Carter Williams, T.J Ariyawansa, Gitika Jha, Shoaib Javed and the product leadership from Vivek Singh.

Moving prototype agents to production requires measuring quality across multiple dimensions. Amazon Bedrock AgentCore Evaluations provides large language model (LLM)-as-a-Judge checks and extensible code-based evaluators, so you can capture the domain-specific requirements your agentic application must meet.

In financial services and other specialized domains, the critical quality dimensions often extend beyond language. A market-intelligence agent must quote stock prices within a configurable live band, follow a mandatory broker-identification workflow before accessing financial profiles, return tool outputs that conform to a strict JSON schema, and withhold personally identifiable information (PII). These checks call for deterministic code that produces the same result on identical input. Running them through LLM-as-a-Judge can also be expensive when an objective piece of code is the more direct choice.
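To make the idea of a deterministic check concrete, here is a minimal sketch of two of the checks described above: a price-band test and a PII scan. The function names, the regular expressions, and the interpretation of the band as a percentage around a reference price are all assumptions for illustration; they are not part of AgentCore Evaluations.

```python
import re

# Hypothetical PII patterns for illustration only; a production
# evaluator would rely on a vetted, domain-specific pattern set.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def price_within_band(quoted: float, reference: float, band_pct: float) -> bool:
    """Return True when `quoted` is within `band_pct` percent of `reference`."""
    if reference <= 0:
        return False
    return abs(quoted - reference) <= reference * band_pct / 100.0


def contains_pii(text: str) -> bool:
    """Return True when the text matches any of the assumed PII patterns."""
    return bool(SSN_PATTERN.search(text) or EMAIL_PATTERN.search(text))


def score_response(text: str, quoted: float, reference: float,
                   band_pct: float = 1.0) -> dict:
    """Run every check and report an overall pass/fail plus per-check detail."""
    checks = {
        "price_in_band": price_within_band(quoted, reference, band_pct),
        "no_pii": not contains_pii(text),
    }
    return {"passed": all(checks.values()), "checks": checks}
```

Because each check is a pure function of its inputs, the same response always receives the same verdict, and no model tokens are consumed.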

With custom code-based evaluators, you can bring an AWS Lambda function as the evaluation engine and control the scoring logic: regex and structural validation, external data lookups, calls to other services, or business rules. The same evaluator can be used in multiple ways without requiring foundation model (FM) tokens for each request. In on-demand evaluations, it acts as a gate within development workflows and continuous integration and delivery (CI/CD) pipelines. In online evaluation setups, it can score live production traffic. Because the evaluation logic is your own, you can assess agent quality consistently even when traces come from different agent frameworks.
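As a rough sketch of what such a Lambda-backed evaluator might look like, the handler below checks the broker-identification workflow mentioned earlier. Only the `lambda_handler(event, context)` signature is the standard AWS Lambda Python entry point. The event shape (a `tool_calls` list of tool names), the tool names, and the returned `score`, `label`, and `explanation` fields are assumptions made for illustration; consult the AgentCore Evaluations documentation for the actual request and response contract.

```python
from __future__ import annotations

from typing import Any

# Hypothetical tool names; substitute the ones your agent actually exposes.
IDENTIFY_TOOL = "identify_broker"
PROFILE_TOOL = "get_financial_profile"


def workflow_respected(tool_calls: list[str]) -> bool:
    """Return True if no profile access happens before broker identification."""
    identified = False
    for name in tool_calls:
        if name == IDENTIFY_TOOL:
            identified = True
        elif name == PROFILE_TOOL and not identified:
            return False
    return True


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    # Assumed event shape: {"tool_calls": ["identify_broker", ...]}.
    # The real payload sent by the evaluation service may differ.
    tool_calls = event.get("tool_calls", [])
    ok = workflow_respected(tool_calls)
    return {
        "score": 1.0 if ok else 0.0,
        "label": "PASS" if ok else "FAIL",
        "explanation": (
            "Broker identified before every profile access."
            if ok
            else "Financial profile accessed before broker identification."
        ),
    }
```

Keeping the rule in a plain function such as `workflow_respected` means it can be unit-tested locally and reused unchanged whether the handler is invoked on demand or against live traffic.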