Strands Agents + Langfuse Evaluations

Before diving in, check out "AI was supposed to take my job — instead it gave me a new one: Evaluations", a presentation that walks through this PoC.

See the source code on GitHub.

In this project we will build, step by step, a Python banking assistant agent with Strands Agents and make it observable and continuously evaluated with Langfuse.

Strands Agents is a lightweight Python SDK for building LLM-powered agents with tool use and session memory, open-sourced by AWS in May 2025. It is Python-native, which pairs well with the Langfuse Python SDK, and new enough to be worth exploring. Any other Python agent framework would work just as well for this PoC.

With classic applications, quality is enforced through unit tests, integration tests, and static analysis: every function has a defined contract and a deterministic output you can assert on. In production, metrics such as error rates, latency, and memory usage surface failures reliably.
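To make "a defined contract and a deterministic output" concrete, here is a minimal sketch of a classic unit test. The transfer function and its test are hypothetical illustrations written for this paragraph and are not taken from the project's source code. Same input, same output, every time, so a plain assert is enough.

```python
from decimal import Decimal


# Hypothetical helper, for illustration only (not part of the PoC).
def transfer(
    balances: dict[str, Decimal], src: str, dst: str, amount: Decimal
) -> dict[str, Decimal]:
    """Return new balances after moving `amount` from `src` to `dst`."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if balances[src] < amount:
        raise ValueError("insufficient funds")
    updated = dict(balances)  # copy so the caller's dict is not mutated
    updated[src] -= amount
    updated[dst] += amount
    return updated


def test_transfer_moves_funds() -> None:
    before = {"checking": Decimal("100.00"), "savings": Decimal("0.00")}
    after = transfer(before, "checking", "savings", Decimal("25.50"))
    # The contract is exact: we can assert on the full result.
    assert after == {"checking": Decimal("74.50"), "savings": Decimal("25.50")}
    # The input is left untouched.
    assert before["checking"] == Decimal("100.00")
```

A test runner such as pytest would pick up test_transfer_moves_funds automatically; calling it directly works as well.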
