Before diving in, check out "AI was supposed to take my job — instead it gave me a new one: Evaluations", a presentation that walks through this PoC.

See the source code on GitHub.

In this project we will build, step by step, a Python banking assistant agent with Strands Agents, then make it observable and continuously evaluated with Langfuse.

Strands Agents is a lightweight Python SDK for building LLM-powered agents with tool use and session memory, open-sourced by AWS in May 2025. It is Python-native — which pairs well with the Langfuse Python SDK — and new enough to be worth exploring. Any other Python agent framework would work just as well for this PoC.

With classic applications, quality is enforced through unit tests, integration tests, and static analysis — every function has a defined contract and a deterministic output you can assert on. In production, metrics (error rates, latency, memory) surface failures reliably.
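
To make the "defined contract, deterministic output" point concrete, here is a minimal sketch of what such a test looks like. The `apply_deposit` function and both tests are hypothetical illustrations and are not taken from the project's source code.

```python
from decimal import Decimal


def apply_deposit(balance: Decimal, amount: Decimal) -> Decimal:
    """Hypothetical banking helper: return the balance after a deposit."""
    if amount <= 0:
        # The contract is explicit: non-positive deposits are rejected.
        raise ValueError("deposit amount must be positive")
    return balance + amount


def test_apply_deposit_returns_exact_balance() -> None:
    # Same inputs always give the same output, so an exact assertion is safe.
    assert apply_deposit(Decimal("100.00"), Decimal("25.50")) == Decimal("125.50")


def test_apply_deposit_rejects_non_positive_amount() -> None:
    try:
        apply_deposit(Decimal("100.00"), Decimal("0"))
    except ValueError:
        return  # expected: the contract was enforced
    raise AssertionError("expected ValueError for a zero deposit")


if __name__ == "__main__":
    test_apply_deposit_returns_exact_balance()
    test_apply_deposit_rejects_non_positive_amount()
    print("all tests passed")
```

Both tests pass or fail identically on every run, which is the property that exact assertions rely on.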