AI agent tradeoffs: what evals catch and what reading traces reveals

Remember being excited about writing unit tests (or dreading it, depending on the stage of your career and the company you worked at)? Or sweating every detail of the end-to-end and integration tests you were sure covered all the use cases your users would hit?

These days a lot of UIs are slowly being replaced by a single input field and an agent that promises to deliver the same value a UI would, but with the elegance and pun-ness of a “Jarvis”.

We craft their SOUL.md, their MEMORY.md, and their system prompt. We pretend we know what we’re doing when we set up evals with prompts we know don’t reflect how our users will actually interact with the agent. Still, we set the threshold, the confidence score comes back satisfactory, and we approve and deploy. Job’s done, right?

Not quite.

Sentry is attending the AI Engineer World’s Fair this week, so I decided to build a little schedule builder with an agent to help people put together their itineraries. (Shout out to Swyx for providing the data, and even the embeddings, for all the speakers, talks, and tracks.)


