Making AI-Generated Code Fail Gracefully

If your app generates code with an LLM and executes it, you already know the dirty secret: it fails a lot. Not catastrophically, just in small ways: wrong method names, bad assumptions about state, off-by-one errors. The kind of mistakes a human would fix in 10 seconds.

The question is what your user sees when that happens.

The Problem

Version 1 of my app showed users raw Python tracebacks when a generated script failed. Something like: