I keep watching teams ship agent systems into production and then discover, on day three, that "the agent needs to wait for a human sometimes" breaks every assumption in their stack. Not because they didn't see it coming, every team plans for HITL. Because every popular agent framework reduces "human in the loop" to "block the Python process on input() and hope for the best."

I spent a day auditing the twelve most popular AI-agent frameworks against a strict production rubric. The results aren't kind. Two frameworks pass. Ten are one production deploy away from breaking.

This post is the receipts.

The rubric (and why it's strict)

A production HITL primitive isn't "the agent can pause for input." That's a 1980s primitive. A production HITL primitive needs six properties: