Every agent failure follows a pattern. Once you know the patterns, you can catch them before they do damage.

I introduced harness engineering yesterday — the discipline of building a safety and reliability layer around AI agents. Today I want to get concrete. These are the seven failure modes every team hits when they run agents in production, how to detect each one, and what to do when you catch it.

1. The Infinite Loop

What it looks like: The agent calls grep with the same pattern six times, gets identical results each time, and never acts on any of them. It's "gathering context." Each call burns tokens. The context window fills. Eventually the session times out or produces garbage.

Why tools miss it: Observability dashboards show six grep calls. They look productive. Orchestration frameworks execute each call faithfully. No error fires. The session is "still running" until it isn't.