The first time one of my long agent sessions went sideways, I went looking for the wrong thing. I grepped for context window exceeded. For a stack trace. For a 400 from the API. There was nothing. The run finished. It just finished worse than it started — sloppier tool calls, a step that re-derived something it had already figured out, an answer that ignored a constraint stated twenty steps earlier.
No crash. No error. The agent didn't run out of context. It got dumber on the way there.
Here's the part that took me too long to accept: the damage starts well before the window is full. On a synthetic model of step success vs. how full the window is, the reliability floor gets crossed at 79% occupancy — not at 100% overflow. By the time you see exceeded, the agent has already spent a chunk of the session quietly making weaker steps.
The short version
If you build agents that run for many steps, watch the fraction of the window in use, not just whether it overflows. Step quality holds flat while the window fills, then bends down past a knee around 70–80%. A deterministic handoff (summarize, checkpoint, compact occupancy back down) at a threshold below that knee keeps the whole session above the line. In the model below, it turns 31 good steps out of 50 into 50 out of 50, at the cost of 2 compactions.







