The premise is simple: a long-running AI agent crashes mid-task. What should happen next?

The common answer is: your orchestrator checkpoints the agent's context, so it can resume from the last step. LangGraph does this with MemorySaver/SQLite. Temporal replays the event log. These work. For the agent's state.

But there's a second thing nobody's checkpointing: the budget envelope around the agent.

What did this run spend before it crashed? How many loops did it complete? How much wall-clock time has elapsed? If those limits live only in memory and the process dies, your safety layer vanishes with it, and a "resumed" agent is really a new agent with a fresh, reset budget.

The failure mode