Most agent architectures are secretly stateless. In this guest post, Addy Osmani, Director at Google Cloud, and Shubham Saboo, Senior AI Product Manager at Google Cloud, discuss what it takes to build ones that aren’t.

Why most AI agents fail in production

Developers spend weeks perfecting prompt engineering, tool calling, and response latency. None of it matters when your agent needs to stay alive for five days.

The workflows that actually matter in production (processing thousands of insurance claims, running week-long sales sequences, reconciling financial data across systems) don't fit inside a single conversation turn. They take days, not seconds. And the moment you try to build them, you run into a wall that most tutorials skip over: most agent architectures reconstruct context from scratch on every interaction. They lose the reasoning chain, the soft signals, and the confidence gradients that made the agent's previous decisions make sense.

This is the production gap. Demos close it with short, clean tasks. Real systems don't get that luxury.

At Google Cloud Next '26, we announced that Agent Runtime now supports long-running agents that maintain state for up to seven days. What follows are five design patterns, drawn from what we've seen actually work in production, for building agents that survive contact with reality.

Pattern 1: Checkpoint-and-Resume

The most common failure mode in multi-day workflows is context loss. An agent processes 200 documents over four hours, then hits an error on document 201. Without checkpointing, you restart from scratch.

The fix is conceptually simple but architecturally important: treat your agent like a long-running server process, not a request handler. The same way you'd build a data pipeline that processes millions of records: checkpoint progress, handle partial failures, ensure idempotency.

from google.adk import Agent, ToolContext