You shipped your agent. Evals were green. A week later you tweak the system prompt to fix one annoying edge case, the CI eval suite passes, you merge, and the next morning your support queue is on fire because the agent now refuses half the legitimate requests it used to handle.
This is the part that rarely gets talked about: passing a pre-merge eval is not the same as knowing a change is safe in production. Your eval suite grades the cases you thought to write down. Production has cases you didn't think of. The gap between those two sets is exactly where agent changes go to die.
The fix is not "write more tests." It's borrowing something web infra has had for fifteen years and almost no agent team uses: shadow deployments and canary evals.
The deploy model agent teams skipped
When you deploy a normal service, you don't flip 100% of traffic to the new version and pray. You run a canary (1%, then 5%, then 25%) and you watch error rates, latency, and saturation at each step. If the new version regresses, you halt and roll back before most users ever touch it.
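To make that loop concrete, here is a minimal sketch of a staged canary gate in Python. Everything named here is hypothetical and only for illustration: `StageMetrics`, `run_canary`, the `observe` callback, and the thresholds (one extra percentage point of errors, 20% more p95 latency) are assumptions, not values from any particular deploy tool. Saturation is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class StageMetrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds


def run_canary(
    stages: Sequence[float],
    observe: Callable[[float], StageMetrics],
    baseline: StageMetrics,
    max_error_increase: float = 0.01,  # assumed threshold, tune for your service
    max_latency_ratio: float = 1.2,    # assumed threshold, tune for your service
) -> Tuple[bool, float]:
    """Step through traffic fractions and halt at the first regression.

    Returns (promoted, last_safe_fraction). A last_safe_fraction of 0.0
    means the very first stage regressed.
    """
    last_safe = 0.0
    for fraction in stages:
        # observe() is a stand-in for "route this share of traffic to the
        # new version and collect metrics for a while".
        metrics = observe(fraction)
        regressed = (
            metrics.error_rate > baseline.error_rate + max_error_increase
            or metrics.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
        )
        if regressed:
            return False, last_safe  # halt; caller rolls back to last_safe
        last_safe = fraction
    return True, last_safe


# Usage with a fake observer whose error rate jumps at 25% of traffic.
def fake_observe(fraction: float) -> StageMetrics:
    if fraction >= 0.25:
        return StageMetrics(error_rate=0.05, p95_latency_ms=210.0)
    return StageMetrics(error_rate=0.012, p95_latency_ms=205.0)


baseline = StageMetrics(error_rate=0.01, p95_latency_ms=200.0)
promoted, last_safe = run_canary([0.01, 0.05, 0.25, 1.0], fake_observe, baseline)
print(promoted, last_safe)
```

In this fake run the first two stages stay within the thresholds and the 25% stage does not, so the rollout halts with 5% as the last safe fraction. The point of the shape is that the decision to continue is made per stage against a baseline, not once before the merge.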






