A system loses a replica during a routine maintenance window. Autoscaling compensates. The platform reports healthy. A week later, queue latency begins climbing during peak load — nothing outside thresholds, nothing that pages anyone. Retry traffic rises against a degraded internal API. Circuit breakers begin suppressing low-priority requests. No incident is triggered. Two weeks after the replica was lost, a routine deployment causes widespread scheduling failure because the cluster had already exhausted its resilience margin across three separate dimensions — and the monitoring stack had reported green throughout.
This is the degradation ladder. Not a failure mode. A pre-failure architecture — the accumulated loss of capacity that makes the eventual incident unrecoverable instead of manageable.
What the Degradation Ladder Actually Is
The degradation ladder is a sequence of capability loss events where each rung represents a measurable reduction in a system's ability to absorb the next failure — and where no individual rung is severe enough to trigger an incident response.
The key concept is resilience margin: the operational distance between a system's current state and the point of irreversible instability. Every rung of the degradation ladder erodes that margin.






