This is reliability engineering from the operator side of a high-volume digital payments platform, where the error budget isn't an abstraction — it's measured in failed transactions, eroded trust, and regulatory scrutiny. The standard SRE playbook still applies, but several of its comfortable assumptions break. This is where, and why.

Quick definitions. SLA is the contractual promise to customers (often with penalties). SLO is the internal target you actually engineer toward (usually stricter than the SLA). Error budget is the inverse of your SLO — if your availability SLO is 99.95%, your error budget is the 0.05% of time you're allowed to be down before you've broken your own target. The budget is a quantity you spend: on risk, on deploys, on the occasional bad day.

The decision in one table

What changes when downtime equals lost money:

Standard SRE assumption