Cross-posted from my infrastructure postmortem series at pavanbhatia.hashnode.dev.
At 1:40 AM on Sunday, our 7 TB Oracle-to-Amazon RDS migration was on the verge of collapse.
Database CPU utilization was sitting below 15%, storage I/O looked healthy, and application logs showed zero errors, yet user-facing latency had spiked by nearly 800%. Our final User Acceptance Testing (UAT) validation had ground to a complete halt. We had under four hours before morning business operations resumed and transactional traffic ramped up.
As the lead cloud architect steering this cutover, I was looking at a rapidly closing maintenance window. We had to decide whether to continue troubleshooting live under intense time pressure or abort the cutover and execute a disciplined rollback.
What failed that night wasn't Oracle. It was years of hidden assumptions whose consequences our on-premises architecture had quietly shielded us from. Here is the operational breakdown of how we isolated our network constraints, executed a controlled rollback, and ultimately uncovered the real-world cost of cloud network round-trips.






