Using the Wrong Scaling Patterns Will Cost You in Production

The Problem We Were Actually Solving

In hindsight, we were optimizing for the wrong problem. We tuned the treasure hunt engine to scale horizontally within a single AWS availability zone and ignored the warning signs: a surge of requests was coming, and once we started distributing our data across regions, that single zone would become a single point of failure that undid the scaling gains. Our team was convinced that adding more instances within the same availability zone would solve all our problems, and that is where we erred.

What We Tried First (And Why It Failed)

Our initial solution was to add more RDS instances behind our NGINX load balancer, hoping to scale out the database to meet the increased traffic. We soon found that the extra instances neither solved our disk I/O problem nor relieved our high CPU usage. Our MySQL instances were choking on the load, and long query times were hurting our ability to serve requests. At that point, we knew we had to go back to the drawing board.
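
One plausible reason the extra instances did not help can be sketched with a toy model. It assumes the added instances behaved like read replicas, where reads are spread across instances but every instance still has to apply every write. The function name and the workload numbers below are hypothetical, chosen only for illustration, and are not measurements from our system.

```python
def per_instance_load(reads_per_sec: float, writes_per_sec: float, instances: int) -> float:
    """Estimate the queries/sec each database instance must handle.

    Assumption (not from the original incident data): reads are spread
    evenly across instances, while every instance applies every write,
    as replicas do under replication.
    """
    if instances < 1:
        raise ValueError("instances must be >= 1")
    return writes_per_sec + reads_per_sec / instances


if __name__ == "__main__":
    # Hypothetical workload, for illustration only.
    reads, writes = 4000.0, 3000.0
    for n in (1, 2, 4, 8):
        load = per_instance_load(reads, writes, n)
        print(f"{n} instance(s): {load:.0f} qps each")
```

Under these assumed numbers, going from one instance to eight only halves the per-instance load (7000 to 3500 qps), because the write volume sets a floor that no number of additional instances removes. If a workload looks like this, disk I/O and CPU pressure from writes would persist no matter how many instances sit behind the load balancer.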

The Architecture Decision

Related reading

Treasure Hunt Engine Was a Disaster Waiting to Happen: A Tale of Unchecked…

When Premature Scaling Leads to Operator Burnout

Performance Lessons from a Real Production Azure Application: Why Scaling…

When I Finally Realized My Runtime Was Holding Me Back

When I Realized We Were Throwing Away Half Our Engine's Potential

Is Your 'Scalable' Backend a Ticking Time Bomb?
