The Problem We Were Actually Solving
In hindsight, we were optimizing for the wrong problem. We tuned the treasure hunt engine to scale horizontally within a single availability zone in AWS and ignored the warning signs: once we started distributing our data across regions, we were going to get slammed with requests and lose everything that scaling had bought us to a single point of failure. Our team was convinced that scaling out with more instances in the same availability zone would solve all our problems, and that was where we erred.
What We Tried First (And Why It Failed)
Our initial solution was to add more RDS instances behind our NGINX load balancer, hoping to scale out the database to meet the increased traffic. We soon realized that adding instances fixed neither our disk I/O bottleneck nor our high CPU usage. Our MySQL instances were still choking on the traffic, and long query times were hurting our ability to serve requests. At that point, we knew we had to go back to the drawing board.
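One possible reason extra instances did not help can be sketched with some back-of-the-envelope arithmetic. This is only an illustration under stated assumptions, not a description of our actual setup: it assumes the instances were replicas of one another, that reads were spread evenly across them, and that every write had to be applied on every instance. The function name and the traffic numbers are hypothetical.

```python
def per_instance_load(read_qps, write_qps, instances):
    """Approximate queries per second each database instance must handle.

    Assumption: reads are balanced evenly across instances, while every
    write is applied on every instance (as with replication).
    """
    if instances < 1:
        raise ValueError("instances must be >= 1")
    return read_qps / instances + write_qps


if __name__ == "__main__":
    # Hypothetical traffic numbers, chosen only for illustration.
    for n in (1, 2, 4, 8):
        load = per_instance_load(read_qps=2000, write_qps=1500, instances=n)
        print(f"{n} instance(s): ~{load:.0f} qps each")
```

Under these assumptions the per-instance load never drops below the write rate, no matter how many instances are added, which would be consistent with disk I/O and CPU staying high after scaling out.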
The Architecture Decision







