The Problem We Were Actually Solving

At the time, we were facing the classic scaling problem in Treasure Hunt Engine – the popularity of our game had grown exponentially, but our ability to serve that demand had not. Every night at midnight, our batch pipeline would run and populate our warehouse with the previous day's data, but the window for this batch processing was growing longer and longer. Our operators were frantically trying to speed up the pipeline, but it was always a game of catch-up.

What We Tried First (And Why It Failed)

We tried to solve this problem by switching to a streaming architecture, using Apache Kafka to stream events from our application directly into our warehouse. On paper, it seemed like a brilliant solution – we could process data as it happened, rather than trying to play catch-up every night. But what we had overlooked was the sheer volume of data we were generating. Our application was producing tens of millions of events per day, and Kafka was struggling to keep up. The result was a system that was constantly under capacity, and our operators were spending more and more time trying to troubleshoot the issues.

The Architecture Decision