The Problem We Were Actually Solving

Behind the scenes, this application relies on a complex interplay of Apache Kafka topics, Redshift views, and Presto queries to fetch user information, generate quest narratives, and push notifications. As users began to interact with the system, performance issues began to surface: query costs were piling up, and we were hitting Redshift's maximum concurrent query limit. We were also unable to meet the 2-minute query latency SLA for our most-used quest types. The root cause of these issues lay in the configuration decisions we made around Apache Kafka partitioning and Presto query optimization.

What We Tried First (And Why It Failed)

Initially, we designed the Treasure Hunt Engine to partition its Kafka topics by user ID. This approach seemed reasonable at first glance, as it would group related events together and enable more efficient aggregation in Presto. However, we soon discovered that this design led to hot partitions – those topics that receive a disproportionate share of writes – which caused uneven Kafka cluster utilization and subsequent latency issues. To make matters worse, our initial Presto query optimization strategy focused on reducing the number of queries executed by caching intermediate results. However, this approach only exacerbated the hot partition problem, as it encouraged the system to produce more queries overall.