Category: Events

The Problem We Were Actually Solving

Behind the scenes, this application relies on a complex interplay of Apache Kafka topics, Redshift views, and Presto queries to fetch user information, generate quest narratives, and push notifications. As users began to interact with the system, performance issues began to surface: query costs were piling up, and we were hitting Redshift's maximum concurrent query limit. We were also unable to meet the 2-minute query latency SLA for our most-used quest types. The root cause of these issues lay in the configuration decisions we made around Apache Kafka partitioning and Presto query optimization.

What We Tried First (And Why It Failed)

Initially, we designed the Treasure Hunt Engine to partition its Kafka topics by user ID. This approach seemed reasonable at first glance, as it would group related events together and enable more efficient aggregation in Presto. However, we soon discovered that this design led to hot partitions – those topics that receive a disproportionate share of writes – which caused uneven Kafka cluster utilization and subsequent latency issues. To make matters worse, our initial Presto query optimization strategy focused on reducing the number of queries executed by caching intermediate results. However, this approach only exacerbated the hot partition problem, as it encouraged the system to produce more queries overall.

The Problem We Were Actually Solving

What We Tried First (And Why It Failed)

Category: Events

Category: Events

Other newsrooms on this story

Related reading

The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

When I Finally Realized My Runtime Was Holding Me Back

How the Events Table That Looked Right Killed Our Queue

The Flawed Promise of Real-Time Event Handling

The Moment the Runtime Became Your Enemy

How We Blew Up Our Event Pipeline at 3 AM Because the Treasure Hunt Engine Had…

Other newsrooms on this story

Related reading

The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

When I Finally Realized My Runtime Was Holding Me Back

How the Events Table That Looked Right Killed Our Queue

The Flawed Promise of Real-Time Event Handling

The Moment the Runtime Became Your Enemy

How We Blew Up Our Event Pipeline at 3 AM Because the Treasure Hunt Engine Had…