The 32-bit Hidden Countdown in ClickHouse Keeper: How an XID Overflow Gave Us Weekly Read-Only Bursts

A production debugging story: tracing recurring 2–5-second read-only storms on a ClickHouse cluster down to a single 32-bit integer — and the one-line fix.

TL;DR

Our production ClickHouse cluster started throwing read-only bursts on its replicated tables every few days. Each burst fired 4–5 Slack alerts and recovered on its own within seconds. The cause wasn't network, disk, or load. It was a 32-bit transaction-ID counter (the Keeper "xid") overflowing at our request rate of ~6,500 requests/sec/pod. When it wrapped past 2.1 billion, ClickHouse force-killed its Keeper session, every replicated table on that pod went read-only for 2–5 seconds, and the disruption cascaded to the other replicas. The fix is one setting, use_xid_64, which widens the counter to 64 bits and pushes the overflow horizon from every few days to effectively never.
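To make the "every few days" figure concrete, here is a back-of-the-envelope sketch of the overflow arithmetic. It rests on assumptions not spelled out above: that the counter is signed (consistent with wrapping near 2.1 billion), that it starts near zero for each new session, that each request consumes exactly one xid, and that the rate holds steady at the quoted ~6,500 requests/sec. The function name days_until_wrap is made up for illustration and is not part of ClickHouse.

```python
# Rough estimate of how long a per-session request counter lasts before it wraps.
# Assumptions (illustrative, not taken from ClickHouse source): signed counter,
# starts at zero, one increment per request, constant request rate.

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25


def days_until_wrap(counter_bits: int, requests_per_sec: float) -> float:
    """Days until a signed counter of `counter_bits` bits reaches its maximum."""
    max_value = 2 ** (counter_bits - 1) - 1
    return max_value / requests_per_sec / SECONDS_PER_DAY


RATE = 6_500  # requests/sec/pod, the figure quoted in the TL;DR

days_32 = days_until_wrap(32, RATE)
years_64 = days_until_wrap(64, RATE) / DAYS_PER_YEAR

print(f"32-bit counter: about {days_32:.1f} days between wraps")
print(f"64-bit counter: about {years_64:,.0f} years between wraps")
```

Under those assumptions the 32-bit counter lasts a little under four days, which lines up with the alert cadence, while the 64-bit counter lasts tens of millions of years at the same rate.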

The symptom: a storm that fixes itself

The first thing we saw was a pattern, not a single failure. Every few days, the production cluster would emit a burst of alerts:
