A production debugging story: tracing recurring 2–5-second read-only storms on a ClickHouse cluster down to a single 32-bit integer — and the one-line fix.
TL;DR
Every few days, our production ClickHouse cluster threw a burst of read-only errors on its replicated tables. Each burst fired 4–5 Slack alerts and recovered on its own within seconds. The cause wasn't network, disk, or load. It was a 32-bit transaction-ID counter (the Keeper "xid") overflowing at our request rate of ~6,500 requests/sec/pod. When the counter wrapped past 2.1 billion, ClickHouse force-killed its Keeper session, every replicated table on that pod went read-only for 2–5 seconds, and the failure cascaded to the other replicas. The fix is one setting, use_xid_64, which widens the counter to 64 bits and pushes the overflow horizon from every few days to effectively never.
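To see why "every few days" follows from the numbers above, here is a rough back-of-the-envelope sketch. It assumes the xid is a signed counter that advances by one per request, and that the ~6,500 requests/sec/pod rate stays constant. Both are simplifications, and the function name is ours, not anything from ClickHouse.

```python
# Back-of-the-envelope estimate of how long a Keeper xid lasts before wrapping.
# Assumptions (ours, for illustration): signed counter, one xid per request,
# constant request rate.

INT32_MAX = 2**31 - 1   # ~2.1 billion
INT64_MAX = 2**63 - 1

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def seconds_until_wrap(max_xid: int, requests_per_sec: float) -> float:
    """Seconds until a counter starting at 0 reaches max_xid."""
    return max_xid / requests_per_sec


RATE = 6_500  # requests/sec/pod, from the figure quoted above

days_32 = seconds_until_wrap(INT32_MAX, RATE) / SECONDS_PER_DAY
years_64 = seconds_until_wrap(INT64_MAX, RATE) / SECONDS_PER_YEAR

print(f"32-bit xid wraps after ~{days_32:.1f} days")
print(f"64-bit xid wraps after ~{years_64 / 1e6:.0f} million years")
```

At this rate the 32-bit counter lasts a little under four days, which matches the cadence of the storms, while the 64-bit counter lasts on the order of tens of millions of years.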
The symptom: a storm that fixes itself
The first thing we saw was a pattern, not a single failure. Every few days, the production cluster would emit a burst of alerts:







