I've been running Strimzi Kafka in production at scale for the past few years - multi-cloud, multi-zone, mixed broker sizes, the usual. The first year I spent more time firefighting Kafka than building anything on top of it. The next two I spent slowly stripping operational pain out of the setup until it stopped paging me.

This is the configuration I landed on, why each part is shaped the way it is, and the production failure modes that drove each decision. No theory, no marketing, no Hello-World defaults - only the cluster I actually run.

The problem

A production Kafka cluster on Kubernetes has to survive at least four things at once:

A single broker dying mid-write.