You have a well-designed order service. It writes to the database and publishes an event to Kafka. Clean, decoupled, event-driven. Then Kafka has a brief network hiccup. The database write succeeds. The event publish fails. The order exists. Fulfillment never hears about it. No alert fires. Just a quietly broken order going nowhere.

This is the dual write problem — an architectural correctness problem that exists the moment you write to two separate systems without a coordination mechanism.

The Problem

A dual write occurs when your application writes to two separate systems as part of a single logical operation without atomicity across both. The dangerous failure modes are the silent ones: the first write commits, the second is lost, the HTTP response still returns 200, and nothing happens downstream.
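
To make the silent failure concrete, here is a minimal sketch in Python. It assumes in-memory stand-ins for both systems; `InMemoryDatabase`, `FlakyBroker`, `BrokerUnavailable`, and `create_order` are hypothetical names invented for illustration, not part of any real database driver or Kafka client.

```python
class BrokerUnavailable(Exception):
    """Raised by the stand-in broker to simulate a brief network hiccup."""


class InMemoryDatabase:
    """Hypothetical stand-in for the order database."""

    def __init__(self):
        self.orders = {}

    def insert_order(self, order_id, payload):
        # In a real service this would be a committed transaction.
        self.orders[order_id] = payload


class FlakyBroker:
    """Hypothetical stand-in for a message broker such as Kafka."""

    def __init__(self, fail=False):
        self.fail = fail
        self.events = []

    def publish(self, topic, event):
        if self.fail:
            raise BrokerUnavailable("simulated network hiccup")
        self.events.append((topic, event))


def create_order(db, broker, order_id, payload):
    """Handle an order with two uncoordinated writes and return an HTTP status."""
    # Write 1: the order is stored and stays stored, whatever happens next.
    db.insert_order(order_id, payload)

    try:
        # Write 2: a separate system, with no atomicity shared with write 1.
        broker.publish("orders", {"order_id": order_id, **payload})
    except BrokerUnavailable:
        # The error is swallowed here, so the caller never learns about it.
        pass

    return 200


db = InMemoryDatabase()
broker = FlakyBroker(fail=True)
status = create_order(db, broker, "order-1", {"sku": "abc", "qty": 2})

print(status)                  # the client sees success
print("order-1" in db.orders)  # the order exists
print(broker.events)           # no event was ever published
```

In this sketch the handler reports success, the order is in the database, and the event list is empty. Nothing in the code path records that the two systems now disagree, which is why this failure tends to go unnoticed.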

The naive fixes don't work: