The Problem We Were Actually Solving
At Veltrix, we had a simple monolithic service that handled everything: orders, products, inventory, and so on. During peak hours, certain pages failed at high rates (30-40% in extreme cases). We wanted to break the monolith apart and decouple the pieces with an event mesh to bring those failure rates down.
What We Tried First (And Why It Failed)
Our first implementation of an event mesh was built on top of Apache Kafka. We were excited because we had heard about its low latency and scalability. However, we quickly hit limitations around Kafka's max.in.flight.requests.per.connection and replication.factor settings. During peak hours, 40% of all requests on our e-commerce platform needed at least one retry. The high failure rates then left us with hundreds of messages in the dead-letter queue, and our system would end up in an incorrect state.
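To make the interaction between retries and max.in.flight.requests.per.connection concrete, here is a minimal sketch of one well-known way these settings can leave a system in an incorrect state: a retried batch landing after a later batch in the same partition. The values we actually ran with are not recorded here, so the configurations below are assumptions, and the helper name retry_can_reorder is hypothetical. It only inspects a plain dict of producer settings and does not talk to a broker.

```python
def retry_can_reorder(config: dict) -> bool:
    """Return True if producer retries could reorder messages in a partition.

    Simplified rule of thumb (a sketch, not a full model of any client):
    - with no retries, or at most one in-flight request, a retry cannot
      overtake a later batch;
    - with the idempotent producer enabled and at most 5 in-flight requests,
      Kafka preserves per-partition ordering across retries;
    - otherwise a failed batch can be retried after a later batch succeeded.
    """
    # All three keys are required so the sketch never guesses client defaults,
    # which differ between clients and versions.
    in_flight = int(config["max.in.flight.requests.per.connection"])
    retries = int(config["retries"])
    idempotent = bool(config["enable.idempotence"])

    if retries == 0 or in_flight <= 1:
        return False
    if idempotent and in_flight <= 5:
        return False
    # Idempotence with more than 5 in-flight requests is rejected by the
    # Java client; this sketch simply treats it as unsafe.
    return True


# Assumed example configurations, not the ones from our production setup.
risky = {
    "max.in.flight.requests.per.connection": 5,
    "retries": 3,
    "enable.idempotence": False,
}
safer = {**risky, "enable.idempotence": True}

print(retry_can_reorder(risky))
print(retry_can_reorder(safer))
```

A check like this could run in CI against service configuration, so a risky combination is caught before it reaches peak traffic. It says nothing about replication.factor, which is a topic-level setting and would need a separate check.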
The Architecture Decision








