Not every failure is permanent.

This is something I didn't think about before. When something failed in my app, my first thought was: something broke, fix it. But when I started learning how distributed systems actually work, I realized that some failures are not really failures. They're just temporary.

Network glitch. API timeout. A service that just restarted. Rate limiting kicking in. These are all failures, but they only last for a short window of time. If your system tries the same operation again after a few seconds, it will probably succeed.

So the question is: does your system know how to try again? Or does it just give up the first time something goes wrong?

That's what a retry is.
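
To make the idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: `TransientError` is a hypothetical exception that stands in for the temporary failures above (timeouts, restarts, rate limits), `retry` is a hypothetical helper, and `flaky` simulates a service that fails twice and then recovers. The fixed delay and the three attempts are arbitrary choices, not recommendations.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, attempts=3, delay_seconds=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Only TransientError is retried; any other exception is treated as
    permanent and propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: give up and surface the error
            sleep(delay_seconds)  # wait a little before trying again


# Simulated service: fails on the first two calls, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary glitch")
    return "ok"


result = retry(flaky, delay_seconds=0.1)
print(result, "after", calls["count"], "calls")
```

The part that matters is the `except TransientError` line. The system only tries again when the failure looks temporary. A real bug, like a `ValueError` from bad input, still fails on the first attempt, because trying again would not change anything.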