Most rate-limiter tutorials show you a tidy little token bucket that works perfectly, as long as it runs on one machine. Then you deploy to production, where you're running three copies of your app behind a load balancer, and the limiter quietly stops doing its job. Nobody gets an error. Nothing crashes. Your "100 requests per minute" silently becomes up to 300, and you don't find out until something downstream falls over.

This post explains why that happens, walks through a small demo you can run to see it, and shows the one change that fixes it.

The limiter that works on your laptop

Here's a textbook in-memory token bucket. The maths is correct: tokens refill at a fixed rate, a request spends one, and you reject when the bucket is empty.

import time