Your Python rate limiter is lying to you the moment you add a second server

Most rate-limiter tutorials show you a tidy little token bucket that works perfectly, as long as it runs on one machine. Then you deploy to production, where you're running three copies of your app behind a load balancer, and the limiter quietly stops doing its job. Nobody gets an error. Nothing crashes. Your "100 requests per minute" silently becomes as many as 300 across the three copies, and you don't find out until something downstream falls over.

This post covers why that happens, a small demo you can run to see it, and the one change that fixes it.

The limiter that works on your laptop

Here's a textbook in-memory token bucket. The maths is correct: tokens refill at a fixed rate, a request spends one, and you reject when the bucket is empty.

import time
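As a rough sketch of the bucket described above, and of what happens when several copies of it run side by side, something like the following could work. This is an illustration under stated assumptions, not necessarily the exact code this post uses: the `TokenBucket` class, its `allow` method, the injectable `clock` parameter, and the `simulate` helper are all hypothetical names, and the round-robin distribution stands in for whatever your load balancer actually does.

```python
import time


class TokenBucket:
    """In-memory token bucket; its state lives in this one process only."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock              # injectable so the demo can freeze time
        self.tokens = float(capacity)   # start with a full bucket
        self.last_refill = clock()

    def allow(self):
        """Spend one token if available; return False when the bucket is empty."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill at a fixed rate, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate(replicas, requests, clock):
    """Spread `requests` round-robin over `replicas` independent buckets."""
    buckets = [
        TokenBucket(capacity=100, refill_rate=100 / 60, clock=clock)
        for _ in range(replicas)
    ]
    # Each replica consults only its own bucket, as separate processes would.
    return sum(buckets[i % replicas].allow() for i in range(requests))


def frozen():
    """Assumed frozen clock: no time passes, so nothing refills mid-burst."""
    return 0.0


one_server = simulate(replicas=1, requests=500, clock=frozen)
three_servers = simulate(replicas=3, requests=500, clock=frozen)
print(one_server, three_servers)  # prints "100 300"
```

With the clock frozen, a single bucket admits 100 of the 500 requests, while three independent buckets admit 300 between them, because no bucket can see what the others have already spent.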

Related reading

Building a Rate Limiter That Actually Works

Rate limiting in web apps: what to protect before picking a library

API Throttling: Algorithms, Patterns & Mistakes

Building a Sliding Window Rate Limiter in Redis for a Multi-Region Video API

Rate Limiting in Spring Boot REST APIs: Bucket4j + Redis

Python Tools for Managing API Rate Limits in Data Pipelines
