[Diagram: client devices → rate limiter (token bucket / sliding window) → protected API server; throttled requests are rejected with HTTP 429]

Rate Limiting

Discover powerful algorithms that control request flow and protect your systems from overload

Rate limiting is a fundamental technique in system design to control the rate at which requests are processed. It protects services from being overwhelmed, ensures fair resource allocation, and maintains quality of service.

Why Rate Limiting?

  • Prevent DoS attacks: Stop malicious actors from overwhelming your service
  • Fair usage: Ensure all users get reasonable access to resources
  • Cost control: Limit expensive API calls or resource usage
  • Quality of service: Maintain consistent performance for all users

Implementations

We've implemented four different rate limiting algorithms, each with its own trade-offs:

1. Token Bucket

The token bucket algorithm is one of the most popular rate limiting approaches. Think of it as a bucket that holds tokens:

  • Tokens are added at a constant rate (e.g., 10 tokens per second)
  • Each request consumes one or more tokens
  • Requests are rejected when the bucket is empty
  • Burst capacity allows handling spike traffic up to bucket size

import time

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate = rate          # tokens per second
        self.burst = burst        # max tokens in bucket
        self.tokens = burst       # start with full bucket
        self.last_time = time.time()  # last refill time

Visual Flow:

[Diagram: request arrives → calculate elapsed time → refill tokens (elapsed × rate) → if tokens ≥ 1, consume one and allow; otherwise reject as rate limited]

Core Algorithm Logic:

def allow(self) -> bool:
    # 1. Calculate elapsed time and refill tokens
    now = time.time()
    elapsed = now - self.last_time
    self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
    self.last_time = now
    
    # 2. Check if tokens available
    if self.tokens >= 1:
        self.tokens -= 1  # Consume token
        return True       # Allow request
    return False          # Reject request

Pros:

  • Smooth rate limiting
  • Allows burst traffic
  • Memory efficient

Cons:

  • Not strictly accurate over short intervals
  • Requires token calculation on each request
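Putting the constructor and `allow()` above together gives a runnable sketch. With a burst of 3 and a slow refill rate, three back-to-back calls drain the bucket and the fourth is rejected:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate = rate              # tokens added per second
        self.burst = burst            # max tokens in the bucket
        self.tokens = float(burst)    # start with a full bucket
        self.last_time = time.time()  # last refill timestamp

    def allow(self) -> bool:
        # Refill based on elapsed time, capped at burst capacity
        now = time.time()
        elapsed = now - self.last_time
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_time = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

rl = TokenBucket(rate=1.0, burst=3)
results = [rl.allow() for _ in range(4)]
print(results)  # first three drain the burst; the fourth is rejected
```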

2. Fixed Window Counter

Divides time into fixed windows and counts requests per window:

[Diagram: Window 1 (12:00:00-12:00:59): 8/10 requests used · Window 2 (12:01:00-12:01:59): 2/10 requests used]

import time

class FixedWindow:
    def __init__(self, window_seconds: int, limit: int):
        self.window_seconds = window_seconds  # window size in seconds
        self.limit = limit                    # max requests per window
        self.counter = 0                      # current window request count
        self.window_start = None              # current window start time

Core Algorithm Logic:

def allow(self) -> bool:
    # 1. Calculate current window
    now = time.time()
    current_window = int(now // self.window_seconds) * self.window_seconds
    
    # 2. Reset counter if new window
    if self.window_start != current_window:
        self.counter = 0
        self.window_start = current_window
    
    # 3. Check if under limit
    if self.counter < self.limit:
        self.counter += 1  # Increment counter
        return True        # Allow request
    return False           # Reject request

The Boundary Problem:

[Diagram: 10 requests at 11:59:50-11:59:59 (end of window 1) followed by 10 requests at 12:00:00-12:00:10 (start of window 2) → 20 requests allowed in about 20 seconds, double the intended rate]
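The failure mode can be reproduced deterministically with a variant of the counter above that takes the timestamp as a parameter (`allow_at` is a testing convenience added here, not part of the original API). With a boundary at t = 120000, ten requests just before it and ten just after are all accepted:

```python
class FixedWindow:
    def __init__(self, window_seconds: int, limit: int):
        self.window_seconds = window_seconds
        self.limit = limit
        self.counter = 0
        self.window_start = None

    def allow_at(self, now: float) -> bool:
        # Same logic as allow(), with the clock injected for the demo
        current_window = int(now // self.window_seconds) * self.window_seconds
        if self.window_start != current_window:
            self.counter = 0
            self.window_start = current_window
        if self.counter < self.limit:
            self.counter += 1
            return True
        return False

fw = FixedWindow(window_seconds=60, limit=10)
late = sum(fw.allow_at(119990 + i) for i in range(10))   # last 10s of window 1
early = sum(fw.allow_at(120000 + i) for i in range(10))  # first 10s of window 2
print(late + early)  # 20 requests accepted within ~20 seconds
```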

3. Sliding Window Log

Stores timestamps of all requests within the current time window:

import time
from typing import List

class SlidingWindowLog:
    def __init__(self, window_seconds: int, limit: int):
        self.window_seconds = window_seconds
        self.limit = limit
        self.timestamps: List[float] = []  # all request timestamps in window

Core Algorithm Logic:

def allow(self) -> bool:
    now = time.time()
    cutoff = now - self.window_seconds  # Window boundary
    
    # 1. Remove expired timestamps
    while self.timestamps and self.timestamps[0] < cutoff:
        self.timestamps.pop(0)  # O(n) on a list; prefer collections.deque in production
    
    # 2. Check if under limit
    if len(self.timestamps) < self.limit:
        self.timestamps.append(now)  # Add current request
        return True   # Allow request
    return False      # Reject request

How Sliding Window Log Fixes the Boundary Problem:

[Diagram: a 60s window sliding from 11:59:05 to 12:00:05; with 5 requests at 11:59:10-40 and 5 at 11:59:50-59 already logged, the 12:00:00-10 burst is rejected, so no 60s window ever exceeds 10 requests]

Most accurate but memory intensive (O(n) where n = requests in window)
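Replaying the same boundary burst (ten requests just before a window boundary at t = 120000, ten just after) through a timestamp-injected version of the log shows the fix; as before, `allow_at` is only a testing convenience:

```python
from typing import List

class SlidingWindowLog:
    def __init__(self, window_seconds: int, limit: int):
        self.window_seconds = window_seconds
        self.limit = limit
        self.timestamps: List[float] = []

    def allow_at(self, now: float) -> bool:
        # Evict timestamps older than the sliding window, then check the count
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] < cutoff:
            self.timestamps.pop(0)
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

swl = SlidingWindowLog(window_seconds=60, limit=10)
late = sum(swl.allow_at(119990 + i) for i in range(10))   # just before the boundary
early = sum(swl.allow_at(120000 + i) for i in range(10))  # burst just after it
print(late, early)  # 10 allowed, then 0: every 60s window stays within the limit
```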

4. Sliding Window Counter

Hybrid approach combining fixed windows with weighted calculation:

Previous window: 80 requests
Current window: 40 requests
Position: 25% into current window

Weighted count = 40 + (80 × 0.75) = 100

import time

class SlidingWindowCounter:
    def __init__(self, window_seconds: int, limit: int):
        self.window_seconds = window_seconds  # window size in seconds
        self.limit = limit                    # max requests per window
        self.current_count = 0                # requests in current window
        self.previous_count = 0               # requests in previous window
        self.current_window = None            # current window start time

Core Algorithm Logic:

def allow(self) -> bool:
    now = time.time()
    current_window = int(now // self.window_seconds) * self.window_seconds
    
    # 1. Update counters if new window
    if self.current_window != current_window:
        self.previous_count = self.current_count
        self.current_count = 0
        self.current_window = current_window
    
    # 2. Calculate weighted count
    elapsed = now - current_window
    weight = 1.0 - (elapsed / self.window_seconds)
    estimated_count = self.current_count + int(self.previous_count * weight)
    
    # 3. Check if under limit
    if estimated_count < self.limit:
        self.current_count += 1  # Increment current window
        return True              # Allow request
    return False                 # Reject request
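The earlier worked example (previous window 80, current window 40, 25% into the current window) can be checked directly against this formula:

```python
window_seconds = 60
elapsed = 15                               # 25% into the current window
previous_count, current_count = 80, 40

# Fraction of the previous window that still overlaps the sliding window
weight = 1.0 - (elapsed / window_seconds)  # 0.75
estimated = current_count + int(previous_count * weight)
print(weight, estimated)  # 0.75 100
```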

How Sliding Window Counter Approximates the Solution:

[Diagram: at 12:00:15, 25% into the current window, weight = 1 − (15s / 60s) = 0.75; estimated count = 2 (current) + 8 × 0.75 = 8 requests, under the limit of 10]

Performance Comparison

  • Token Bucket: 43 ns/op, 0 B/op
  • Fixed Window: 43 ns/op, 0 B/op
  • Sliding Log: 45 ns/op, 100 B/op
  • Sliding Counter: 51 ns/op, 0 B/op

Note: Benchmarks run on Apple M4 Pro. Lower latency is better. All algorithms except Sliding Log have zero memory allocations per operation.

Algorithm Comparison

🪣Token Bucket

Memory: O(1)
Accuracy: Good
Complexity: Low
Best for: General purpose, APIs

KEY FEATURES

  • Burst handling
  • Simple implementation
  • Memory efficient

🪟Fixed Window

Memory: O(1)
Accuracy: Fair
Complexity: Very Low
Best for: Simple quotas

KEY FEATURES

  • Simplest algorithm
  • Reset at intervals
  • Minimal overhead

📊Sliding Log

Memory: O(n)
Accuracy: Excellent
Complexity: Medium
Best for: Critical accuracy

KEY FEATURES

  • Most accurate
  • Request-level tracking
  • Higher memory usage

🔢Sliding Counter

Memory: O(1)
Accuracy: Very Good
Complexity: Medium
Best for: High-traffic APIs

KEY FEATURES

  • Good balance
  • Approximates sliding log
  • Memory efficient


Production Considerations

1. Distributed Rate Limiting

In a distributed system, you need to coordinate rate limits across multiple servers:

# Using Redis to share one counter across servers (fixed-window INCR sketch)
import time
import redis
from typing import Optional

class RedisRateLimiter:
    def __init__(self, redis_client: redis.Redis, limit: int,
                 window_seconds: int, local_limiter: Optional[object] = None):
        self.client, self.limit = redis_client, limit
        self.window_seconds = window_seconds
        self.local = local_limiter  # local fallback limiter

    def distributed_allow(self, key: str) -> bool:
        # All servers increment the same per-window key
        window = int(time.time() // self.window_seconds)
        count = self.client.incr(f"ratelimit:{key}:{window}")
        if count == 1:  # first hit in this window: set an expiry so keys clean up
            self.client.expire(f"ratelimit:{key}:{window}", self.window_seconds * 2)
        return count <= self.limit

2. Rate Limit Headers

Always return rate limit information to clients:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1640995200
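A small helper along these lines can build the response headers; `rate_limit_headers` and its window-aligned reset time are illustrative assumptions, not a standard API:

```python
import time

def rate_limit_headers(limit: int, remaining: int, window_seconds: int) -> dict:
    # Reset at the start of the next fixed window, as a Unix timestamp
    reset = (int(time.time()) // window_seconds + 1) * window_seconds
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset),
    }

headers = rate_limit_headers(limit=100, remaining=45, window_seconds=60)
print(headers)
```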

3. Graceful Degradation

def allow_with_fallback(self, key: str) -> bool:
    try:
        # Try distributed rate limiter
        return self.distributed_allow(key)
    except Exception:
        # Fall back to local rate limiter
        return self.local.allow() if self.local else False

Code Examples

Basic Usage

# Create a rate limiter
rl = TokenBucket(
    rate=100.0,   # 100 requests per second
    burst=200     # burst up to 200
)

# Check if request is allowed
if rl.allow():
    # Process request
    print("Request allowed")
else:
    # Return 429 Too Many Requests
    print("Rate limit exceeded")

API Gateway Example

from flask import Flask, request, jsonify
from functools import wraps

def rate_limit_middleware(limiter_factory):
    limiters = {}  # one rate limiter per client IP

    def decorator(f):
        @wraps(f)
        def decorated_function(*args, **kwargs):
            client_ip = request.remote_addr
            limiter = limiters.setdefault(client_ip, limiter_factory())

            if not limiter.allow():
                return jsonify({
                    "error": "Rate limit exceeded",
                    "retry_after": 60
                }), 429

            return f(*args, **kwargs)
        return decorated_function
    return decorator

# Usage example: each client IP gets its own token bucket
app = Flask(__name__)

@app.route('/api/data')
@rate_limit_middleware(lambda: TokenBucket(rate=100.0, burst=200))
def get_data():
    return jsonify({"data": "Hello World"})

Implementation

The Python implementations above provide clear, readable reference code for all four rate limiting algorithms. Key benefits:

  • Cleaner syntax - easier to understand the core algorithm logic
  • Better readability - intuitive time handling and data structures
  • Type hints - improved code documentation and IDE support
  • Pythonic patterns - follows Python best practices

For production use, consider:

  • Using collections.deque instead of list for better performance in Sliding Window Log
  • Adding proper error handling and logging
  • Implementing distributed coordination with Redis or similar
  • Adding metrics and monitoring
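For the first point, a deque-based variant of the Sliding Window Log is a near drop-in change (the class name here is just for illustration):

```python
import time
from collections import deque

class DequeSlidingWindowLog:
    def __init__(self, window_seconds: int, limit: int):
        self.window_seconds = window_seconds
        self.limit = limit
        self.timestamps: deque = deque()

    def allow(self) -> bool:
        now = time.time()
        cutoff = now - self.window_seconds
        # popleft() is O(1), unlike list.pop(0) which is O(n)
        while self.timestamps and self.timestamps[0] < cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

swl = DequeSlidingWindowLog(window_seconds=60, limit=3)
results = [swl.allow() for _ in range(4)]  # fourth call exceeds the limit of 3
print(results)
```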

Next Steps

  • Learn about Consistent Hashing for distributed systems
  • Implement rate limiting in your own projects
