Rate limiting is everywhere. Every time you hit "Send" on Twitter, make an API call, or refresh a page, there's a rate limiter protecting the system from overload.
"Design a rate limiter" appears at Amazon, Google, Stripe, Meta, and virtually every tech company. Why?
- Critical for production systems - prevents abuse and ensures stability
- Tests distributed systems knowledge - involves coordination across servers
- Has multiple valid solutions - lets you showcase trade-off analysis
- Appears in real work - you'll actually implement this on the job
This guide walks you through designing a production-grade rate limiter, using the exact framework that gets you hired.
Phase 1: Clarify Requirements (5 minutes)
Start by understanding what you're building. Never jump straight to the solution.
Functional Requirements
You: "Let me clarify the functional requirements:
- Are we building a rate limiter for an API, or a generic rate limiter service?
- What should happen when the limit is exceeded? Return an error, or queue requests?
- Do we need different rate limits for different endpoints or users?
- Should rate limits be per user, per IP, per API key, or configurable?
- Do we need to notify users when they're approaching their limit?"
Interviewer: "Build a generic rate limiter that can be used across our microservices. When limits are exceeded, reject requests with a 429 status. Support different limits per user ID. We don't need notifications."
You: "Got it. So the MVP is:
- Rate limit HTTP requests
- Configurable limits per user ID
- Reject excess requests with 429 Too Many Requests
- Should work across multiple services
Let me clarify what rules we're enforcing."
Rate Limiting Rules
You: "What type of rate limiting rules do we need?"
Interviewer: "Let's support:
- X requests per second
- X requests per minute
- X requests per hour
For example: 10 requests/second, 100 requests/minute, 1000 requests/hour."
You: "Perfect. Multiple time windows, got it."
Non-Functional Requirements
You: "Now for scale and performance:
- How many requests per second will the rate limiter process?
- What's the latency requirement? How fast must the check be?
- Are we deploying across multiple data centers?
- What's acceptable accuracy? Is 99% accuracy okay, or do we need 100%?
- Should the rate limiter be highly available?"
Interviewer: "The rate limiter will protect services handling 1 million requests/second. Latency must be under 10ms - this is in the hot path. Multi-region deployment. 99.99% accuracy is fine - we can tolerate occasional false negatives. High availability is critical - if it goes down, all services go down."
You: "Excellent, let me summarize:"
Functional Requirements:
✅ Rate limit by user ID
✅ Support multiple time windows (second, minute, hour)
✅ Return 429 when limit exceeded
✅ Work across microservices
Non-Functional Requirements:
- 1M requests/second throughput
- Under 10ms latency (low latency is critical)
- Multi-region deployment
- 99.99% accuracy acceptable
- High availability required
Phase 2: Rate Limiting Algorithms (10 minutes)
You: "Before designing the system, let me discuss different rate limiting algorithms and their trade-offs."
Algorithm 1: Fixed Window Counter
How it works:
Time windows: 00:00-00:59, 01:00-01:59, etc.
Limit: 100 requests/minute
00:00 - Request 1 → Counter = 1 ✅
00:30 - Request 100 → Counter = 100 ✅
00:45 - Request 101 → Counter = 101 ❌ Rejected
01:00 - Counter resets to 0
Pros:
- Simple to implement
- Memory efficient (one counter per window)
- Easy to understand
Cons:
- Boundary problem: User can make 100 requests at 00:59 and 100 at 01:00 (200 in 2 seconds!)
- Allows burst traffic at window edges
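The fixed window counter above can be sketched in a few lines of Python (an illustrative sketch; the class and method names are mine, not from any library):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = defaultdict(int)  # (user_id, window index) -> count

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        key = (user_id, int(now // self.window))  # which window is this?
        if self.counters[key] >= self.limit:
            return False  # limit reached for this window
        self.counters[key] += 1
        return True
```

Note how the boundary problem falls out of the `int(now // self.window)` truncation: a request at second 59 and another at second 60 land in different windows, so each gets a fresh counter.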
Algorithm 2: Sliding Window Log
How it works:
Store timestamp of each request in a list
When new request arrives:
1. Remove timestamps older than 60 seconds
2. Count remaining timestamps
3. If count < limit, allow request
Example:
Limit: 3 requests/minute
Current time: 12:01:30
Request log:
- 12:00:15 (removed, older than 60 sec)
- 12:00:25 (removed)
- 12:01:00 ✅ (within 60 sec)
- 12:01:20 ✅
- 12:01:25 ✅
Count = 3, limit reached
New request at 12:01:30 → Rejected ❌
Pros:
- Very accurate - no boundary issues
- True sliding window
Cons:
- Memory intensive - store every request timestamp
- Expensive to clean up old timestamps
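A minimal in-memory sketch of the sliding window log (illustrative; a deque per user keeps cleanup cheap at the head):

```python
from collections import defaultdict, deque

class SlidingWindowLog:
    """Sliding window log: keep a timestamp per request, count the live ones."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # user_id -> timestamps, oldest first

    def allow(self, user_id, now):
        log = self.logs[user_id]
        # 1. Remove timestamps older than the window
        while log and log[0] <= now - self.window:
            log.popleft()
        # 2. Count remaining timestamps; 3. allow only if under the limit
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible here: one float per allowed request, per user, kept for the full window.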
Algorithm 3: Sliding Window Counter
How it works:
Hybrid approach - combines fixed window with sliding window
Rate = (Requests in previous window × overlap percentage)
+ (Requests in current window)
Example:
- Previous window (00:00-00:59): 80 requests
- Current window (01:00-01:59): 30 requests
- Current time: 01:15 (25% into current window)
Estimated requests in last 60 seconds:
= 80 × (1 - 0.25) + 30
= 60 + 30 = 90 requests
Pros:
- More accurate than fixed window
- Less memory than sliding window log
- Good approximation
Cons:
- Slightly less accurate than true sliding window
- More complex calculation
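The estimate above is a one-line formula; a small helper makes the arithmetic concrete (function name is mine):

```python
def sliding_window_estimate(prev_count, curr_count, window_seconds, elapsed_in_current):
    """Estimate requests in the last `window_seconds` from two fixed windows.

    `elapsed_in_current` is how far we are into the current window, in seconds.
    """
    # Share of the previous window still inside the sliding window
    overlap = 1 - elapsed_in_current / window_seconds
    return prev_count * overlap + curr_count
```

With the article's numbers (80 in the previous window, 30 so far, 15 minutes into a 60-minute window) this returns 90.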
Algorithm 4: Token Bucket
How it works:
Bucket holds tokens (capacity = rate limit)
Tokens added at constant rate (refill rate)
Each request consumes one token
Example:
Capacity: 10 tokens
Refill: 1 token/second
00:00 - Bucket has 10 tokens
00:00 - Request 1 → Consume 1 token → 9 left ✅
00:05 - 5 tokens refilled → 14 tokens, capped at 10
00:05 - 10 rapid requests → 0 tokens left
00:06 - Request 11 → 1 token refilled → Allow ✅
Pros:
- Allows short bursts (uses bucket capacity)
- Smooth traffic over time
- Memory efficient (store 2 values: tokens, last refill time)
Cons:
- Tuning refill rate can be tricky
- May allow bursts that overload system
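A single-server token bucket really does need only two stored values, as noted above. A sketch (illustrative names; time is passed in explicitly to keep it testable):

```python
class TokenBucket:
    """Token bucket: `capacity` max tokens, refilled at `refill_rate` tokens/sec."""

    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = now

    def allow(self, now, cost=1):
        # Lazy refill: add tokens for the elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The "lazy refill" trick is what keeps memory at two values: instead of a background job adding tokens every second, each check computes how many tokens accrued since the last one.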
Algorithm 5: Leaky Bucket
How it works:
Requests go into a queue (bucket)
Queue processes requests at constant rate
If queue is full, reject request
Like a bucket with a hole at the bottom -
water leaks out at constant rate
Pros:
- Processes requests at stable rate
- Good for smoothing bursty traffic
Cons:
- Recent requests may be rejected while old requests process
- Queue management overhead
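A queue-based sketch of the leaky bucket (illustrative; in a real system a worker would process the drained requests rather than discard them):

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: queue up to `capacity` requests, drain `leak_rate`/second."""

    def __init__(self, capacity, leak_rate, now=0.0):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = deque()
        self.last_leak = now

    def offer(self, request, now):
        # Drain at the constant rate first (the "hole in the bucket")
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()  # hand off to a worker in practice
            self.last_leak = now
        if len(self.queue) >= self.capacity:
            return False  # bucket overflowed: reject
        self.queue.append(request)
        return True
```

The cons listed above show up directly: a burst fills `queue`, and new requests are rejected while older queued ones are still waiting to leak out.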
Recommended Algorithm: Token Bucket
You: "For this system, I recommend Token Bucket because:
- Memory efficient - only store tokens count and timestamp
- Allows reasonable bursts - matches real usage patterns
- Simple to implement - straightforward logic
- Good for distributed systems - easy to coordinate
The burst allowance is actually beneficial - legitimate users often have bursty patterns."
Phase 3: High-Level Design (10 minutes)
You: "Let me design the architecture."
Version 1: Single Server
[Client] → [Application Server] → [Rate Limiter (in-memory)]
                                          ↓
                                    [Redis Cache]
                                          ↓
                                     [Database]
You: "Basic flow:
- Request arrives with user ID
- Rate limiter checks if user has tokens
- If yes → allow request, decrement tokens
- If no → reject with 429"
Where to Place the Rate Limiter?
You: "We have three options for placement:"
Option 1: Client-side rate limiting
[Client] → Rate limiting logic → [Server]
- ❌ Not secure - users can bypass
- Only use as optimization, not enforcement
Option 2: Server-side rate limiting
[Client] → [API Server + Rate Limiter] → [Service]
- ✅ Full control
- ❌ Adds latency to service code
Option 3: Middleware/API Gateway
[Client] → [API Gateway + Rate Limiter] → [Services]
- ✅ Centralized - one place for all services
- ✅ Services stay clean
- ✅ Easy to update rules
I'd choose Option 3: API Gateway because:
- Separation of concerns
- Reusable across services
- Easier to manage and update
API Design
You: "The rate limiter needs a simple API:"
CheckRateLimit(userID, limit, window)
Request:
{
  "user_id": "user123",
  "rate_limit": {
    "requests": 100,
    "window": "minute"
  }
}
Response:
{
  "allowed": true,
  "remaining": 45,
  "retry_after": 0
}
Or if rate limited:
{
  "allowed": false,
  "remaining": 0,
  "retry_after": 42  // seconds until reset
}
Response Headers
You: "Industry standard is to include rate limit info in response headers:"
HTTP/1.1 200 OK
X-RateLimit-Limit: 100 // Total limit
X-RateLimit-Remaining: 45 // Requests left
X-RateLimit-Reset: 1641234600 // Unix timestamp of reset
// If rate limited:
HTTP/1.1 429 Too Many Requests
Retry-After: 42 // Seconds to wait
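A tiny helper that assembles these headers could look like this (a sketch; the `X-RateLimit-*` names follow the de-facto convention shown above, and the function name is mine):

```python
def rate_limit_headers(limit, remaining, reset_ts, retry_after=None):
    """Build rate limit response headers; add Retry-After only when limited."""
    headers = {
        "X-RateLimit-Limit": str(limit),        # total limit for the window
        "X-RateLimit-Remaining": str(remaining),  # requests left
        "X-RateLimit-Reset": str(reset_ts),     # Unix timestamp of reset
    }
    if retry_after is not None:
        headers["Retry-After"] = str(retry_after)  # seconds to wait (429 case)
    return headers
```

A gateway middleware would attach these to every response, so clients can back off before ever hitting a 429.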
Phase 4: Distributed Design (15 minutes)
Interviewer: "Your single-server design won't handle 1 million requests/second. How do you scale this?"
The Distributed Problem
You: "With multiple API servers, we face a challenge:"
Scenario: User has 10 requests/second limit
[User] → [Server 1] → Redis (counter: 3)
       → [Server 2] → Redis (counter: 4)
       → [Server 3] → Redis (counter: 5)
Total: 12 requests/second → Exceeds limit!
Problem: Race conditions in distributed environment
Solution 1: Centralized Redis
Architecture:
[API Server 1] ↘
[API Server 2] → [Redis Cluster] → Check token bucket
[API Server 3] ↗
Token Bucket Implementation in Redis:
-- Lua script runs atomically in Redis
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local requested = tonumber(ARGV[3])
-- Get current tokens and last refill time (HGET returns strings, or false if missing)
local tokens = tonumber(redis.call('HGET', key, 'tokens'))
local last_refill = tonumber(redis.call('HGET', key, 'last_refill'))
-- TIME returns {seconds, microseconds}; use the seconds part
local now = tonumber(redis.call('TIME')[1])
if tokens == nil then
    -- First request for this key: start with a full bucket
    tokens = capacity
    last_refill = now
else
    -- Calculate tokens to add based on time passed
    local time_passed = now - last_refill
    local tokens_to_add = time_passed * refill_rate
    tokens = math.min(capacity, tokens + tokens_to_add)
    last_refill = now
end
-- Check if we have enough tokens
if tokens >= requested then
    tokens = tokens - requested
    redis.call('HSET', key, 'tokens', tokens)
    redis.call('HSET', key, 'last_refill', last_refill)
    return {1, tokens} -- Allowed, remaining tokens
else
    redis.call('HSET', key, 'tokens', tokens)
    redis.call('HSET', key, 'last_refill', last_refill)
    return {0, tokens} -- Rejected, current tokens
end
You: "Lua scripts in Redis execute atomically, solving the race condition problem."
Pros:
- Atomic operations - no race conditions
- Fast - Redis is in-memory
- Accurate rate limiting
Cons:
- Redis becomes bottleneck
- Single point of failure
- Network latency to Redis (~1-2ms)
Handling Redis as Bottleneck
Interviewer: "What if Redis can't handle 1M requests/second?"
You: "We need to scale Redis. Several approaches:"
Approach 1: Redis Cluster with Sharding
API Servers
     ↓
Consistent Hashing by user_id
     ↓
┌─────────┬─────────┬─────────┐
│ Redis 1 │ Redis 2 │ Redis 3 │
│ users   │ users   │ users   │
│ a-h     │ i-p     │ q-z     │
└─────────┴─────────┴─────────┘
You: "Shard by user ID to distribute load:
- Shard 1: users starting with a-h
- Shard 2: users starting with i-p
- Shard 3: users starting with q-z
This distributes both data and traffic."
Approach 2: Local Cache + Redis
API Server 1:
[Local Cache] → [Redis Cluster]
Flow:
1. Check local cache first (fast, 0ms)
2. If miss, check Redis (1-2ms)
3. Update local cache
You: "Add local in-memory cache on each API server:
Benefits:
- Reduces Redis load by 90%
- Sub-millisecond latency
- Still mostly accurate
Trade-off:
- Slightly less accurate (may allow 101 requests instead of 100)
- Acceptable since we only need 99.99% accuracy"
Multi-Region Deployment
Interviewer: "We're deployed in US, EU, and Asia. How does this affect your design?"
You: "Multi-region adds complexity. Two approaches:"
Approach 1: Regional Rate Limits
US Region:                 EU Region:
[Clients] → [Redis US]     [Clients] → [Redis EU]
Each region has independent counters
Pros:
- Low latency (no cross-region calls)
- High availability per region
Cons:
- Less accurate globally (user could hit limit in each region)
- Complex to enforce global limits
Approach 2: Global Rate Limits with Eventual Consistency
[US Redis] ← Sync → [EU Redis] ← Sync → [Asia Redis]
Sync every 1-2 seconds
Pros:
- True global limits
Cons:
- Higher latency
- Complex synchronization
- Potential conflicts
You: "I'd recommend Regional Rate Limits because:
- 10ms latency requirement eliminates cross-region calls (100+ ms)
- Most users stay in one region
- 99.99% accuracy allows regional approximation
For premium users needing strict global limits, we could use asynchronous sync with eventual consistency."
Phase 5: Deep Dive Topics (10 minutes)
Handling Different Time Windows
Interviewer: "How do you support 10/sec, 100/min, and 1000/hour simultaneously?"
You: "Store separate token buckets for each window:"
Redis structure:
rate_limit:user123:second → {tokens: 8, last_refill: 1641234567}
rate_limit:user123:minute → {tokens: 75, last_refill: 1641234560}
rate_limit:user123:hour → {tokens: 850, last_refill: 1641234000}
Check algorithm:
1. Check all buckets
2. If ANY bucket has no tokens → Reject
3. If ALL buckets have tokens → Allow + decrement all
Memory per user:
- 3 buckets × 16 bytes = 48 bytes
- 1M users = 48 MB (very manageable)
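The all-buckets check can be sketched in process like this (illustrative; in the real design these buckets would live in Redis as shown above, and the class names are mine):

```python
class Bucket:
    """Minimal token bucket, one per time window."""
    def __init__(self, capacity, refill_rate):
        self.capacity, self.rate = capacity, refill_rate
        self.tokens, self.last = float(capacity), 0.0

    def refill(self, now):
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

class MultiWindowLimiter:
    """Allow a request only if EVERY window's bucket has a token."""
    WINDOWS = {"second": 1, "minute": 60, "hour": 3600}

    def __init__(self, limits):  # e.g. {"second": 10, "minute": 100, "hour": 1000}
        self.buckets = {w: Bucket(n, n / self.WINDOWS[w]) for w, n in limits.items()}

    def allow(self, now):
        for b in self.buckets.values():
            b.refill(now)
        if any(b.tokens < 1 for b in self.buckets.values()):
            return False   # one exhausted window is enough to reject
        for b in self.buckets.values():
            b.tokens -= 1  # consume from all windows together
        return True
```

Consuming from all buckets only after all of them pass matters: otherwise a rejected request would still burn tokens in the windows that had capacity.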
Rate Limiting by Multiple Dimensions
You: "We might need rate limiting by:"
- User ID: user123 → 100/min
- IP address: 192.168.1.1 → 1000/min (prevent DDoS)
- API key: key_abc → 10000/day (tiered pricing)
- Endpoint: /api/expensive → 10/min (protect costly operations)
Implementation:
Check hierarchy:
1. Check global IP rate limit (1000/min)
2. Check user rate limit (100/min)
3. Check endpoint rate limit (10/min)
Reject if ANY limit exceeded
Redis keys:
rate_limit:ip:192.168.1.1:minute
rate_limit:user:user123:minute
rate_limit:endpoint:/api/expensive:minute
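Building the per-dimension keys and enforcing "reject if ANY is exhausted" could look like this (the key format follows the examples above; the function names and the `has_tokens` callback are my illustration):

```python
def limit_keys(ip, user_id, endpoint, window="minute"):
    """One rate limit key per dimension; a request must pass every one."""
    return [
        f"rate_limit:ip:{ip}:{window}",
        f"rate_limit:user:{user_id}:{window}",
        f"rate_limit:endpoint:{endpoint}:{window}",
    ]

def check_all(keys, has_tokens):
    """Reject if ANY dimension's limit is exhausted.

    `has_tokens` is the per-key lookup - in production, the Redis call.
    """
    return all(has_tokens(key) for key in keys)
```

Ordering the checks cheapest-first (IP before user before endpoint) lets abusive traffic be rejected before the more expensive lookups run.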
Handling Distributed Counters Race Conditions
Interviewer: "Even with Redis Lua scripts, what if two API servers call Redis at exactly the same time?"
You: "Redis is single-threaded, so it processes commands sequentially:
Time   Server 1               Redis          Server 2
0ms    CheckLimit(user123)
1ms                           tokens=10→9
2ms    ✅ Allowed                            CheckLimit(user123)
3ms                           tokens=9→8
4ms                                          ✅ Allowed
No race condition - Redis handles the checks serially.
This is one reason Redis is a great fit for rate limiting - the single-threaded model prevents races."
Rules Configuration and Updates
Interviewer: "How do you manage rate limit rules? What if we need to change limits without redeploying?"
You: "We need a rules configuration system:"
Architecture:
[Admin Dashboard]
       ↓
[Rules Service] → [PostgreSQL]
       ↓
[Config Cache] ← [API Servers fetch periodically]
Database Schema:
CREATE TABLE rate_limit_rules (
id SERIAL PRIMARY KEY,
user_tier VARCHAR(50), -- free, premium, enterprise
endpoint VARCHAR(255),
requests_per_second INT,
requests_per_minute INT,
requests_per_hour INT,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
Example rules:
free tier: 10/sec, 100/min, 1000/hour
premium tier: 100/sec, 1000/min, 10000/hour
Update flow:
- Admin updates rule in dashboard
- Rules service writes to PostgreSQL
- API servers poll for updates every 30 seconds
- Update local cache
- New limits take effect
Trade-off: 30-second delay for updates vs real-time (acceptable for most use cases)
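The 30-second polling step can be sketched as a TTL cache wrapped around a fetch function (illustrative; `fetch_fn` stands in for the call to the rules service or database):

```python
import time

class RuleCache:
    """Cache rate limit rules, re-fetching at most every `ttl_seconds`."""

    def __init__(self, fetch_fn, ttl_seconds=30):
        self.fetch = fetch_fn          # e.g. reads rules from the rules service
        self.ttl = ttl_seconds
        self.rules = None
        self.loaded_at = float("-inf")  # force a fetch on first use

    def get(self, now=None):
        now = time.time() if now is None else now
        if now - self.loaded_at >= self.ttl:
            self.rules = self.fetch()  # cache is stale: re-fetch
            self.loaded_at = now
        return self.rules
```

Every API server holds one of these, so a rule change in PostgreSQL propagates within one TTL without any redeploy.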
Complete Architecture
Final Design:
[Clients]
    ↓
[API Gateway Cluster]
    ↓
    ├────────────────────────────────┐
    ↓                                ↓
[Rate Limiter Middleware]     [Rules Config Service]
    ↓                                ↓
[Local Cache]                 [PostgreSQL]
    ↓
[Redis Cluster - Sharded by user_id]
    ↓
[Shard 1] [Shard 2] [Shard 3] ... [Shard N]
Request Flow:
When request arrives:
- API Gateway receives request
- Extract user ID from request
- Check local cache for rate limit status (fast path)
- If local cache miss, query Redis with Lua script
- Decrement tokens if available
- Update local cache
- Return 200 (allowed) or 429 (rate limited)
- Add rate limit headers to response
Performance Characteristics:
- Cache hit (90% of requests): under 1ms latency
- Cache miss: 1-3ms latency (Redis query)
- Throughput: 1M+ requests/second (distributed across API servers)
- Accuracy: 99.99%
- Availability: 99.99% (Redis cluster with replication)
Common Follow-up Questions
Q: "How do you handle clock skew in distributed systems?"
You: "Clock skew can cause issues with token refill calculations.
Solution:
- Use Redis server time (TIME command) instead of application server time
- Redis acts as single source of truth for time
- Eliminates clock drift between API servers
In the Lua script:
local now = tonumber(redis.call('TIME')[1]) -- Use Redis time (seconds), not server time
This ensures consistent time across all API servers."
Q: "What if Redis goes down?"
You: "We need a failover strategy:
Option 1: Fail Open (Allow All)
- If Redis unavailable → Allow all requests
- ✅ Service stays up
- ❌ No rate limiting during outage
Option 2: Fail Closed (Reject All)
- If Redis unavailable → Reject all requests
- ✅ Protects backend services
- ❌ Poor user experience
Option 3: Degrade to Local Rate Limiting
- If Redis unavailable → Use local in-memory counters
- ✅ Service stays up with some protection
- ❌ Less accurate (per-server limits)
I'd recommend Option 3 with Redis high availability:
- Redis Cluster with replication (master-replica)
- Automatic failover with Redis Sentinel
- Target: 99.99% Redis uptime
During rare outages, degrade to local limiting."
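Option 3 can be sketched as a wrapper that falls back to a per-server counter when the distributed check fails (illustrative names; `remote_check` stands in for the Redis Lua script call):

```python
class LocalCounter:
    """Crude per-server fallback: fixed 1-second window per user."""
    def __init__(self, limit):
        self.limit, self.counts = limit, {}

    def allow(self, user_id, now):
        key = (user_id, int(now))  # current one-second window
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit

class ResilientLimiter:
    """Use the distributed check when available, degrade to local otherwise."""
    def __init__(self, remote_check, local_limiter):
        self.remote_check = remote_check  # e.g. the Redis Lua script call
        self.local = local_limiter

    def allow(self, user_id, now):
        try:
            return self.remote_check(user_id)
        except ConnectionError:
            # Redis unreachable: keep serving with per-server limits
            return self.local.allow(user_id, now)
```

The trade-off from Option 3 is visible here: with N API servers, the effective limit during an outage can grow to N times the local limit.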
Q: "How do you handle sudden traffic spikes?"
You: "Token bucket naturally handles small bursts, but for large spikes:
Strategy 1: Circuit Breaker
If error rate > 50% for 10 seconds:
→ Open circuit
→ Reject requests faster (fail fast)
→ Try again after 30 seconds
Strategy 2: Priority Queue
- High-priority requests (paying customers) → Always allow
- Low-priority requests → Rate limit aggressively
- Implementation: Multiple token buckets per user tier
Strategy 3: Adaptive Rate Limiting
Monitor system health:
If CPU > 80% → Reduce all rate limits by 50%
If CPU < 50% → Restore normal limits
This prevents cascading failures."
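Strategy 3's policy fits in a few lines; note the gap between the two thresholds acts as a hysteresis band so limits do not flap (the thresholds are the hypothetical ones from the text, and the function is my sketch):

```python
def adjusted_limit(base_limit, cpu_utilization, currently_reduced=False):
    """Adaptive limits with hysteresis: halve above 80% CPU, restore below 50%."""
    if cpu_utilization > 0.80:
        return base_limit // 2, True   # shed load: halve all limits
    if cpu_utilization < 0.50:
        return base_limit, False       # healthy again: restore limits
    # Between the thresholds: keep whatever state we were already in
    return (base_limit // 2, True) if currently_reduced else (base_limit, False)
```

A control loop would call this on each health check, feeding the returned state back in as `currently_reduced`.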
Q: "How would you add analytics and monitoring?"
You: "Track these metrics:
Real-time Metrics (to Redis/Prometheus):
- Total requests/second
- Rejected requests/second
- Rate limit hit rate by user, endpoint
- p50, p95, p99 latency of rate limit checks
Long-term Analytics (to Data Warehouse):
- Async write rejected requests to Kafka → S3
- Analyze patterns: which users hit limits, when, why
- Inform capacity planning and limit adjustments
Alerting:
- Alert if rate limit rejection rate > 10%
- Alert if Redis latency > 5ms
- Alert if Redis unavailable
Dashboard:
- Show rate limit usage per user tier
- Show top users hitting limits
- Show system-wide capacity"
What Makes This Answer Strong
This walkthrough demonstrates:
- Algorithm knowledge - explained 5 algorithms with trade-offs
- Distributed systems expertise - handled race conditions, sharding, multi-region
- Production readiness - covered monitoring, failover, configuration
- Trade-off analysis - accuracy vs latency, fail open vs closed
- Scalability - designed for 1M requests/second
- Real-world experience - included headers, status codes, best practices
Practice Exercise
Set a timer for 45 minutes and design a rate limiter:
- Choose an algorithm (start with token bucket)
- Design single-server version
- Scale to distributed system
- Add monitoring and configuration
- Handle edge cases (clock skew, Redis failure)
Key questions to ask yourself:
- Where is the rate limiter placed?
- How do I prevent race conditions?
- What happens during Redis outage?
- How do I update limits without downtime?
The more you practice, the more these patterns become second nature.
Now go protect some APIs.