A candidate interviewing for a Senior Engineer role at Meta was asked to design a rate limiter. Another candidate in Google's L5 loop got the same question. I've been asked it three times across different companies. Rate-limiting questions look simple until you add one layer of complexity:
– Add distributed rate limiting? Now you're dealing with race conditions and clock skew.
– Add multiple rate limit tiers? Welcome to priority queues and quota management.
– Add per-user, per-IP, and per-API-key limits? Your Redis bill just exploded.
Here's my personal checklist of 15 things you must get right when building rate limiters:
1. Always rate limit on the server, not the client → Client-side limits are useless. They're easily bypassed, so always enforce limits on your backend.
2. Choose the right placement → For most web APIs, place the rate limiter at the API gateway or load balancer (the "edge") for global protection and minimal added latency.
3. Identify users correctly → Use a combination of user ID, API key, and IP address. Apply stricter limits for anonymous/IP-only clients, looser ones for authenticated or premium users.
4. Support multiple rule types → Allow per-user, per-IP, and per-endpoint limits. Make rules configurable, not hardcoded.
5. Pick an algorithm that fits your needs → Know the pros and cons:
– Fixed Window: Easy, but suffers from burst issues.
– Sliding Log: Accurate, but memory-heavy.
– Sliding Window Counter: Good balance, small memory footprint.
– Token Bucket: Handles bursts and steady rates; an industry standard for distributed systems.
6. Store rate limit state in a fast, shared store → Use an in-memory cache like Redis or Memcached. Every gateway instance must read and write to this store so limits are enforced globally.
7. Make every check atomic → Use atomic operations (e.g., Redis Lua scripts or MULTI/EXEC) to avoid race conditions and double-accepting requests.
8. Shard your cache for scale → Don't rely on a single Redis instance. Use Redis Cluster or consistent hashing to scale horizontally and handle millions of users/requests.
9. Build in replication and failover → Each cache node should have replicas. If a primary fails, a replica takes over. This keeps the system available and fault-tolerant.
10. Decide your "failure mode" → Fail-open (let all requests through if the cache is down) risks backend overload. Fail-closed (block all requests) means user-facing downtime. For critical APIs, prefer fail-closed to protect the backend.
11. Return proper status codes and headers → Use HTTP 429 for "Too Many Requests." Include headers like X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After so clients know when to back off.
12. Use connection pooling for cache access → Avoid reconnecting to Redis on every check. Pool connections to minimize latency.
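Point 7 (atomic checks) is where naive implementations go wrong: a separate read-then-write lets two concurrent requests both slip under the limit. A minimal sketch of the idea in Python, using a lock to stand in for the atomicity a Redis Lua script or MULTI/EXEC would give you in production; the class name and limits here are illustrative, not from the post:

```python
import threading
import time

class AtomicFixedWindowLimiter:
    """Fixed-window counter where check-and-increment is one atomic step."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}            # (key, window index) -> request count
        self.lock = threading.Lock()

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window_index = int(now // self.window)
        with self.lock:             # read + compare + write as one step
            count = self.counts.get((key, window_index), 0)
            if count >= self.limit:
                return False
            self.counts[(key, window_index)] = count + 1
            return True

limiter = AtomicFixedWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow("user-1", now=100.0) for _ in range(5)]
# first 3 allowed, last 2 rejected; a new window at now=161.0 resets the count
```

In Redis the equivalent trick is running INCR and the limit comparison inside one Lua script, so no other client's command can interleave between the read and the write.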
Implementing Rate Limiting
Summary
Implementing rate limiting means setting rules to control how often someone can make requests to a service or API, preventing abuse and helping systems stay stable during high demand. This process involves selecting the right method, deciding what to measure (like users or IP addresses), and using fast storage to keep track of usage.
- Choose your algorithm: Consider the strengths of token bucket, leaky bucket, sliding window, or fixed window approaches based on your traffic patterns and needs for burst handling or fairness.
- Set limits thoughtfully: Define whether limits apply to users, IP addresses, or API keys, and set boundaries that balance user experience with protection against overload.
- Plan for scaling and failure: Use fast, distributed storage like Redis to track usage across servers, and decide how your system should behave if the rate limiter becomes unavailable.
-
🛑 "429 Too Many Requests" isn't just an error code; it's a survival strategy for your distributed systems. Stop treating rate limiting as a simple counter. To prevent crashes, you need the right algorithm. This visual explains the patterns you need to know.
𝐇𝐨𝐰 𝐰𝐞 𝐜𝐨𝐮𝐧𝐭:
1️⃣ Token Bucket: Each user gets a "bucket" of tokens that refills at a constant rate. Great for bursty traffic. If a user has been idle, they accumulate tokens and can make a sudden burst of requests without being throttled immediately. Use case: social media feeds or messaging apps.
2️⃣ Leaky Bucket: Requests enter a queue and are processed at a constant, fixed rate. Acts as a traffic shaper: it smooths out spikes, protecting your database from write-heavy shockwaves. Use case: throttling network packets or writing to legacy systems.
3️⃣ Fixed Window: A simple counter resets at specific time boundaries (e.g., the top of the minute). Easiest to implement but suffers from the "boundary double-hit" issue (e.g., 100 requests at 12:00:59 and 100 more at 12:01:01). Use case: basic internal tools where precision isn't critical.
4️⃣ Sliding Window Log: Tracks the timestamp of every request. Solves the boundary issue completely. It's highly accurate but expensive on memory (O(N) space) because you store logs, not just a count. Use case: high-precision, low-volume APIs.
5️⃣ Sliding Window Counter: The hybrid approach. Approximates the rate by weighting the count of the previous window against the current window. Low memory footprint, high accuracy. Use case: large-scale systems handling millions of RPS.
𝐖𝐡𝐞𝐫𝐞 𝐰𝐞 𝐞𝐧𝐟𝐨𝐫𝐜𝐞:
6️⃣ Distributed Rate Limiting: Essential for microservices. You cannot rely on local memory; you need a centralized store (like Redis with Lua scripts) to maintain a global count across the cluster.
7️⃣ Fixed Window with Quota: Often distinct from technical throttling. This is business logic: hard caps over long periods (months/years). Use case: tiered billing plans (e.g., "Free Tier: 10k calls/month").
8️⃣ Adaptive Rate Limiting: The "smart" limiter. It doesn't use static numbers but monitors system health (CPU, memory, latency). If the system struggles, it tightens the limits automatically. Use case: auto-scaling systems and disaster recovery.
𝐖𝐡𝐨 𝐰𝐞 𝐥𝐢𝐦𝐢𝐭:
9️⃣ IP-Based Rate Limiting: The first line of defense. Limits based on the source IP to prevent botnets or DDoS attacks. Use case: public-facing unauthenticated APIs.
🔟 User/Tenant-Based Rate Limiting: Limits based on API key or user ID. Ensures one heavy user doesn't degrade performance for others (the "noisy neighbor" problem). Use case: SaaS platforms and multi-tenant architectures.
💡 For most production systems, Sliding Window Counter combined with distributed limiting is the gold standard. It offers the best balance of memory efficiency and user fairness.
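The weighted estimate behind the Sliding Window Counter (pattern 5️⃣ above) fits in a few lines. This is an illustrative sketch, not any vendor's actual implementation; the class name and limits are invented:

```python
class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window's count
    by how much of it still overlaps the sliding window."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.prev_count = 0
        self.curr_count = 0
        self.curr_start = None

    def allow(self, now):
        start = (now // self.window) * self.window
        if self.curr_start is None:
            self.curr_start = start
        if start != self.curr_start:
            # Moved to a new fixed window; the old count only carries over
            # if the new window is directly adjacent to it.
            self.prev_count = self.curr_count if start - self.curr_start == self.window else 0
            self.curr_count = 0
            self.curr_start = start
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now - self.curr_start) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated >= self.limit:
            return False
        self.curr_count += 1
        return True

lim = SlidingWindowCounter(limit=10, window=60.0)
first_minute = [lim.allow(30.0) for _ in range(12)]  # 10 allowed, 2 rejected
next_minute = lim.allow(90.0)  # previous window weighted 0.5 -> estimate 5, allowed
```

Two counters and one timestamp per key is all the state this needs, which is why it scales where the full log does not.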
-
You're in a backend interview. They ask: "Design a rate limiting system for an API used by millions. Where do you start?" Here's how you impress 👇
1. Start with the goal: Prevent abuse, ensure fair usage, and protect system stability.
2. Choose a rate-limiting strategy:
- Fixed window
- Sliding window
- Token bucket (most flexible for millions of users)
- Leaky bucket
3. Decide on the granularity:
- Per user? Per IP? Per API key?
- Define the limit (e.g. 100 req/min)
4. Pick a fast, distributed store: Use Redis to track usage. Fast reads/writes, with key expiry support.
5. Implementation flow:
- Each request checks Redis.
- If within limit → continue.
- Else → return 429 Too Many Requests.
6. Make it scalable:
- Use Redis clusters for horizontal scaling.
- Hash keys for sharding.
- Push logic to API gateways like Kong or Cloudflare Workers for global edge enforcement.
7. Add burst control and global override flags: Just in case you need to protect key endpoints during load spikes.
-
"Design a Rate Limiter" sounds simple? It's a deep dive into latency, consistency, and system boundaries. Here's how I'd break it down 👇
Step 1: Requirement Gathering
Functional:
- Allow X requests per user/IP per Y time window
- Drop or queue extra requests after the limit
- Load and enforce custom rate rules per user/service
- Works for public APIs and internal microservices
Non-Functional:
- Low latency (<10ms)
- Highly available (99.99%)
- Horizontally scalable to millions of users
- Configurable per endpoint or client
- Graceful degradation under high load
Step 2: Estimate Capacity
Say 10M users, each allowed 100 requests/min. If roughly 0.1% of users are active at any moment, that's ~16,000 RPS in steady state. Bursts? Easily 10x. Must support spiky traffic. 99th-percentile latency must stay low (<20ms).
Step 3: Choose Your Algorithm
Most common:
- Fixed Window Counter (simple, but sharp reset edges)
- Sliding Window Log (accurate, but memory-heavy)
- Sliding Window Counter (approximation, good balance)
- Token Bucket / Leaky Bucket (supports bursts, smoother)
Pick based on needs: burst friendliness? memory bound? precise per-second fairness?
Step 4: Architecture
Components:
- User sends a request to the API Gateway
- Gateway checks rate rules from Redis or a cache
- A message queue is used to queue or delay overflow requests
- If over the limit, the request is dropped or failed (429)
- Otherwise, it is forwarded to the server
- A worker processes logs and enforces custom limits from the Limit Rules DB
- The system supports async/batch enforcement and graceful queuing
✅ Caching rules improves performance
✅ A message queue allows graceful fallbacks
✅ Redis enables fast, atomic counter ops
Step 5: Trade-offs
Comment with the trade-offs you would discuss for this 👇
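The Step 2 numbers are worth sanity-checking in an interview: 10M users all spending their full 100 req/min allowance would be ~16.7M RPS, so a steady-state figure like 16,000 RPS implicitly assumes only a small fraction of users are active at once. A quick back-of-the-envelope, with that active fraction as an explicit (assumed) input:

```python
users = 10_000_000
limit_per_min = 100          # per-user allowance
active_fraction = 0.001      # assumption: ~0.1% of users active at any moment

worst_case_rps = users * limit_per_min / 60
steady_state_rps = worst_case_rps * active_fraction
burst_rps = steady_state_rps * 10   # the post assumes 10x spikes

print(round(worst_case_rps))    # ~16.7M RPS if every user maxes out
print(round(steady_state_rps))  # ~16,667 RPS steady state
print(round(burst_rps))         # ~166,667 RPS during bursts
```

Stating the active fraction out loud turns a suspicious number into a defensible estimate, and it also tells you how much headroom the burst case really needs.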
-
A candidate for an L5 role at Google failed their system design round because they couldn't explain trade-offs well. The question was simple: "What store do you pick for a public API rate limiter?" The candidate answered "Redis" within five seconds. It wasn't wrong, but it was incomplete. Let me explain…
High-scale design requires you to solve the constraints before you name a database. The storage choice should be the very last thing you decide.
a) Define the performance requirements
A rate limiter is a tax on every incoming request. You have to establish a latency budget before you look at any tech stack.
– Exactness: Can you afford a 5% margin of error in the count?
– Burst tolerance: How will the system react to a 10x spike in 100ms?
– Coordination: Do multiple API nodes need to share a global counter?
If the latency budget is under 1ms, a network call to a remote database is off the table. You have to keep the state local.
b) Evaluate the storage tier trade-offs
Every choice dictates how your API behaves when traffic hits. You are deciding where the complexity lives.
– In-memory (local): The fastest path; it uses the app's own RAM. Latency is negligible, but every node has its own version of the truth.
– Distributed (Redis): All nodes share a single counter. You get global consistency, but you add a network hop to every single API call.
– Durable (SQL/NoSQL): Use this for billing-critical limits that must persist across restarts. The latency cost is massive.
c) Design for failure behavior
A centralized store is a single point of failure. If the rate limiter is down, you have to decide the fate of your API.
– Fail open: Allow all traffic. This protects the user experience but risks a database meltdown during an attack.
– Fail closed: Block all traffic. This protects the infrastructure but destroys your uptime.
The store choice should support your fallback strategy. If you cannot fail closed, you likely need a hybrid approach with local overrides.
d) Match the store to the constraint
Finalize the decision using data. Avoid choosing a tool based on personal preference.
– For high-speed APIs where global exactness is secondary, use local in-memory stores with sticky sessions.
– For public APIs requiring a strict global ceiling, use a distributed cache like Redis or Memcached.
– For billing-critical systems, use a local count that syncs to a durable store asynchronously.
Start with the constraints. The tool name is just the final piece of the puzzle. Design for the failure scenario first.
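The fail-open vs. fail-closed decision in (c) can be isolated into a thin wrapper around whatever store you pick, so the policy is explicit and testable rather than an accident of exception handling. A hedged sketch; the store interface and names are invented for illustration:

```python
class StoreUnavailable(Exception):
    """Raised when the rate-limit store can't be reached."""

def check_rate_limit(store_check, key, fail_open=True):
    """Run a rate-limit check, applying an explicit failure policy.

    store_check: callable(key) -> bool (True = request allowed)
    fail_open:   True  -> allow traffic when the store is unavailable
                 False -> block traffic when the store is unavailable
    """
    try:
        return store_check(key)
    except StoreUnavailable:
        return fail_open

# A store that is down, to exercise both policies:
def broken_store(key):
    raise StoreUnavailable()

print(check_rate_limit(broken_store, "user-1", fail_open=True))   # True: traffic flows
print(check_rate_limit(broken_store, "user-1", fail_open=False))  # False: traffic blocked
```

Making the policy a parameter also lets different endpoints choose differently: fail-open for read-only traffic, fail-closed for the billing-critical paths the post warns about.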
-
That one client who hammers your API 1000x more than anyone else? Without safeguards, they can bring your entire system to its knees. Rate limiting is the quiet guardian that prevents this. It ensures that no single client, malicious or accidental, can overwhelm your backend. Think of it as a system's traffic cop, regulating the flow so everything keeps running smoothly.
𝗪𝗵𝘆 𝗥𝗮𝘁𝗲 𝗟𝗶𝗺𝗶𝘁𝗶𝗻𝗴 𝗠𝗮𝘁𝘁𝗲𝗿𝘀:
Rate limiters aren't just about blocking requests. They serve several critical purposes:
- Protect the system: Keeps servers running during sudden traffic spikes or DDoS attacks.
- Fair access: Ensures all users and clients get a fair share of resources.
- Control costs: Prevents unnecessary consumption that can inflate infrastructure costs.
- Enforce SLAs: Supports tiered service plans by applying different limits based on subscription levels.
- Improve security: Mitigates abuse, preventing attacks or errant clients from monopolizing resources.
𝗖𝗼𝗿𝗲 𝗔𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺𝘀:
Choosing the right algorithm depends on your traffic patterns and accuracy requirements. Here are the most common approaches:
- Token Bucket: Requests need tokens; tokens refill at a steady rate. Allows bursts while keeping the long-term rate in check.
- Leaky Bucket: Requests flow out at a fixed rate. Smooths spikes but can be too rigid for sudden bursts.
- Fixed Window Counter: Counts requests in fixed intervals (e.g., 100 requests/minute). Simple, but bursty at window edges.
- Sliding Window Log: Stores timestamps for requests. Very accurate, but memory-intensive.
- Sliding Window Counter: A hybrid of fixed + sliding. Tracks counts in the current and previous windows, then calculates a weighted sum. More accurate, less memory than logs.
𝗕𝗲𝘀𝘁 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀:
Implementing rate limiting effectively requires more than just an algorithm:
- Publish clear limits: Include headers so clients can self-regulate.
- Return meaningful errors: Use HTTP 429 with Retry-After guidance.
- Enforce at the edge: Use gateways or CDNs to reduce backend load.
- Combine local + global checks: Local counters for speed, global stores for fairness.
- Monitor continuously: Track blocked requests, spikes in 429s, and limiter performance.
𝗖𝗼𝗺𝗺𝗼𝗻 𝗣𝗶𝘁𝗳𝗮𝗹𝗹𝘀:
Even well-intentioned rate limiting can go wrong:
- Hot key overload: A single heavy user can dominate traffic.
- Retry storms: Clients retrying simultaneously can overload systems.
- Race conditions: Non-atomic counters can allow extra requests.
- Ignoring client behavior: Clients may ignore Retry-After headers.
- Gateway-only enforcement: Gateways may miss business-specific rules.
Rate limiting is not about punishment. It's about ensuring that every user gets a fair slice of your system, even under pressure. Rate limiting isn't just throttling. It's fairness, protection, and resilience built into your architecture. And most importantly, keeping the lights on.
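The "publish clear limits" and "meaningful errors" practices above amount to attaching a handful of headers to every response. A minimal sketch of what a limiter might emit; the header names follow the widely used (but unofficial) X-RateLimit-* convention, and the helper function itself is invented:

```python
import time

def rate_limit_headers(limit, remaining, reset_epoch, now=None):
    """Build the headers a client needs to self-regulate.

    limit:       max requests allowed in the window
    remaining:   requests left in the current window
    reset_epoch: Unix time when the window resets
    """
    now = time.time() if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Tell over-limit clients exactly how long to back off (HTTP 429 body).
        headers["Retry-After"] = str(max(0, int(reset_epoch - now)))
    return headers

# Client is over the limit; the window resets in 30 seconds:
print(rate_limit_headers(limit=100, remaining=0, reset_epoch=1030, now=1000))
```

Well-behaved clients read Retry-After and sleep instead of retry-storming, which is exactly the pitfall the post lists next.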
-
Your rate limiter doesn't work at scale. Here's what actually happens under the hood.
Every engineer has implemented a rate limiter. Almost nobody has implemented one that survives 100K+ RPS across a distributed fleet. Let me break down what actually happens when you move past the textbook.
Step 1: Fixed window is broken, and everyone starts here anyway
Count requests in 1-minute buckets. A user sends 60 requests at 0:59 and another 60 at 1:01. Your limit is 60/min. They just sent 120 requests in 2 seconds. This is the boundary burst problem: fixed windows create a 2x burst at every window edge. Every system design interview candidate knows this. Very few know what actually replaces it in production.
Step 2: Sliding window log, correct but brutal
Store the timestamp of every single request. On each new request, evict timestamps older than your window and count what's left. This is mathematically precise. It's also a memory disaster. 1M users × 100 requests/min = 100M timestamps in memory. Each timestamp is 8 bytes; that's ~800MB just for rate limiting state. And you're doing a range eviction + count on every single request. Nobody runs this in production at scale, but it's the foundation for understanding the trick that actually works.
Step 3: Sliding window counter, the production sweet spot
Here's what Cloudflare, Stripe, and most serious systems actually use. Split time into sub-windows (say, 1-second granularity for a 60-second window). Maintain a counter per sub-window. On each request:
-> Sum all counters in the current window
-> Weight the oldest sub-window by how much of it overlaps with the current sliding window
So if you're 700ms into the current second, the sub-window from 60 seconds ago gets weighted at 0.3. One float multiplication and an array sum. Memory per user: 60 integers + 1 timestamp. About 250 bytes, versus 800 bytes per request in the log approach. This is good enough for most systems. The error margin is tiny and always conservative: you might reject a few extra requests at boundaries, never let extra through.
Step 4: Token bucket, when you need burst tolerance
Rate limiting isn't always "block after N." Sometimes you want to allow bursts but enforce an average rate over time. Token bucket does exactly this.
-> Bucket holds max B tokens (burst capacity)
-> Tokens refill at rate R per second
-> Each request costs 1 token
-> No tokens? Rejected.
The beauty: you only store 2 values per user, the last refill timestamp and the current token count. On each request:

    tokens_to_add = (now - last_refill) * R
    current_tokens = min(B, stored_tokens + tokens_to_add)

If current_tokens >= 1, serve the request, decrement, and update the timestamp. That's it. This is what API gateways like Kong and AWS API Gateway use under the hood. It naturally smooths traffic without hard edges.
Now make it distributed. This is where it gets ugly. To learn in detail: https://lnkd.in/gVm5dUHn
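The two-value token bucket described in Step 4 is short enough to write out in full. A sketch under the post's own assumptions (B tokens of burst capacity, refill rate R per second, lazy refill on each request); the class name and test values are illustrative:

```python
class TokenBucket:
    """Token bucket storing only two values per user:
    current token count and last refill timestamp."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # B: burst capacity
        self.refill_rate = refill_rate  # R: tokens added per second
        self.tokens = float(capacity)   # start full
        self.last_refill = 0.0

    def allow(self, now):
        # Lazily refill based on elapsed time; cap at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=1.0)  # 5-burst, 1 req/s average
burst = [bucket.allow(0.0) for _ in range(6)]      # 5 allowed, 6th rejected
later = bucket.allow(2.0)                          # 2s later: 2 tokens refilled
```

Passing the clock in as an argument keeps the logic deterministic and testable; a real deployment would use the wall clock and, as the post says, push the whole read-modify-write into an atomic store operation once it goes distributed.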
-
You're in a backend interview. They ask: "How would you design a rate limiting system for an API used by millions of users?" Here's how to approach it:
First, clarify the goal:
- Protect backend services from abuse/overload.
- Ensure fair usage across users.
- Maintain availability and performance under heavy traffic.
Common algorithms to mention:
- Fixed Window: simple, but unfair at boundaries.
- Sliding Window Log: accurate but memory-heavy.
- Sliding Window Counter: balances fairness and efficiency.
- Token Bucket / Leaky Bucket: industry standard, smooths bursts.
Implementation details:
- Store counters in a fast, centralized store (e.g., Redis).
- Use atomic increments with TTL for window resets.
- For distributed APIs, ensure coordination across nodes.
Scaling considerations:
- Sharding keys → distribute counters across nodes.
- Approximation (e.g., probabilistic counters) to reduce memory.
- Hierarchical limits → user-level, API-key-level, and global limits.
Advanced topics (FAANG-level):
- Distributed coordination → avoid hot keys in Redis.
- Multi-region rate limits → sync across datacenters.
- Adaptive limits → dynamic throttling based on system load.
Trade-offs:
- Token bucket allows bursts but protects the average rate.
- Sliding window provides fairness but costs more memory.
- Fixed window is simple but can be gamed.
How to answer in the interview:
- Start with the goal (fairness, protection).
- Propose an algorithm (Token Bucket is a safe bet).
- Discuss scaling with Redis/distributed systems.
- Highlight trade-offs and real-world issues.
A strong closing answer: "I'd use a token bucket implemented with Redis for atomic ops, sharded across nodes for scale, and layered limits (user + global). This balances fairness, efficiency, and resilience at scale." Want more interview tips: backendweekly.dev
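The "hierarchical limits" idea above (user-level plus global) composes naturally: a request must pass every layer. A sketch with invented names and limits, reusing a deliberately trivial fixed-window counter for each layer:

```python
import collections

class FixedWindowCounter:
    """Simplest possible per-key fixed-window counter (illustrative only)."""
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.counts = collections.Counter()

    def allow(self, key, now):
        bucket = (key, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True

class LayeredLimiter:
    """A request passes only if EVERY layer (user, API key, global) allows it."""
    def __init__(self, layers):
        self.layers = layers  # list of (name, limiter, key_fn)

    def allow(self, request, now):
        return all(limiter.allow(key_fn(request), now)
                   for _, limiter, key_fn in self.layers)

limiter = LayeredLimiter([
    ("per-user", FixedWindowCounter(limit=2, window=60), lambda r: r["user"]),
    ("global",   FixedWindowCounter(limit=3, window=60), lambda r: "all"),
])

a = [limiter.allow({"user": "A"}, now=0) for _ in range(3)]  # 3rd trips per-user cap
b = [limiter.allow({"user": "B"}, now=0) for _ in range(2)]  # 2nd trips global cap
```

One subtlety worth mentioning in an interview: this naive version consumes per-user quota even when a later layer rejects the request (the `all()` evaluates layers left to right). A production design would check all layers before committing any increments, or refund on rejection.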
-
This is how companies like Google & Meta handle peak traffic and protect shared resources. It all comes down to implementing rate limiting that prevents services from crashing under massive user loads, and for this purpose the Token Bucket Algorithm is commonly used.
► What is the Token Bucket Algorithm?
A rate-limiting algorithm that controls how requests are processed by a system, ensuring that they don't exceed a specified limit. It's used to manage traffic and prevent system overloads.
► How It Works:
1. Token Bucket Initialization: A bucket is created with a fixed capacity to hold tokens.
2. Token Generation: Tokens are added to the bucket at a constant rate (e.g., 1 token per second).
3. Request Arrival: When a request is made, the system checks the bucket for available tokens.
4. Token Consumption: If a token is available, it is removed from the bucket and the request is processed.
5. Request Limitation: If the bucket is empty (no tokens available), the request is denied or delayed until new tokens are added.
► Large-Scale Application (e.g., Instagram, Facebook, Google):
- Instagram: When users engage with content by liking posts or commenting, the Token Bucket Algorithm ensures that these actions are spread out over time. This prevents spikes that could overwhelm servers, maintaining consistent performance.
- Google APIs: For services like the Google Maps API, the algorithm limits the number of API calls a user can make in a given time frame. This protects the system from abuse and ensures fair resource allocation across millions of users.
- Meta (Facebook): During high-traffic events, such as live streaming or viral posts, the algorithm manages the request rate to ensure that servers remain responsive, avoiding downtime.
► Why It's Crucial for Large-Scale Systems:
- Scalability: The Token Bucket Algorithm scales with user demand, handling millions of requests by evenly distributing load over time.
- Fair Usage: It ensures that all users have equitable access to system resources by limiting the request rate per user or client.
- Performance: By controlling the flow of requests, the algorithm prevents system overloads, ensuring that services remain fast and reliable even under heavy load.
► Real-World Example:
When you click "like" on Instagram, your request is checked against the token bucket. If tokens are available, your like is processed immediately. If not, you might experience a slight delay, preventing the system from being overwhelmed by too many likes at once.
-
A Senior Security Engineer candidate was asked to design an API rate limiting and DDoS protection system during his interview at AWS. Another candidate in a different loop at Meta got the same prompt. Rate limiting looks simple until you add one layer of reality:
– Add distributed systems? Now you need coordination across regions without creating a single point of failure.
– Add legitimate traffic spikes? Now your rate limiter becomes the bottleneck that kills your own product launch.
– Add sophisticated bots? Now simple IP-based limiting is useless and you're in an arms race.
– Add cost considerations? Now every check you add impacts latency and infrastructure spend.
– Add false positives? Now you're blocking paying customers and the business is losing revenue.
– Add DDoS at scale? Now your protection layer itself becomes the target and goes down first.
Here's my checklist of 15 things you must get right when building API rate limiting and DDoS protection:
1. Start with the threat model and business requirements → Define what you're protecting against: credential stuffing, scraping, application-layer DDoS, or infrastructure overload. Different threats need different strategies.
2. Choose your rate limiting scope: per user, per IP, per API key, or hybrid → IP-based is easiest but breaks with NAT and VPNs. User-based is accurate but requires authentication. API keys give you control but can be shared or leaked.
3. Pick the right algorithm for your use case → Token bucket for burst allowance, leaky bucket for a smooth rate, sliding window for accuracy, fixed window for simplicity. Each has different trade-offs on fairness and resource usage.
4. Design for distributed rate limiting without central coordination → Local counters with eventual consistency are faster than global locks. Accept slight over-limit in exchange for no single point of failure. Use gossip protocols or streaming aggregation.
5. Implement progressive response strategies, not just hard blocks → Start with warnings, then add latency, then CAPTCHA, then temporary blocks, then permanent bans. Give legitimate users a way out before you lock them out completely.
6. Build allowlists and denylists that don't become your weakness → Allowlists for known good actors (your mobile app, partner APIs, health checks). But make them auditable and time-limited. One compromised allowlisted key shouldn't bypass everything.
7. Detect and fingerprint beyond IP addresses → Use TLS fingerprinting, user agent patterns, behaviour analysis, request ordering, timing patterns. Bots will rotate IPs but often can't hide their fingerprint completely.
8. Separate volumetric DDoS protection from application-layer protection → Volumetric attacks (network floods) need edge protection at the CDN/ISP level. Application-layer attacks (Slowloris, HTTP floods) need intelligent rate limiting closer to your application. Different layers, different tools.
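The progressive ladder in point 5 maps naturally to an escalation table keyed by how many times a client has recently exceeded its limits. A hedged sketch; the thresholds and action names are invented for illustration:

```python
def escalation_action(violations):
    """Map a client's recent limit violations to a progressively
    harsher response, instead of jumping straight to a hard block."""
    ladder = [
        (0,  "allow"),       # clean client: serve normally
        (1,  "warn"),        # include a warning header, still serve
        (3,  "delay"),       # add artificial latency
        (10, "captcha"),     # challenge before serving
        (50, "temp_block"),  # 429 with a long Retry-After
    ]
    for threshold, action in ladder:
        if violations <= threshold:
            return action
    return "ban"             # persistent abuser: permanent block

print(escalation_action(2))    # a few violations: delay, don't block
print(escalation_action(100))  # persistent abuse: ban
```

Keeping the ladder as data rather than branching logic makes the thresholds auditable and easy to tune per endpoint, which matters when false positives cost revenue (the post's fifth "layer of reality").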