HTTP 429 isn't an error. It's a decision. And I built the system that makes that decision.

Most developers have hit rate limits. Very few understand how they actually work under the hood. So I built a production-grade API Rate Limiter from scratch — not a clone, not a tutorial.

➣ What it does
Controls how many requests a client can make within a time window:
• Within limits → HTTP 200 ✅
• Cross the limit → HTTP 429 🚫

This is what protects real APIs from:
✓ bot traffic
✓ abuse
✓ infrastructure overload

➣ 3 algorithms. 3 different trade-offs.
• Token Bucket → absorbs burst traffic (user-facing APIs)
• Sliding Window → fair distribution, no boundary exploits
• Leaky Bucket → strict constant rate (payments, critical systems)

👉 Switch between them LIVE — no restart, no downtime.

➣ Where theory meets reality
1. Race condition: two requests see "1 token left" → both pass → fixed using serialized writes + DB transactions
2. Write lock contention: high traffic = silent failures → fixed with retry logic + scoped transactions
3. Runtime algorithm switching: changing logic without breaking user state or API keys → required careful state isolation

🛠️ Stack: Python · FastAPI · SQLite · Vanilla JS · Chart.js
🔗 Live Demo: https://lnkd.in/dmK_WF6V
💻 GitHub: https://lnkd.in/dnK5AAPZ

Built this from scratch to understand how production systems think about traffic control. 🚀

Would really appreciate feedback — especially from engineers who've worked on distributed systems or high-traffic APIs.

#BackendEngineering #Python #FastAPI #SystemDesign #SoftwareEngineering #Backend
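The token-bucket idea fits in a few lines. This is a generic single-process sketch (the class and parameter names are mine, not the project's code): a bucket of `capacity` tokens absorbs bursts, while `refill_rate` sets the sustained requests-per-second.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: capacity caps bursts,
    refill_rate sets the sustained requests-per-second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # would map to HTTP 200
        return False      # would map to HTTP 429

bucket = TokenBucket(capacity=2, refill_rate=0.5)
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

Note that `allow()` is a read-modify-write: with concurrent requests, exactly the "1 token left" race in the post appears unless this section is serialized (a lock, or the DB transaction the author used).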
My observability tool was lying to me. And it took me a full day to figure out why.

I was wiring Langfuse into my LangGraph pipeline to trace every node — latency, token usage, eval scores. Standard production setup.

I installed the SDK. Connected it. Sent a trace. HTTP 200. Auth successful. No errors.

I opened the Langfuse UI. Zero traces.

I sent another. Still 200. Still nothing.

For hours I assumed the bug was in my code. Wrong callback path. Wrong credentials. Wrong graph configuration. I rewrote the integration three times.

Then I read the actual error more carefully.

The SDK was version 4. The server was version 3. SDK v4 uses the OpenTelemetry ingestion protocol. Server v3 uses the classic ingestion protocol. They're incompatible. But the server still returned 200 — it accepted the request, then silently rejected the payload format it didn't understand.

"Auth OK" did not mean "traces are arriving."

The fix was two lines:
→ Pin SDK to langfuse==2.60.0
→ Pin server image to langfuse/langfuse:2

But the real lesson was bigger than the fix: silent failures are the hardest bugs to debug — not because the problem is complex, but because there's no signal pointing you toward it. Everything looks fine. That's exactly when you should be suspicious.

Three rules I now follow for any observability or infrastructure tool:
1. Always pin both the SDK and the server to the same major version explicitly
2. "No error" is not the same as "it worked" — verify data actually arrived
3. Test with a raw HTTP call before debugging your integration layer

Building a production RAG system over Canadian financial regulations. Sharing every painful lesson along the way.

#MLEngineering #LLMOps #Observability #BuildingInPublic #Python #LangGraph
I built a RAG layer for Claude Code that cuts token usage by 80–90%.

Most devs using Claude Code don't realize they're burning tokens on files Claude doesn't need to read. Ask Claude "how does auth work?" and it reads 3 full files — 1,500+ tokens just to answer with 40 relevant lines.

I fixed that.

What I built: a local hybrid RAG system that sits between Claude and your codebase:
→ Late chunking — splits every file into overlapping 40-line windows
→ Dense retrieval — semantic search with all-MiniLM-L6-v2 (runs fully local, no API key)
→ BM25 sparse retrieval — keyword matching for exact symbol names
→ Cross-encoder reranking — picks the 3 best chunks from 20 candidates
→ File watcher — auto-rebuilds the index within 2 seconds of any file save

Claude Code reads the CLAUDE.md and knows to run the pip-installed retrieval tool before opening any file. It gets back 3 precise snippets with file path + line range. It reads only those lines. Nothing else.

Real numbers on my Volta Engine project (76 files):
- Without RAG: 17,235 chars across 3 files for one question
- With RAG: 3,073 chars — the exact 3 chunks that matter
- 82% fewer tokens. Same answer.

The whole thing runs offline. No cloud embeddings. No API calls. Just a one-time pip install, then run it.

Stack: sentence-transformers · rank-bm25 · watchdog · Python

If you use Claude Code daily on a real codebase, this pays for itself in the first session.

DM me if you want the scripts. 🧠

#AI #ClaudeCode #RAG #DeveloperTools #Python #LLM #Productivity
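The overlapping-window chunking step is simple to sketch. This is a generic illustration with my own parameter names (the post's actual splitter may differ): each chunk keeps its (start, end) line range so the retriever can hand back an exact file path + line span.

```python
def chunk_lines(text: str, window: int = 40, stride: int = 20):
    """Split a file into overlapping line windows.

    Returns (start_line, end_line, chunk_text) tuples, 1-indexed,
    so a retriever can cite an exact line range.
    """
    lines = text.splitlines()
    chunks = []
    start = 0
    while start < len(lines):
        piece = lines[start:start + window]
        chunks.append((start + 1, start + len(piece), "\n".join(piece)))
        if start + window >= len(lines):
            break  # the last window reached the end of the file
        start += stride
    return chunks

# A 100-line file becomes 4 windows: 1–40, 21–60, 41–80, 61–100.
demo = "\n".join(f"line {i}" for i in range(1, 101))
print([(a, b) for a, b, _ in chunk_lines(demo)])
```

The 50% overlap (stride = window/2) is a common default so that a function split across a window boundary still appears whole in the neighboring chunk.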
Ever wonder why your Claude Code session suddenly burned 50k tokens in one turn? 🐱

If you use Claude Code a lot, I'm sure you've hit this wall too: your session suddenly gets expensive, context fills up unexpectedly, and you have no idea why. Was it that Bash command that searched your entire repo? The Read that loaded a 3,000-line config file? You're left guessing.

I spent the past week building CAT (Context Analyzer Terminal) to solve exactly that.

What it does:
→ Hooks silently into Claude Code sessions
→ Tracks token cost per individual tool call — Read, Bash, Grep, etc.
→ Builds rolling baselines using Welford's algorithm
→ Fires a real-time alert the moment something exceeds your normal baseline (Z-score detection)
→ Gives you a plain-English explanation of why something was expensive
→ Shows burn rate projection, cache efficiency, and overhead ratio
→ Live Rich TUI dashboard — runs entirely locally

The non-obvious engineering problem: Claude Code hooks fire tool events and token snapshots as two separate streams — neither includes the other's data. The core of CAT is a delta engine that correlates them by session ID and timestamps to compute per-call cost attribution.

Setup is 3 commands. MIT licensed. 113 tests. CI passing on macOS, Ubuntu, and Windows across Python 3.11–3.13.

🔗 GitHub: https://lnkd.in/dV69pHvs

I'm actively looking for contributors — there are curated good-first-issues ranging from one-liners to full features. If you're into Python, async systems, or developer tooling, take a look.

What token visibility features would make Claude Code more useful for you? Drop a comment — building this in public, and all feedback shapes the roadmap.

(The post closed with a Hebrew version of the same announcement — "no longer a cat in a bag!" 🐱 — covering the identical feature list, engineering problem, and call for contributors.)

#OpenSource #Python #DeveloperTools #ClaudeCode #AI #BuildInPublic
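Welford's algorithm is what makes the rolling baseline cheap: it maintains a numerically stable running mean and variance without storing any history, so a Z-score check per tool call is O(1). A minimal sketch (the names are mine, not CAT's internals):

```python
import math

class RollingBaseline:
    """Online mean/variance via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x: float) -> float:
        if self.n < 2:
            return 0.0  # not enough data for a baseline yet
        std = math.sqrt(self.m2 / (self.n - 1))  # sample std deviation
        return 0.0 if std == 0 else (x - self.mean) / std

# Feed normal tool-call costs, then flag an outlier.
baseline = RollingBaseline()
for cost in [100, 120, 95, 110, 105]:
    baseline.update(cost)
print(baseline.zscore(5000) > 3)  # True — a 5k-token call is far outside baseline
```

The naive alternative (keeping every observation and recomputing variance) both grows memory per session and loses precision when the mean is large relative to the spread; Welford's update avoids both.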
One fine morning, a customer reported: "File upload sometimes fails…"

Not always. Not consistently. Just sometimes. 😄 And of course, those are the best bugs.

👉 System handles 1000+ uploads daily
👉 Issue happens randomly (10–20 times)
👉 Chunk upload + merge logic (unchanged for years)
👉 Stateless architecture (or so I thought…)

I jumped into debugging mode. After hours of checking:
NFS configs ✅
Multi-server behavior ✅
Retry logic ✅
Logs (100 times) ✅

Observation: chunks uploaded from Server A were not visible on Server B immediately (10–15 sec delay). Confusion level: 🔥🔥🔥

Then I did something simple (and often ignored)…
👉 Compared old vs new code

Guess what changed? Just one line removed (thanks to Sonar cleanup 😅):

HttpSession session = request.getSession();

And that innocent line was silently adding JSESSIONID, making requests sticky and hiding the real problem all along.

💡 So for years, reality was something like this: a stateless system… except when the upload API enters the chat 😄 Or simply: stateless most of the time, secretly stateful during uploads 🎭

And the moment I removed an "unused variable"…
💥 Load balancing started behaving correctly
💥 NFS delays became visible
💥 Hidden dependency got exposed
💥 Bug said: Hello 👋 I was always here

And the best realization:
👉 My application is perfectly stateless…
👉 Until the user hits the upload API and boom, it becomes emotional (stateful) 🤣

Lesson learned: sometimes the bug is not in new code… it's in removing the wrong old code 😄 And sometimes… your system isn't broken, your assumptions are.

Still one mystery remains:
👉 Why exactly NFS behaved that way (never got a perfect answer 😅)

#BackendStories #ProductionIssues #Java #NFS
Day 5: Background Tasks, WebSockets & Async HTTP

Most APIs are request-response machines. Client asks. Server answers. Done. Day 5 of the Starlette series breaks that model entirely.

We add three patterns that move a backend from "handles requests" to "does work":
→ Background tasks — the database write returns a 201 immediately; the external sync happens after the client is already gone
→ Async outbound HTTP with HTTPX — non-blocking calls to external services that don't stall the event loop
→ WebSocket broadcasting — connected clients receive a live update when the background job completes, without polling

The flow that ties it all together: client POSTs a task → gets 201 immediately → background sync fires → WebSocket clients receive the update.

The client never waits on the external service. If the sync fails, the record is already in the database. That's the architecture: write fast, respond fast, sync after.

After five days: routing, a real database with migrations, token auth, middleware, CORS, async HTTP, background work, and live WebSocket updates. That's not a tutorial toy. That's a foundation you can build a real frontend on.

Read the full walkthrough → link in the comments

#Python #Starlette #WebSockets #BackendDevelopment #APIDevelopment
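The "write fast, respond fast, sync after" ordering can be shown with plain asyncio. This is a stdlib sketch of the pattern only, not the series' actual Starlette code (which would use `BackgroundTask` and a response object); all names here are illustrative.

```python
import asyncio

async def external_sync(record: dict, log: list) -> None:
    """Stand-in for the slow external call the client never waits on."""
    await asyncio.sleep(0.05)
    log.append(("synced", record["id"]))

async def create_task_endpoint(payload: dict, db: list, log: list):
    # 1. Persist first — if the sync later fails, the record already exists.
    record = {"id": len(db) + 1, **payload}
    db.append(record)
    # 2. Schedule the sync without awaiting it; the response goes out now.
    asyncio.create_task(external_sync(record, log))
    return 201, record  # the "201 immediately" step

async def main():
    db, log = [], []
    status, record = await create_task_endpoint({"title": "demo"}, db, log)
    print(status, log)        # 201 [] — responded before the sync ran
    await asyncio.sleep(0.1)  # give the background task time to finish
    print(log)                # [('synced', 1)]

asyncio.run(main())
```

In real Starlette the framework keeps the background task alive for you; with bare `asyncio.create_task` you would also want to hold a reference to the task so it isn't garbage-collected mid-flight.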
Day 30/100: Making Software Robust with Error Handling & JSON!

Today was all about building resilient applications. I moved beyond simple text files and dived into structured data management and exception handling.

Key technical takeaways:
• Exception handling: mastering try, except, else, and finally blocks to prevent the app from crashing when unexpected errors occur.
• JSON data management: transitioning from .txt to .json for a more structured, nested data format. Learned how to write, update, and read JSON using the json library.
• Search functionality: added a "Search" feature to the Password Manager, allowing the app to find and display stored credentials with a single click.
• User experience: handling cases where a user searches for a website that doesn't exist in the database yet.

Handling errors and structured data is what separates a "script" from a "professional application." Feeling more confident in building production-ready code!

Check out my upgraded Password Manager here: https://lnkd.in/ghRt6Gtk

#Python #JSON #ErrorHandling #SoftwareEngineering #100DaysOfCode #VSCode #CleanCode
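The takeaways above combine naturally in one function. A minimal sketch of the pattern, not the actual Password Manager code (file name and structure are illustrative): a missing data file and a missing entry are handled as two separate, non-crashing cases.

```python
import json

def find_credentials(path: str, website: str):
    """Look up a site in a JSON store without crashing on missing data."""
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        return None  # no data file yet — nothing has been saved
    else:
        # Runs only if the file opened and parsed cleanly.
        return data.get(website)  # None when the site isn't stored
    finally:
        # Always runs — a natural place for cleanup or logging.
        pass

# Build a tiny store, then query it.
with open("vault.json", "w") as f:
    json.dump({"example.com": {"user": "amy", "password": "s3cret"}}, f)

print(find_credentials("vault.json", "example.com"))
print(find_credentials("vault.json", "missing.com"))  # None
```

`dict.get` is what handles the "website not in the database yet" case: it returns `None` instead of raising `KeyError`, so the UI can show a friendly message.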
An API that anyone can call isn't really an API — it's an open database. Day 4 of the Starlette series is about fixing that.

We add a full middleware stack: token-based authentication, custom request logging, and CORS support so a browser frontend can actually talk to the API. No extra packages — everything is built into Starlette.

The most important thing I covered is the middleware execution order, and why it matters:
→ CORS runs first — browsers send a preflight before the real request; if auth runs before CORS, a 401 with no CORS headers looks like a network failure
→ Logging wraps auth — so you can see the authenticated result in every log line
→ Auth resolves last — populating request.user before your handler ever runs

The auth system itself has a nuance worth understanding: a missing token and a bad token are two different failure modes. Missing = anonymous request (handled in the route). Bad = AuthenticationError (handled by middleware before the route runs).

The hardcoded token here is intentionally a scaffold. The pipeline — a custom AuthenticationBackend, AuthenticationMiddleware, and request.user checks in handlers — is exactly what you'd keep when moving to JWTs or OAuth. You'd only swap the validation logic inside backend.py.

The layered architecture from Days 2 and 3 continues to pay off. The routes barely changed — they just gained a require_authentication() call at the top.

Read the full walkthrough → link in the comments

#Python #Starlette #WebDevelopment #BackendDevelopment #APIDevelopment
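The missing-vs-bad-token distinction can be captured in a few lines. A generic sketch of the decision, not the series' actual `AuthenticationBackend` (function and token values are illustrative):

```python
from typing import Optional

class AuthenticationError(Exception):
    """A present-but-invalid token: rejected by middleware before any route runs."""

VALID_TOKENS = {"secret-token"}  # hardcoded scaffold, as in the post

def resolve_user(authorization: Optional[str]) -> str:
    # No header at all: an anonymous request. The route decides what
    # an anonymous user is allowed to see.
    if authorization is None:
        return "anonymous"
    # Header present but malformed or wrong: a hard failure.
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or token not in VALID_TOKENS:
        raise AuthenticationError("invalid token")
    return "authenticated"

print(resolve_user(None))                   # anonymous
print(resolve_user("Bearer secret-token"))  # authenticated
```

Keeping the two outcomes distinct is what lets public endpoints serve anonymous traffic while a typo'd token still fails loudly instead of silently downgrading to anonymous.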
Day 99: Square Root Decomposition & Prefix Multiplications ⚡

Problem 3655: XOR After Range Multiplication Queries II

Yesterday's brute-force approach hit a wall today with a TLE (Time Limit Exceeded). The constraints were significantly tighter, requiring a more sophisticated optimization.

The strategy: square root decomposition. To handle the queries efficiently, I split the problem based on the step size k:
• Large steps (k ≥ √N): for large gaps, the number of updates is small enough that direct simulation still works within time limits.
• Small steps (k < √N): this is where the magic happens. For small k, I used a difference-array technique modified for multiplications.
• Modular inverse & prefix products: instead of updating every index, I marked the start (L) and end (R) of the range. I used modInverse to "cancel out" the multiplication after the range ended. A final prefix-product pass (jumping by k) applied all updates in O(N) time.

Technical highlights:
• Fermat's little theorem: used modPow(x, MOD - 2) to calculate the modular inverse for division.
• Complexity: reduced the worst-case runtime from O(Q·N) to O((Q+N)√N).

One day away from 100, but the focus remains on the problem in front of me. Consistency isn't about the destination; it's about the quality of the journey. 🚀

#LeetCode #Java #Algorithms #DataStructures #SquareRootDecomposition #DailyCode
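The modular-inverse trick rests on Fermat's little theorem: for a prime p and a not divisible by p, a^(p-2) ≡ a^(-1) (mod p). A generic Python illustration of the same idea (the post's solution is in Java; `mod_inverse` here plays the role of its modPow-based modInverse):

```python
MOD = 1_000_000_007  # prime, so Fermat's little theorem applies

def mod_inverse(a: int) -> int:
    """a^(MOD-2) mod MOD equals a^(-1) mod MOD, by Fermat's little theorem."""
    return pow(a, MOD - 2, MOD)  # built-in three-argument pow: fast modexp

# "Cancel out" a multiplication after a range ends, with no division:
x = 123456
in_range = (x * 7) % MOD                      # range applies a *7
after_range = (in_range * mod_inverse(7)) % MOD  # inverse undoes it
print(after_range == x)  # True
```

This is exactly what makes the multiplicative difference array work: mark `*m` at L, mark `*m^(-1)` after R, and a single prefix-product pass applies every range in O(N).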
New research drop alert 🚨

When Replicator NPM turned a JSON blob into an RCE. Let's call it… Replican't.

CVE-2026-2265 - a deserialization vulnerability in replicator (~1M monthly downloads). The library lets attacker-controlled input choose which constructor to call and hands you arbitrary code execution.

The kicker? This package already had a deserialization CVE in 2021. Different researcher, different angle, different bug. Same trust assumption sitting there for years.

Full writeup: https://lnkd.in/ebZCkrvB

#cve-2025-2265 #cve #replicator #npm #supply_chain
🧠 LeetCode POTD — The Bug Wasn't Logic… It Was Leading Zeros

3761. Minimum Absolute Distance Between Mirror Pairs

At first glance, this problem looked simple. Find two indices (i, j) such that:
👉 reverse(nums[i]) == nums[j]
and return the minimum distance.

My first instinct was straightforward:
👉 Store all numbers in a map
👉 Reverse the current number
👉 Check if it already exists

Simple enough.

💥 But then one small edge case caused issues: leading zeros.

Example: 120 → 21, not 021. So if you think in strings, it's easy to make mistakes.

💡 The cleaner approach: instead of storing original numbers first,
👉 Reverse each number mathematically
👉 Store the reversed value with its latest index
👉 If the current number already exists in the map, we found a mirror pair

Why this works: if we process 120, we store 21. Later, when 21 appears, we instantly know it matches.

📌 Best part: mathematical reversal automatically handles leading zeros.
120 → 21
300 → 3
101 → 101
No extra checks needed.

💡 What I liked about this problem: the challenge wasn't data structures. It was noticing that a small representation detail changes the whole solution. Sometimes bugs are not in algorithms. They're hidden inside edge cases.

Curious — did anyone else first think of using strings here? 👀

#LeetCode #ProblemSolving #HashMap #SoftwareEngineering #DSA #SDE #Java #C++
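The mathematical reversal is the whole trick. A generic Python sketch of that step (the post's solution is in Java/C++): because each digit is peeled off with mod/divide arithmetic, trailing zeros simply disappear, so no leading-zero handling is needed.

```python
def reverse_number(n: int) -> int:
    """Reverse a non-negative integer digit by digit.

    Trailing zeros vanish naturally (120 -> 21), which is exactly
    why this beats string reversal for the leading-zeros edge case.
    """
    result = 0
    while n > 0:
        result = result * 10 + n % 10  # append the last digit of n
        n //= 10                        # drop that digit from n
    return result

print(reverse_number(120))  # 21
print(reverse_number(300))  # 3
print(reverse_number(101))  # 101
```

Compare with the string version: `int(str(120)[::-1])` gives `int("021")`, which happens to work in Python but is exactly the representation trap that bites in Java or C++ when the reversed value is compared as a string.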