Idempotency in Distributed Systems: A Common Bug and Easy Fix

One of the most common distributed-systems bugs I've seen across different teams is the duplicate-charge or duplicate-action problem. It usually comes down to a missing idempotency key.

The pattern: a client retries a request after a timeout. The original request had actually succeeded - the response just never made it back. Result: two charges, one unhappy user, one urgent on-call ticket.

Idempotency in REST APIs is one of those topics that sounds straightforward until it surfaces in production. The mental model I use:

GET, PUT, DELETE → idempotent by HTTP spec. Sending the same request several times leaves the server in the same state as sending it once, even if the response differs (a second DELETE may return 404).
POST → not idempotent by default. This is where most idempotency bugs come from.

The fix isn't complicated. For any POST that creates or modifies state, accept an Idempotency-Key header from the client. Store the key and the response in a short-lived cache (Redis works well, with a TTL of a few hours to a day). On a retry with the same key, return the cached response instead of re-processing.

Three things that often get missed:
1. The key has to be generated by the client, not the server. Server-generated keys defeat the purpose: a retry must carry the same key as the original attempt, and only the client knows the two are the same operation.
2. Cache the response, not just a "this key was used" flag. Otherwise the retry gets a different shape than the original.
3. Scope keys per endpoint or per user. In a global key space, two unrelated requests that happen to share a key collide, and one of them gets the other's response.

Idempotency is one of the lowest-cost, highest-value patterns you can build into a distributed system. Easy to skip during initial design. Painful to retrofit after an incident.

What's your team's pattern for this - header-based, request-hash-based, or something else?

#Java #SpringBoot #Microservices #DistributedSystems #BackendEngineering #SoftwareEngineering #APIDesign #TechLeadership
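The store-and-replay flow described in the post can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the author's implementation: an in-memory `ConcurrentHashMap` stands in for Redis, and the class name `IdempotentResponses`, the `handle` method, and the `user|endpoint|key` scoping format are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** In-memory stand-in (assumption) for a Redis cache of responses keyed by idempotency key. */
class IdempotentResponses {
    /** A stored response and the moment it stops being valid. */
    private static final class Entry {
        final String response;
        final Instant expiresAt;

        Entry(String response, Instant expiresAt) {
            this.response = response;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl;

    IdempotentResponses(Duration ttl) {
        this.ttl = ttl;
    }

    /** Runs the handler at most once per (user, endpoint, key) within the TTL; retries get the stored response. */
    String handle(String userId, String endpoint, String idempotencyKey, Supplier<String> handler) {
        // Scope the client-supplied key so different users or endpoints cannot collide.
        // Simplification: assumes the parts themselves never contain '|'.
        String scopedKey = userId + "|" + endpoint + "|" + idempotencyKey;
        Instant now = Instant.now();
        // compute() is atomic per key, so two concurrent retries cannot both run the handler.
        // Simplification: the handler runs while the map entry is locked.
        Entry entry = cache.compute(scopedKey, (key, existing) -> {
            if (existing != null && existing.expiresAt.isAfter(now)) {
                return existing; // retry: reuse the cached response
            }
            return new Entry(handler.get(), now.plus(ttl)); // first attempt or expired entry
        });
        return entry.response;
    }

    public static void main(String[] args) {
        IdempotentResponses responses = new IdempotentResponses(Duration.ofHours(24));
        AtomicInteger charges = new AtomicInteger();
        Supplier<String> charge = () -> "charge-" + charges.incrementAndGet();

        String first = responses.handle("user-1", "POST /payments", "key-abc", charge);
        String retry = responses.handle("user-1", "POST /payments", "key-abc", charge);
        System.out.println(first + " / " + retry + " / handler runs: " + charges.get());
    }
}
```

With a real Redis the same shape would likely use an atomic set-if-absent with an expiry rather than a local map. Expired entries here are only replaced on reuse, never evicted, which is fine for a sketch but not for production.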


More Relevant Posts

Aliaksei Taliuk
3w
🚀 Excited to share that JsonApi4j 1.5.0 is now live! 👉 https://lnkd.in/eSgd4sbs This release adds built-in caching to the Compound Documents Resolver - fewer downstream calls, faster responses for repeated include queries. How it works: → The resolver caches individual resources by resource type + id → On the next request, only cache misses trigger HTTP calls → Cache entries respect standard HTTP Cache-Control headers (max-age, no-store, etc.) → The final compound document response carries an aggregated Cache-Control header - the most restrictive directive across all included resources In-Memory Cache implementation is enabled by default - zero configuration needed. For distributed deployments, you can plug in your own implementation via the SPI, for example one for Redis. --- JsonApi4j is an open-source framework for building APIs aligned with the JSON:API specification, with a strong focus on developer productivity and clean architecture. If you're looking for a structured and flexible way to expose JSON:API endpoints — give it a try. Feedback and contributions are always welcome! 🙌 #java #jsonapi #opensource #api #caching
Prince Kumar
1mo
Let me test you 👇 👉 What happens if your database goes down during a transaction? Most answers: ❌ “It fails” ❌ “Retry” But in real systems, you need: ✔ Transaction management ✔ Retry + idempotency ✔ Message queues ✔ Consistency handling This is exactly the kind of real interview thinking I covered in my PDF. 📘 Includes: System Design Spring Boot internals Redis caching Microservices patterns 👉 Link: https://lnkd.in/gbF6u9ni 🎁 Use JAVA25 for 15% OFF. Think like a backend engineer. Not just a coder. #SoftwareEngineering #Backend #JavaDeveloper
Jubril Tayo
2w
Most backend systems don’t fail because of bugs. They fail because they can’t handle pressure.

Here’s where many developers get it wrong: They confuse rate limiting with a circuit breaker. They solve completely different problems.

Rate Limiting → protects your system from users
If a user sends too many requests, you block or slow them down. This prevents abuse, spam, and overload.

Circuit Breaker → protects your system from other services
If a dependency starts failing, you stop calling it temporarily. This prevents cascading failures and wasted retries.

In a recent email microservice I built:
- Rate limiting stopped users from sending too many emails
- Circuit breaker stopped the system from repeatedly calling a failing service

Both worked together to keep the system stable under stress. If you only use one, you’re still exposed.

Which one have you actually implemented before?

#BackendEngineering #SoftwareEngineering #SystemDesign #Microservices #DistributedSystems #Python #Django #Redis #RabbitMQ #Tech
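To make the contrast concrete, here is a minimal Java sketch of both mechanisms. It is an illustration only: the post's own service appears to be Python/Django, and the class names, the fixed-window strategy, and the injected millisecond clock are assumptions chosen to keep the example short and deterministic.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Rate limiter: caps how many calls one caller may make per fixed time window. */
class FixedWindowRateLimiter {
    private final int maxPerWindow;
    private final long windowMillis;
    private final LongSupplier clock; // injected so the behaviour is testable
    private long windowStart;
    private int count;

    FixedWindowRateLimiter(int maxPerWindow, long windowMillis, LongSupplier clock) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Returns true if the caller is still under the limit for the current window. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) { // a new window starts: reset the counter
            windowStart = now;
            count = 0;
        }
        if (count >= maxPerWindow) {
            return false; // too many requests from this caller
        }
        count++;
        return true;
    }
}

/** Circuit breaker: stops calling a failing dependency, then tries again after a cool-down. */
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private int consecutiveFailures;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Simplification: holding the lock during the call serializes all callers. */
    synchronized <T> T call(Supplier<T> dependency, Supplier<T> fallback) {
        long now = clock.getAsLong();
        if (openedAt >= 0 && now - openedAt < openMillis) {
            return fallback.get(); // open: skip the dependency entirely
        }
        try {
            T result = dependency.get(); // closed, or a trial call after the cool-down
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = now; // trip (or re-trip) the breaker
            }
            return fallback.get();
        }
    }
}
```

The limiter is keyed on who is calling you (one instance per user in practice), while the breaker is keyed on what you are calling (one instance per dependency). Production systems would more likely reach for an existing library than hand-roll either.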
Gautam Mandoliya
1mo
Seniority in #Java isn't measured in years. It's measured in #P99Latency. 📉 Your code didn’t fail. Your design assumptions did. #LocalDev assumes infinite #Threads and 0 latency, but production assumes nothing. The shift from #Local → #Scale is where "clean" logic dies because reality is harsher than any #JUnit model. In #HighTraffic systems, #GCTuning slippage and #ThreadModel flaws only appear when #ExecutionTiming starts to dictate your #SystemOutcomes. Your #ResponseTime distribution will never look like your backtest—and that’s when the real work begins. It's not about writing new features, but questioning the ones you trusted: → Is your #Concurrency model still valid? → Was it a solid edge, or just a favorable #LoadProfile? → Is your #VirtualThread implementation a solution or an illusion? #Syntax is fragile. #Architecture is everything. #Adaptation is survival. 10 Questions to test your #Scalability assumptions: #TrafficSpikes: Can your #SpringBoot service handle 10x load without a coldStart? #Java21: Are you leveraging #ProjectLoom for #NonBlocking I/O? ⚡ #Resilience4j: Do you have a #CircuitBreaker for every external #API? #ZGC: When do you pivot from #G1GC for sub-ms #PauseTimes? 📉 #NPlusOne: How do you audit #Hibernate fetches before they hit production? #Caching: When is #Caffeine better than #Redis for your #HitRate? #AsyncFlow: When does #REST become TechnicalDebt and require kafka?📦 #Tracing: Can you follow a #RequestID across 5 #Microservices? #RBAC: How do you handle #Auth without adding #SecurityLatency? 🛡️ #Idempotency: Does a double-click result in a #DuplicatePayment? 💳 Still testing. Still breaking assumptions. That’s the real side of #BackendEngineering.
Rani Dhage
1w
Want to be a great backend engineer? Learn: 1. Programming Fundamentals Language depth, data structures, algorithms, concurrency, memory basics 2. APIs and Contracts REST, gRPC basics, versioning, idempotency, pagination, error handling 3. Databases SQL, joins, indexes, transactions, query optimization, schema design 4. Caching Redis, cache invalidation, TTLs, read-through vs write-through, hot key problems 5. Async and Messaging Queues, Kafka/RabbitMQ, retries, DLQs, at-least-once delivery, idempotent consumers 6. Distributed Systems Basics Replication, partitioning, consistency, leader-follower, network failures, backpressure 7. Reliability Engineering Timeouts, retries, circuit breakers, rate limiting, graceful degradation 8. Observability Logs, metrics, tracing, p95/p99 latency, debugging production issues 9. Infrastructure and Deployment Docker, CI/CD, Linux basics, cloud services, rollbacks, blue-green/canary deploys 10. Security Auth, authorization, secrets management, input validation, encryption, secure defaults Preparing for interviews? Start revising these today 𝗜’𝘃𝗲 𝗽𝗿𝗲𝗽𝗮𝗿𝗲𝗱 𝗶𝗻 𝗗𝗲𝗽𝘁𝗵 𝗝𝗮𝘃𝗮 𝗦𝗽𝗿𝗶𝗻𝗴𝗯𝗼𝗼𝘁 𝗯𝗮𝗰𝗸𝗲𝗻𝗱 𝗚𝘂𝗶𝗱𝗲, 𝟏𝟬𝟬𝟬+ 𝗽𝗲𝗼𝗽𝗹𝗲 𝗮𝗿𝗲 𝗮𝗹𝗿𝗲𝗮𝗱𝘆 𝘂𝘀𝗶𝗻𝗴 𝗶𝘁. 𝗚𝗲𝘁 𝘁𝗵𝗲 𝗴𝘂𝗶𝗱𝗲 𝗵𝗲𝗿𝗲: https://lnkd.in/dfhsJKMj keep learning, keep sharing ! #java #backend #javaresources
Akshay Ravish
1w
We once deleted a user in production and watched half our database disappear in real time. 😅 Here's what happened: We were making live changes to our PostgreSQL database. One table was foreign-key linked to another. The moment we deleted a user record — CASCADE did exactly what it was told. All related data across linked tables: gone. Instantly. Panic mode. Everyone on a call. Hands shaking on keyboards. Then someone remembered — we had a background job quietly copying data into a separate, unlinked table. No one thought much of it at the time. That day, it saved us completely. We recovered everything. The users never knew. But that incident rewired how I think about backend engineering forever. 🧠 Here's what actually separates senior Node.js + Express engineers from the rest: 1. They don't just fix the bug — they ask why the system allowed the bug to exist in the first place. 2. They never make raw changes on a live database without a rollback plan. Never. 3. They design for failure first. Backups, redundancy, audit logs — before the happy path. 4. They understand the cost of abstractions. That ORM is hiding a CASCADE DELETE. They know it. 5. They treat Redis as a layer of safety too — cache the critical stuff, know what's ephemeral and what isn't. 6. They write code for the person debugging at 3am — who is probably future them. 7. They push back on "let's just do it quickly in prod." That's not being difficult. That's seniority. The real senior shift isn't knowing more frameworks. It's knowing that one overlooked FK constraint can ruin your entire afternoon — and building systems that survive human mistakes anyway. Senior engineers — what's your "we almost lost everything" story? Let's normalize talking about it. 
👇 #NodeJS #ExpressJS #PostgreSQL #Redis #BackendDevelopment #JavaScript #DatabaseManagement #SystemDesign #ProductionIncident #DisasterRecovery #SoftwareArchitecture #SoftwareEngineering #SeniorDeveloper #BackendEngineer #TechLeadership #Programming #CodingLife #Developer #TechCommunity #LessonsLearned #100DaysOfCode
Ruslan Mukhamadiarov
4w
𝐋𝐚𝐭𝐞𝐧𝐜𝐲 𝐢𝐬 𝐮𝐩, 𝐛𝐮𝐭 𝐧𝐨𝐭𝐡𝐢𝐧𝐠 𝐢𝐬 𝐟𝐚𝐢𝐥𝐢𝐧𝐠. 𝐖𝐡𝐞𝐫𝐞 𝐝𝐨 𝐲𝐨𝐮 𝐥𝐨𝐨𝐤 𝐟𝐢𝐫𝐬𝐭? 🔍 One of the worst production situations: Latency is growing 📈 Users feel it 😐 Logs are clean 🧼 Nothing is obviously broken ❌ Most teams waste time here. They search for errors 🔎 Restart pods 🔄 Jump between dashboards 📊 But when nothing is failing, the problem is rarely an exception. It is usually one of these: 1. 𝗦𝗰𝗼𝗽𝗲 𝗳𝗶𝗿𝘀𝘁 🎯 One endpoint or all? One instance or all? Reads, writes, or async? If you skip this, you debug the whole system instead of a slice 2. 𝗧𝗵𝗿𝗲𝗮𝗱 𝗽𝗼𝗼𝗹𝘀 🧵 Active threads, queue size, blocked threads. If all workers are busy, requests are not failing - they are waiting to run. 3. 𝗧𝗵𝗿𝗲𝗮𝗱 𝗱𝘂𝗺𝗽 📸 Look for: * repeated stack traces * WAITING / BLOCKED threads * DB connection waits * socket reads * lock contention This shows where execution is actually stuck. 4. 𝗚𝗖 𝗯𝗲𝗵𝗮𝘃𝗶𝗼𝗿 ♻️ Pause time, frequency, heap pressure. If latency spikes in waves, GC is often involved. 5. 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻 𝗽𝗼𝗼𝗹𝘀 🧩 DB, HTTP clients, Redis, broker. Exhausted pool = requests wait instead of fail. Classic “slow but no errors”. 6. 𝗤𝘂𝗲𝘂𝗲𝘀 & 𝗹𝗮𝗴 📊 Queue depth, consumer lag, retries. The system may look fine while work silently accumulates. 7. 𝗗𝗼𝘄𝗻𝘀𝘁𝗿𝗲𝗮𝗺𝘀 🌐 DB, internal services, external APIs. Your service might be slow because it is efficiently waiting on something else. The key shift: No errors does not mean no problem. ❗ It usually means the bottleneck is in waiting, saturation, contention, or backlog. Stop hunting for exceptions first. Start finding where time is spent. How do you usually localize the bottleneck first in this situation? 🤔 #backend #java #springboot #observability #performance #distributedsystems #productionengineering
Astha Murugesan
1w
👋 Hello Connections, Recently, while working on a microservices feature, everything looked perfectly fine… until one downstream API started failing. At first, I handled it using Fallback. 👉 If the service fails → return a default response. It worked… but something felt off. All failures looked the same. Whether it was a not found, server error, or timeout — the response didn’t change. Debugging became harder, and the real issue stayed hidden. That’s when I explored FallbackFactory. With it, I could access the actual exception and handle failures more intelligently: Return meaningful responses based on the error Log the exact issue Improve overall system visibility 💡 What I learned Handling failure is not just about giving a backup response — it’s about understanding what failed and why. Fallback → quick and simple FallbackFactory → controlled and insightful ⚙️ This approach, combined with Spring Boot and Resilience4j, helps build more resilient microservices. ✨ More such learnings coming soon — stay tuned! #Microservices #Java #SpringBoot #BackendDevelopment #SoftwareEngineering
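The difference between a blind fallback and a cause-aware one can be shown without any framework. The sketch below is plain Java, not the Feign or Resilience4j API; `CauseAwareFallback`, `callWithFallback`, and `describeFailure` are hypothetical names, and the mapping from exception type to response is only an assumed example.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

class CauseAwareFallback {
    /** Runs the call; on failure, hands the actual exception to the fallback function. */
    static <T> T callWithFallback(Callable<T> call, Function<Exception, T> fallbackFor) {
        try {
            return call.call();
        } catch (Exception cause) {
            // Log the exact failure so it is not hidden behind a generic default.
            System.err.println("downstream call failed: " + cause);
            return fallbackFor.apply(cause);
        }
    }

    /** Hypothetical mapping from failure type to a meaningful response. */
    static String describeFailure(Exception cause) {
        if (cause instanceof NoSuchElementException) {
            return "NOT_FOUND";
        }
        if (cause instanceof TimeoutException) {
            return "TIMEOUT";
        }
        return "UNAVAILABLE";
    }

    public static void main(String[] args) {
        String result = CauseAwareFallback.<String>callWithFallback(
                () -> { throw new TimeoutException("downstream took too long"); },
                CauseAwareFallback::describeFailure);
        System.out.println(result);
    }
}
```

A plain fallback corresponds to a function that ignores its argument and always returns the same default, which is why every failure looks identical to the caller.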
Kanika Gosain
2w
Day 8: The Idempotency Bug

User clicks “Pay”… nothing happens… clicks again.
And suddenly: 💸 Payment deducted twice.

⸻

The Setup

    public void processPayment(PaymentRequest request) {
        paymentGateway.charge(request);
        orderService.markAsPaid(request.getOrderId());
    }

Looks correct. Works perfectly in testing.

⸻

The Problem

In real-world systems:
• Network retries
• Client timeouts
• Double clicks
• Gateway retry mechanisms

👉 Same request gets processed multiple times

Result:
❌ Duplicate payments
❌ Duplicate orders
❌ Broken trust

⸻

✅ The Fix: Idempotency Key

Every request must be uniquely identifiable.

    if (paymentRepository.existsByIdempotencyKey(request.getKey())) {
        return; // already processed
    }
    paymentGateway.charge(request);
    paymentRepository.save(request.getKey());

🔥 Production-Grade Approach
• Generate Idempotency-Key (UUID) per request
• Store with status (PENDING / SUCCESS / FAILED)
• Add unique constraint in DB
• Return same response for repeated requests
• Use Redis for fast lookup (optional)

⸻

🧠 Senior Insight

Idempotency is not just backend logic. It’s a contract between client + server.

⸻

🎯 The Lesson

Retries are unavoidable in distributed systems. Duplicate execution is not.

⸻

If your system handles payments, orders, inventory…
👉 Idempotency is a MUST.

⸻

#BackendDevelopment #Java #SpringBoot #Microservices #SystemDesign #Fintech #DistributedSystems #APIDesign
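One caveat on the check-then-save snippet in the post: two requests arriving at the same moment can both pass the `existsByIdempotencyKey` check before either saves, so both still charge. The listed production-grade steps (status tracking plus a unique constraint) close that gap. The sketch below shows the idea in plain Java under stated assumptions: `ConcurrentHashMap.putIfAbsent` stands in for the database unique constraint, and `PaymentDeduplicator`, `Attempt`, and the `"IN_PROGRESS"` / `"PAYMENT_FAILED"` responses are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class PaymentDeduplicator {
    enum Status { PENDING, SUCCESS, FAILED }

    /** What is stored per idempotency key: the status and the response to replay. */
    private static final class Attempt {
        final Status status;
        final String response;

        Attempt(Status status, String response) {
            this.status = status;
            this.response = response;
        }
    }

    private final Map<String, Attempt> attempts = new ConcurrentHashMap<>();

    String process(String idempotencyKey, Supplier<String> charge) {
        // Claim the key atomically BEFORE charging; only one caller can win this insert.
        Attempt existing = attempts.putIfAbsent(idempotencyKey, new Attempt(Status.PENDING, null));
        if (existing != null) {
            if (existing.status == Status.PENDING) {
                return "IN_PROGRESS"; // the original attempt is still running
            }
            return existing.response; // replay the stored outcome, success or failure
        }
        try {
            String response = charge.get();
            attempts.put(idempotencyKey, new Attempt(Status.SUCCESS, response));
            return response;
        } catch (RuntimeException e) {
            // Design choice (assumption): record the failure so retries see the same result.
            attempts.put(idempotencyKey, new Attempt(Status.FAILED, "PAYMENT_FAILED"));
            return "PAYMENT_FAILED";
        }
    }

    public static void main(String[] args) {
        PaymentDeduplicator payments = new PaymentDeduplicator();
        AtomicInteger charges = new AtomicInteger();
        Supplier<String> charge = () -> "paid-" + charges.incrementAndGet();

        String first = payments.process("key-123", charge);
        String doubleClick = payments.process("key-123", charge);
        System.out.println(first + " / " + doubleClick + " / charges: " + charges.get());
    }
}
```

Whether a FAILED key should be retryable with the same key is a policy decision; some systems release the key on failure instead of replaying the error.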
Eswar Adithya Yadav
4w Edited
🚨 Most Developers Don't Realize This in Spring Boot... Everything works fine in the beginning. But as your project grows: ⚠ APIs slow down ⚠ Code becomes messy ⚠ Debugging becomes painful Here are some mistakes I’ve seen (and personally faced): ❌ Writing business logic inside controllers ❌ Ignoring database performance (no indexing, no pagination) ❌ Poor layering structure ❌ No proper logging or exception handling What actually helped me improve: ✅ Clean architecture (Controller → Service → Repository) ✅ Constructor-based dependency injection ✅ Query optimization + pagination ✅ Using Elasticsearch for fast search ✅ Writing scalable and maintainable APIs 💡 Biggest lesson: Backend development is not just about writing APIs — it's about designing systems that scale. Have you faced any of these issues in real projects?.. #SpringBoot #JavaDeveloper #BackendDevelopment #Microservices #SoftwareEngineering #CleanCode #Java #TechCareers #DevelopersLife #CodingJourney #Elasticsearch #PostgreSQL #API #SystemDesign #LearningInPublic #LinkedInTech
