Retries Can Amplify Failures in Distributed Systems

Yesterday, I shared insights on blue-green deployment. Today, I want to highlight a small shift in thinking that transformed how I design backend systems: Retries don’t fix failures; they can amplify them.

Early in my career, my instinct was straightforward: “If a request fails, just retry.” However, in distributed systems, this approach can quietly destabilize your system.

Here’s what actually occurs:
- A downstream service slows down
- Upstream services start retrying
- Traffic multiplies
- Queues grow
- Latency spikes
- Everything starts timing out

Instead of recovering, your system begins to spiral.

What changed for me was recognizing retries as a design decision rather than merely a code pattern. In Java-based microservices, I now focus on:
- Timeouts define boundaries
- Retries must be intentional, not default
- Backoff spreads load over time
- Jitter prevents synchronized spikes
- Circuit breakers protect failing dependencies
- Idempotency makes retries safe for writes

The goal is not to “make every request succeed.” The goal is to protect the system when things go wrong. This shift in mindset distinguishes code that works from systems that thrive in production.

#BackendEngineering #Java #DistributedSystems #SystemDesign #Microservices #ResilienceEngineering #Scalability #CloudNative #SoftwareEngineering #TechCareers
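
To make the backoff and jitter points concrete, here is a minimal Java sketch, not a definitive implementation. The class name BoundedRetry, its method names, and the delay values are assumptions made for illustration; none of them come from a specific library. It caps the number of attempts, doubles the delay ceiling after each failure, and sleeps for a random time below that ceiling (often called "full jitter") so that many clients do not retry in lockstep.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration only; not taken from any library.
final class BoundedRetry {
    private BoundedRetry() {}

    // Delay ceiling before retry number `attempt` (1-based): base * 2^(attempt-1), capped at maxMillis.
    static long backoffCapMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(attempt - 1, 20); // clamp the exponent so the shift stays small
        return Math.min(maxMillis, baseMillis << shift);
    }

    // Full jitter: a uniformly random delay between 0 and the ceiling, inclusive.
    static long jitteredDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long cap = backoffCapMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }

    // Runs op at most maxAttempts times, sleeping a jittered delay between failures.
    // Assumption: op is idempotent, so repeating it is safe.
    static <T> T call(Callable<T> op, int maxAttempts, Duration base, Duration max) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // retry budget exhausted: surface the last failure
                }
                Thread.sleep(jitteredDelayMillis(attempt, base.toMillis(), max.toMillis()));
            }
        }
        throw last;
    }
}
```

Wrapping a call might look like BoundedRetry.call(() -> client.fetch(id), 3, Duration.ofMillis(100), Duration.ofSeconds(2)), where client.fetch is a hypothetical idempotent read. A production version would likely retry only on errors known to be transient, and would sit behind a timeout and a circuit breaker rather than replace them.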

More Relevant Posts

Ahmad Tabash
3w
🏗️ Microservices vs Monolith: Which Should You Choose?

This is one of the most debated topics in software architecture. Here's my honest take after working on both.

🔵 Choose Microservices when:
📈 You need to scale independently. Scale only the services under heavy load, not the entire application.
👥 You have large teams. Each squad owns, deploys, and maintains their own service. No stepping on each other's toes.
🔧 You need different tech per service. Use C# for one service, Node.js for another: best tool for each job.
❌ Downside: High complexity. Network calls, distributed tracing, and a heavy DevOps burden. Not for the faint-hearted.

🟣 Choose Monolith when:
🚀 You're at an early stage. Ship fast, validate your idea, then optimize. Don't over-engineer before you have users.
👤 You have a small team. Less infrastructure means more focus on building features that matter.
🐛 You want easier debugging. Everything is in one place, with no distributed complexity to trace through.
❌ Downside: One bottleneck slows the entire application. Harder to scale as you grow.

💡 My take: Start with a well-structured Monolith. Move to Microservices only when you have a real scaling problem, not an imaginary one.

Which architecture do you use at work?

#SoftwareArchitecture #Microservices #BackendDevelopment #CSharp #DotNet #SoftwareEngineering #Programming
Krishna Porje
3w
Containers solve "it works on my machine," yet often create *new* developer headaches.

Containerization promises unparalleled consistency from dev to production. But the dream of "local-prod parity" quickly crumbles if the local setup is slow, complex, or different from production. Developers spend precious hours debugging environment issues instead of building features, which slows the entire release cycle.

* Design your `docker-compose` for local services to closely mirror production architecture for true parity.
* Optimize Dockerfile build stages and layer caching rigorously for fast local rebuilds. Skip unnecessary steps.
* Integrate essential developer-friendly tools and debugging utilities directly into your dev containers. Think debuggers, linters, hot-reloading.

A frictionless containerized dev environment translates directly to faster feature delivery and happier engineers.

What's your top tip for maximizing developer productivity with containers?

#Containerization #DeveloperExperience #DevOps #Productivity #Docker
Vishu Kalier
3w
Most systems don’t fail because of complexity; they fail because of inconsistency. When every API speaks a different language, debugging becomes guesswork and scaling becomes chaos.

In microservices architectures used by Netflix, Amazon and many more, response standardization is a foundational design decision, not just a coding preference. As shown in the architecture, each endpoint returns a common base response while extending it for specific needs. This ensures uniform communication across layers without sacrificing flexibility.

Here’s how standardization is achieved and why it matters:
• Define a base response model (e.g., success flag, message) shared across all endpoints
• Extend it using inheritance or composition to include endpoint-specific data (userID, conversationID, lists)
• Enforce consistent response structure at the endpoint layer, regardless of internal logic
• Separate concerns by keeping response shaping independent from business logic

It’s not just about consistent responses and SOLID principles. The benefits are substantial, and they make complex systems simple for end users (abstraction at scale):
• Predictable API contracts → easier frontend integration
• Faster debugging → uniform error handling and logs
• Reduced duplication → centralized response structure
• Scalability → new features plug into an existing contract seamlessly

In essence, standardized responses act as a contract of trust between services and consumers, enabling systems to evolve without breaking.

How do you ensure consistency in your APIs as systems grow in complexity? Let’s talk about how you standardize API designs.

Follow Vishu Kalier for more architectural deep dives on System Design and real-world systems.

#SystemDesign #Microservices #BackendEngineering #APIDesign #SpringBoot #SoftwareArchitecture #ScalableSystems #Java #DesignPatterns
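
The base-plus-extension idea described above can be sketched in plain Java. This is one possible shape under stated assumptions, not the architecture from the post: the class names (BaseResponse, UserResponse, ConversationListResponse), the field names, and the factory methods are all hypothetical and chosen only to mirror the "success flag, message" and "userID, lists" examples.

```java
import java.util.List;

// Hypothetical base model: every endpoint response carries these two fields.
class BaseResponse {
    private final boolean success;
    private final String message;

    protected BaseResponse(boolean success, String message) {
        this.success = success;
        this.message = message;
    }

    public boolean isSuccess() { return success; }
    public String getMessage() { return message; }
}

// Endpoint-specific extension via inheritance: adds a user identifier.
final class UserResponse extends BaseResponse {
    private final String userId;

    private UserResponse(boolean success, String message, String userId) {
        super(success, message);
        this.userId = userId;
    }

    static UserResponse ok(String userId) {
        return new UserResponse(true, "OK", userId);
    }

    static UserResponse failed(String reason) {
        return new UserResponse(false, reason, null);
    }

    public String getUserId() { return userId; }
}

// Another extension: a list payload on top of the same base contract.
final class ConversationListResponse extends BaseResponse {
    private final List<String> conversationIds;

    private ConversationListResponse(List<String> conversationIds) {
        super(true, "OK");
        this.conversationIds = List.copyOf(conversationIds); // defensive, immutable copy
    }

    static ConversationListResponse of(List<String> conversationIds) {
        return new ConversationListResponse(conversationIds);
    }

    public List<String> getConversationIds() { return conversationIds; }
}
```

A client can then check isSuccess() and getMessage() on any response without knowing which endpoint produced it, and read the extra fields only where it needs them. Composition (a generic wrapper holding a payload) would be an equally reasonable choice here.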

VENKATA JAINENDRABABU VANAMALA
2w
🚨 𝐌𝐲 𝐊𝐮𝐛𝐞𝐫𝐧𝐞𝐭𝐞𝐬 𝐩𝐨𝐝 𝐤𝐞𝐩𝐭 𝐫𝐞𝐬𝐭𝐚𝐫𝐭𝐢𝐧𝐠 𝐞𝐯𝐞𝐫𝐲 𝐟𝐞𝐰 𝐡𝐨𝐮𝐫𝐬… 𝐚𝐧𝐝 𝐈 𝐡𝐚𝐝 𝐧𝐨 𝐜𝐥𝐮𝐞 𝐰𝐡𝐲.

No errors in the logs. No crash messages. Everything looked normal. Still… the pod kept disappearing.

𝐎𝐮𝐭 𝐨𝐟 𝐜𝐮𝐫𝐢𝐨𝐬𝐢𝐭𝐲, 𝐈 𝐫𝐚𝐧:

    kubectl describe pod <pod-name>

And found the real reason:

💥 𝐎𝐎𝐌𝐊𝐢𝐥𝐥𝐞𝐝 (𝐄𝐱𝐢𝐭 𝐂𝐨𝐝𝐞 137)

That’s when it hit me: the application wasn’t crashing… Kubernetes was killing it due to memory exhaustion.

𝐇𝐞𝐫𝐞’𝐬 𝐰𝐡𝐚𝐭 𝐈 𝐢𝐝𝐞𝐧𝐭𝐢𝐟𝐢𝐞𝐝 👇

1️⃣ 𝐍𝐨 𝐦𝐞𝐦𝐨𝐫𝐲 𝐥𝐢𝐦𝐢𝐭𝐬 𝐝𝐞𝐟𝐢𝐧𝐞𝐝
The pod was allowed to consume unlimited memory. Eventually, it exhausted the node’s memory and got terminated.
👉 𝐅𝐢𝐱: 𝐀𝐥𝐰𝐚𝐲𝐬 𝐝𝐞𝐟𝐢𝐧𝐞 𝐫𝐞𝐬𝐨𝐮𝐫𝐜𝐞 𝐫𝐞𝐪𝐮𝐞𝐬𝐭𝐬 𝐚𝐧𝐝 𝐥𝐢𝐦𝐢𝐭𝐬

    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"

2️⃣ 𝐉𝐕𝐌 𝐰𝐚𝐬 𝐧𝐨𝐭 𝐜𝐨𝐧𝐭𝐚𝐢𝐧𝐞𝐫-𝐚𝐰𝐚𝐫𝐞
The Java application calculated heap size based on the node’s total memory, not the container limit.
👉 𝐅𝐢𝐱: 𝐓𝐮𝐧𝐞 𝐉𝐕𝐌 𝐟𝐨𝐫 𝐜𝐨𝐧𝐭𝐚𝐢𝐧𝐞𝐫 𝐞𝐧𝐯𝐢𝐫𝐨𝐧𝐦𝐞𝐧𝐭𝐬

    -XX:+UseContainerSupport
    -XX:MaxRAMPercentage=75.0

3️⃣ 𝐌𝐞𝐦𝐨𝐫𝐲 𝐥𝐞𝐚𝐤 𝐢𝐧 𝐭𝐡𝐞 𝐚𝐩𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧
Even after setting limits, memory usage kept increasing over time. Root cause: A background process was holding objects and not releasing them.
👉 Fix: Monitor memory trends using Prometheus and Grafana. If memory steadily increases and doesn’t drop, it’s likely a memory leak.

💡 𝑲𝒆𝒚 𝒕𝒂𝒌𝒆𝒂𝒘𝒂𝒚𝒔:
• Always define memory requests and limits
• Make your application container-aware
• Monitor trends, not just logs
• OOMKilled = container terminated by the system, not an app crash

This is one of the most common (and confusing) issues in Kubernetes. Have you faced something similar? 𝑾𝒐𝒖𝒍𝒅 𝒍𝒐𝒗𝒆 𝒕𝒐 𝒉𝒆𝒂𝒓 𝒉𝒐𝒘 𝒚𝒐𝒖 𝒅𝒆𝒃𝒖𝒈𝒈𝒆𝒅 𝒊𝒕 👇

#Kubernetes #DevOps #K8s #CloudNative #SRE #PlatformEngineering
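
One quick way to check whether the JVM is sizing its heap from the container limit is to print what it actually picked at startup. The sketch below is an assumption-laden illustration: the class name HeapCheck and both method names are hypothetical, and the 512 MiB and 75% figures simply reuse the example values above. The only JDK API it relies on is Runtime.getRuntime().maxMemory(). Note that on recent JDKs (10 and later, and 8u191 and later) container support is enabled by default, so the explicit flag mainly matters on older runtimes.

```java
// Hypothetical diagnostic for illustration; run it inside the container.
final class HeapCheck {
    private HeapCheck() {}

    // Maximum heap the running JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Rough heap size to expect for a given container limit and MaxRAMPercentage.
    static long expectedHeapMiB(long containerLimitMiB, double maxRamPercentage) {
        return (long) (containerLimitMiB * maxRamPercentage / 100.0);
    }

    public static void main(String[] args) {
        // Example values from the post: 512Mi limit, MaxRAMPercentage=75.0.
        System.out.println("expected heap (MiB), roughly: " + expectedHeapMiB(512, 75.0));
        System.out.println("actual max heap (MiB): " + maxHeapMiB());
    }
}
```

If the actual value is far above the expected one, the JVM is probably sizing itself from the node's memory rather than the container limit. The reported number can come out somewhat below the computed one depending on the garbage collector, so treat the comparison as a sanity check rather than an exact match.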
Justin Taylor
4w Edited
𝗪𝗵𝗮𝘁 𝗶𝗳 𝘆𝗼𝘂 𝗰𝗼𝘂𝗹𝗱 “𝗶𝗻𝘀𝘁𝗮𝗹𝗹” 𝗲𝘅𝗽𝗲𝗿𝘁𝗶𝘀𝗲 𝗶𝗻𝘁𝗼 𝘆𝗼𝘂𝗿 𝗰𝗼𝗱𝗲𝗯𝗮𝘀𝗲 𝘁𝗵𝗲 𝘀𝗮𝗺𝗲 𝘄𝗮𝘆 𝘆𝗼𝘂 𝗶𝗻𝘀𝘁𝗮𝗹𝗹 𝗮 𝗽𝗮𝗰𝗸𝗮𝗴𝗲?

That’s the idea behind 𝘀𝗸𝗶𝗹𝗹𝘀.𝘀𝗵. Instead of building custom scripts or relying on scattered tools, you apply a focused skill that knows exactly what to look for and how to evaluate it.

For .NET development, that opens up some really practical use cases:
• Performance analysis across microservices
• Identifying anti-patterns before they spread
• Enforcing architectural consistency
• Standardizing best practices across large portfolios
• Giving teams faster, more consistent feedback

I’ve been looking at the “𝗮𝗻𝗮𝗹𝘆𝘇𝗶𝗻𝗴-𝗱𝗼𝘁𝗻𝗲𝘁-𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲” skill and ran it against a microservice codebase.

𝗪𝗵𝗮𝘁 𝘀𝘁𝗼𝗼𝗱 𝗼𝘂𝘁:
• It identifies Positive Patterns, which is something most tools overlook but is incredibly useful
• It flags Critical, Medium, and Info-level findings so you can quickly prioritize
• The insights are actionable and grounded in the code, not just generic advice
• It gives a clear view of where performance risks may exist

In a larger environment, this is where it gets interesting. You could run the same skill across dozens or hundreds of services and get consistent, repeatable insights without reinventing the wheel each time. It feels less like running tools and more like applying packaged expertise directly to your codebase.

If you’re working in .NET and care about performance, this is worth checking out. https://lnkd.in/gkrSBdDk

Curious how others would use installable skills across their engineering org.

#dotnet #softwareengineering #devtools #developerexperience #performance #microservices #coding #programming #architecture #engineeringleadership
Vinesh Babu
2w
Most systems don’t fail because of technology. They fail because of how we design them.

In modern backend engineering, the real shift isn’t from monolith → microservices. It’s from code-centric thinking → system-centric thinking.

A well-written service is not valuable if:
• It can’t handle failure gracefully
• It creates hidden coupling
• It slows down deployments across teams

Real scalability comes from design discipline, not just frameworks.

Over time, I’ve realized a few things:
→ Microservices are not about splitting code; they’re about isolating failure domains
→ APIs are not interfaces; they are contracts that outlive implementations
→ Event-driven systems are not just faster; they are decoupled by design

And most importantly: “The goal is not to build services that work. The goal is to build systems that continue to work when things break.”

That’s where engineering maturity begins.

#Java #Microservices #SystemDesign #CloudArchitecture #Kafka #BackendEngineering #DevOps
Proger ITo

2w
One of the biggest backend mistakes is treating complexity like a sign of progress. ⚙️

More layers. More abstractions. More tools. More patterns. It can look impressive.

But strong engineering usually feels different:
✅ the flow is clear
✅ responsibilities are obvious
✅ failures are easier to trace
✅ changes are safer to make

The goal is not to build something that looks advanced. The goal is to build something that stays understandable when real work begins.

Because in software, complexity often grows by default. Clarity has to be designed on purpose. 🚀

#SoftwareEngineering #BackendDevelopment #SystemDesign #CleanArchitecture #DevOps
Muthukumaran Navaneethakrishnan
2w
Every engineer who's shipped a real system has seen the 80% problem. APIs come together. The demo works. The system feels almost done.

Then the second half begins, and in 1985, Tom Cargill at Bell Labs already had a name for it: "The first 90% of the code takes the first 90% of the time. The remaining 10% takes the other 90%."

LLMs compressed that first 90%. An afternoon of prompting now produces what used to take a two-week sprint. The second 90% is untouched. Edge cases, hidden assumptions, the thing nobody asked about: that's still the work. The bottleneck used to be "can we build this?" Now it's "did we think about this deeply enough?"

There's another layer most teams underestimate. A decision gets made, and then it moves. A developer explains it to another developer. Then to a manager. The CTO explains it to the CEO. Each retelling strips something: the tradeoff, the edge case, the assumption nobody wrote down. The wrong thing starts sounding right.

LLMs give a clean answer. But it's framed in one voice. And one voice rarely survives every room it enters.

So I built an agent skill, Huddle: 21 agents who work with you to ask the questions you'd otherwise discover in production or with your boss. They also build with TDD and handle documentation, brainstorming, and infra.

#SoftwareEngineering #LLM #DeveloperTools #AgentSkills https://lnkd.in/gPf596xC

The 80% Problem — And What LLMs Actually Changed muthuishere.medium.com

Kuldeep Singh
3w
In 2016, I mass-produced microservices like a factory. By 2017, I was debugging them at 2 AM on a Saturday.

Here's what 14 years taught me about microservices the hard way:

We had a monolith that "needed" to be broken up. So I split it into 23 microservices in 4 months.

Result?
- Deployment time went from 30 min to 3 hours
- Debugging a single request meant checking 7 services
- Team velocity dropped 40%
- Every "simple" feature needed changes in 5+ repos

The problem? I created a "distributed monolith." All the pain of microservices. None of the benefits.

What I learned after fixing it:
1. Start with a well-structured monolith. Split only when you MUST.
2. Each service must own its data. Shared databases = shared pain.
3. If 2 services always deploy together, they should be 1 service.
4. Invest in observability BEFORE splitting. Tracing, logging, monitoring.
5. Domain boundaries matter more than tech stack choices.

We consolidated 23 services down to 8. Deployment time dropped to 15 minutes. Team happiness went through the roof.

The best architecture is the one your team can actually maintain.

Have you ever over-engineered a system? What happened?

#systemdesign #microservices #softwarearchitecture #java #programming
Rahul Singh
1w
𝗠𝗶𝗰𝗿𝗼𝘀𝗲𝗿𝘃𝗶𝗰𝗲𝘀: 𝗣𝗼𝘄𝗲𝗿𝗳𝘂𝗹... 𝗯𝘂𝘁 𝗡𝗼𝘁 𝗙𝗿𝗲𝗲

Microservices is an architectural style where you break an application into small, independent services.

𝗘𝗮𝗰𝗵 𝘀𝗲𝗿𝘃𝗶𝗰𝗲:
• Owns a single business capability
• Has its own codebase and database
• Can be deployed independently

No more waiting for a coordinated release across teams.

🧩 𝗧𝗵𝗲 𝗖𝗼𝗿𝗲 𝗜𝗱𝗲𝗮: 𝗜𝗻𝗱𝗲𝗽𝗲𝗻𝗱𝗲𝗻𝗰𝗲
Your system becomes a collection of services:
• Users (Java)
• Orders (Go)
• Payments (Kotlin)

Each team:
• Chooses its own tech stack
• Scales independently
• Deploys on its own schedule

An API Gateway routes requests to the right service.

⚠️ 𝗧𝗵𝗲 𝗥𝗲𝗮𝗹𝗶𝘁𝘆: 𝗜𝗻𝗱𝗲𝗽𝗲𝗻𝗱𝗲𝗻𝗰𝗲 𝗛𝗮𝘀 𝗮 𝗖𝗼𝘀𝘁
Once you move to microservices, complexity shifts from code → system.

𝗡𝗼𝘄 𝘆𝗼𝘂’𝗿𝗲 𝗱𝗲𝗮𝗹𝗶𝗻𝗴 𝘄𝗶𝘁𝗵:
• Network latency instead of in-process calls
• Partial failures and retry logic
• Service discovery as instances scale dynamically
• Distributed tracing (correlation IDs) for debugging
• Eventual consistency instead of ACID transactions

Patterns like Saga become necessary to manage cross-service workflows.

🧠 𝗧𝗵𝗲 𝗛𝗼𝗻𝗲𝘀𝘁 𝗧𝗿𝘂𝘁𝗵
Microservices solve organizational scaling problems more than technical ones. They make sense when:
• Multiple teams need to move independently
• Systems are large enough to justify separation
• Deployment velocity is a bottleneck

🚫 𝗪𝗵𝗲𝗻 𝗡𝗼𝘁 𝘁𝗼 𝗨𝘀𝗲 𝗧𝗵𝗲𝗺
If you’re a small team: Microservices will likely introduce more operational overhead than value. You’ll spend more time managing:
• Infrastructure
• Communication
• Observability
…than actually building features.

✅ 𝗔 𝗕𝗲𝘁𝘁𝗲𝗿 𝗔𝗽𝗽𝗿𝗼𝗮𝗰𝗵
Start with a monolith. Keep boundaries clean. Extract services only when you have a clear reason:
• Scaling bottlenecks
• Team ownership boundaries
• Deployment constraints
Not because it’s “modern.”

🎯 𝗙𝗶𝗻𝗮𝗹 𝗧𝗵𝗼𝘂𝗴𝗵𝘁
Architecture is about trade-offs, not trends. The best system isn’t the most distributed one. It’s the one your team can build, understand, and evolve efficiently.

💬 Have you seen microservices simplify or complicate your projects?
💾 𝗦𝗮𝘃𝗲 𝘁𝗵𝗶𝘀 𝗳𝗼𝗿 𝘀𝘆𝘀𝘁𝗲𝗺 𝗱𝗲𝘀𝗶𝗴𝗻 𝗱𝗶𝘀𝗰𝘂𝘀𝘀𝗶𝗼𝗻𝘀 ♻ 𝗥𝗲𝗽𝗼𝘀𝘁 𝘁𝗼 𝗵𝗲𝗹𝗽 𝗼𝘁𝗵𝗲𝗿 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝘀 👥 𝗦𝗵𝗮𝗿𝗲 𝘄𝗶𝘁𝗵 𝘆𝗼𝘂𝗿 𝘁𝗲𝗮𝗺 #SoftwareEngineering #Microservices #SystemDesign #BackendDevelopment #Architecture #DistributedSystems #Programming #TechLeadership #Coding