Partial Failure in Distributed Systems: Failing Gracefully with Fallbacks

Partial Failure (When Only Part of Your System Breaks)
---
Built:-
A service that aggregates data from multiple services:
User service
Order service
Recommendation service
All combined into one response.
---
Problem I faced:-
Everything worked fine… until one dependency started failing.
Then:
Entire API failed
Even though other services were working
Users saw errors for everything
One small failure took down the whole response.
---
What was really happening:-
This was a partial failure.
Only one service failed… but the system treated it like a full failure.
* No isolation
* No fallback
* No graceful handling
---
How I fixed it:-
Instead of failing everything:
Added fallback responses for optional services
Marked some data as non-critical
Used timeouts + circuit breakers
Returned partial responses where possible
Now:
Core data always loads
Optional features degrade gracefully
System stays usable even during failures
---
What I learned:-
In distributed systems, failure is normal.
The goal is not to avoid failure. It’s to limit its impact.
---
Simple mental model:-
If one feature breaks, the whole app shouldn’t feel broken.
---
Carousel Breakdown:-
Slide 1 → One service fails
Slide 2 → Entire API fails
Slide 3 → Identify partial failure
Slide 4 → Add fallbacks
Slide 5 → Return partial response
Slide 6 → System stays usable
---
Question:-
If one dependency in your system goes down, does your API fail completely… or degrade gracefully?
#Java #SpringBoot #Programming #SoftwareDevelopment #Cloud #AI #Coding #Learning #Tech #Technology #WebDevelopment #Microservices #API #Database #SpringFramework #Hibernate #MySQL #BackendDevelopment #CareerGrowth #ProfessionalDevelopment #RDBMS #PostgreSQL #backend
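A minimal sketch of the "partial response" idea, in plain Java with CompletableFuture. Everything named here (Dashboard, DashboardAggregator, the degraded flag, the choice of user and order data as "core" and recommendations as "optional") is an assumption for illustration, not the code from the post. Circuit breakers and timeouts on the core calls are left out to keep it short.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical response shape; "degraded" tells the client that optional data is missing.
record Dashboard(String user, List<String> orders, List<String> recommendations, boolean degraded) {}

class DashboardAggregator {

    static Dashboard load(Supplier<String> userCall,
                          Supplier<List<String>> orderCall,
                          Supplier<List<String>> recommendationCall,
                          long optionalTimeoutMs) {
        // Core data (assumed): if either call fails, join() throws and the request fails.
        CompletableFuture<String> user = CompletableFuture.supplyAsync(userCall);
        CompletableFuture<List<String>> orders = CompletableFuture.supplyAsync(orderCall);

        // Optional data (assumed): a timeout or an error becomes an empty fallback.
        CompletableFuture<Optional<List<String>>> recs = CompletableFuture
                .supplyAsync(() -> Optional.of(recommendationCall.get()))
                .completeOnTimeout(Optional.empty(), optionalTimeoutMs, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> Optional.empty());

        Optional<List<String>> recommendations = recs.join();
        return new Dashboard(user.join(), orders.join(),
                recommendations.orElse(List.of()), recommendations.isEmpty());
    }
}
```

With this shape, a broken recommendation call yields a response with an empty recommendation list and degraded set to true, while a broken user or order call still fails the request.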


More Relevant Posts

Umar Ashraf Lone
1w
Dead Letter Queue (When Messages Keep Failing Silently)
---
Built:-
A background system processing messages from a queue (orders, emails, events).
---
Problem I faced:-
Everything worked fine… until some messages started failing.
Then:
Same message kept retrying
Logs kept growing
Queue got slower
Some messages were never processed successfully
Worse part? Failures were getting buried in retries.
---
What was really happening:-
Messages were failing repeatedly with no exit path.
Every retry pushed them back into the queue.
They kept coming back… again and again.
The system was stuck in a loop.
---
How I fixed it:-
Introduced a Dead Letter Queue (DLQ).
Instead of retrying forever:
Set a max retry limit
After limit → move message to DLQ
Logged and monitored failed messages
Added manual or automated reprocessing
Now:
Queue stays clean
Failures are isolated
No infinite retry loops
---
What I learned:-
Not every message should be retried forever.
Some failures need attention, not repetition.
---
Simple mental model:-
Think of DLQ like a “quarantine zone”.
Healthy messages → processed normally
Problematic messages → isolated for inspection
---
Carousel Breakdown:-
Slide 1 → Messages failing repeatedly
Slide 2 → Infinite retries
Slide 3 → Queue slowdown
Slide 4 → Introduce DLQ
Slide 5 → Move failed messages
Slide 6 → Inspect & reprocess
---
Question:-
In your system, what happens to messages that keep failing… do they stop somewhere, or retry forever?
#Java #SpringBoot #Programming #SoftwareDevelopment #Cloud #AI #Coding #Learning #Tech #Technology #WebDevelopment #Microservices #API #Database #SpringFramework #Hibernate #MySQL #BackendDevelopment #CareerGrowth #ProfessionalDevelopment #RDBMS #PostgreSQL #backend
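The "max retry limit, then DLQ" rule can be sketched with in-memory collections. This is only an illustration under assumptions: RetryingConsumer, Envelope and the attempt counter are hypothetical names, and a real broker (Kafka, RabbitMQ, SQS) would provide its own dead-letter configuration instead of this hand-rolled loop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

class RetryingConsumer {
    // Hypothetical wrapper that carries the retry count with the message.
    record Envelope(String payload, int attempts) {}

    private final Deque<Envelope> mainQueue = new ArrayDeque<>();
    private final List<Envelope> deadLetters = new ArrayList<>();
    private final int maxAttempts;

    RetryingConsumer(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void publish(String payload) {
        mainQueue.addLast(new Envelope(payload, 0));
    }

    // Process until the main queue is empty; failing messages are retried
    // up to maxAttempts times and then moved to the dead-letter list.
    void drain(Consumer<String> handler) {
        while (!mainQueue.isEmpty()) {
            Envelope msg = mainQueue.pollFirst();
            try {
                handler.accept(msg.payload());
            } catch (RuntimeException e) {
                Envelope failed = new Envelope(msg.payload(), msg.attempts() + 1);
                if (failed.attempts() >= maxAttempts) {
                    deadLetters.add(failed);    // quarantine for inspection
                } else {
                    mainQueue.addLast(failed);  // retry later, behind healthy messages
                }
            }
        }
    }

    List<Envelope> deadLetters() {
        return List.copyOf(deadLetters);
    }
}
```

Because the loop always terminates, a poison message can no longer keep the queue busy forever; it ends up in deadLetters() with its attempt count, ready for logging or reprocessing.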
Shivanshu Raj
1w Edited
Built a production-grade backend from scratch. Here's what I learned.

TaskAlloc is an employee and task allocation REST API I built with FastAPI and PostgreSQL. Not a tutorial follow-along: I designed the architecture, made the decisions, and figured out why things break.

What's under the hood:
→ 3-tier role system (Admin / Manager / Employee) with access enforced at the query layer, not just filtered in the response
→ JWT auth with refresh token rotation. Raw tokens never touch the database; only SHA-256 hashes are stored. If the DB leaks, the tokens are useless.
→ Task state machine: PENDING → IN_PROGRESS → UNDER_REVIEW → COMPLETED. Invalid transitions are rejected before any database write.
→ Middleware that auto-logs every mutating request with who did it, what resource they touched, and the HTTP status code
→ 67 passing tests against SQLite in-memory. No external database needed to run the suite.

35+ endpoints. Soft delete. UUID primary keys. Docker + Docker Compose. Full Swagger docs.

The thing that surprised me most was how much I learned from just trying to do things the right way: not "make it work" but "make it work correctly." Things like why audit logs shouldn't have a foreign key to users, or why you write the activity log before the status update commits.

GitHub in the comments.
#FastAPI #Python #BackendDevelopment #PostgreSQL #SoftwareEngineering #BuildingInPublic #OpenToOpportunities #Development
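The task state machine described above can be sketched as a transition table that is checked before anything is written. TaskAlloc itself is a Python/FastAPI project; this Java version, including the names TaskStatus and TaskStateMachine, is a hypothetical illustration and allows only the four forward transitions listed in the post.

```java
import java.util.Map;
import java.util.Set;

enum TaskStatus { PENDING, IN_PROGRESS, UNDER_REVIEW, COMPLETED }

class TaskStateMachine {
    // Allowed next states per current state (assumed: forward moves only).
    private static final Map<TaskStatus, Set<TaskStatus>> ALLOWED = Map.of(
            TaskStatus.PENDING, Set.of(TaskStatus.IN_PROGRESS),
            TaskStatus.IN_PROGRESS, Set.of(TaskStatus.UNDER_REVIEW),
            TaskStatus.UNDER_REVIEW, Set.of(TaskStatus.COMPLETED),
            TaskStatus.COMPLETED, Set.of());

    // Validate first; the caller performs the database write only if this returns.
    static TaskStatus transition(TaskStatus from, TaskStatus to) {
        if (!ALLOWED.get(from).contains(to)) {
            throw new IllegalStateException("Invalid transition: " + from + " -> " + to);
        }
        return to;
    }
}
```

Keeping the table in one place means a new state or an extra allowed edge is a one-line change, and every endpoint rejects invalid moves the same way.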

Umar Ashraf Lone
2w
Sometimes your system isn’t slow because of heavy logic.
It’s slow because it’s waiting.
Waiting for:
another service
a database
an external API
And while it waits, threads just sit there doing nothing.
---
This is where Async Processing helps
The idea is simple: Don’t block. Do the work later.
---
What this looks like
Instead of doing everything in one request:
User places an order
System saves order immediately
Email is sent later
Notification is processed in background
The user doesn’t wait for everything.
---
How it’s usually done
Background jobs
Message queues (Kafka, RabbitMQ)
@Async in Spring Boot
You move non-critical work out of the main flow.
---
Why this matters
Without async:
Requests take longer
Threads stay blocked
System struggles under load
With async:
Faster response times
Better scalability
Smoother user experience
---
Real-world example
When you upload a file:
You don’t wait for processing
You get a response quickly
Processing happens in background
---
Trade-offs
Async adds complexity:
Harder to debug
Requires retry handling
Failures are not immediate
---
Simple takeaway
Not everything needs to happen right now.
---
If your system is slow, how much of that work actually needs to be done synchronously?
#Java #SpringBoot #Programming #SoftwareDevelopment #Cloud #AI #Coding #Learning #Tech #Technology #WebDevelopment #Microservices #API #Database #SpringFramework #Hibernate #MySQL #BackendDevelopment #CareerGrowth #ProfessionalDevelopment #RDBMS #PostgreSQL #backend
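The order example can be sketched with a plain ExecutorService standing in for @Async or a message queue. All names here (OrderService, placeOrder, the in-memory lists, the 100 ms pretend email delay) are assumptions for illustration; retry handling, one of the trade-offs listed above, is deliberately not shown.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class OrderService {
    private final ExecutorService background = Executors.newFixedThreadPool(2);
    final List<String> savedOrders = new CopyOnWriteArrayList<>();
    final List<String> sentEmails = new CopyOnWriteArrayList<>();

    // Critical work runs in the request; non-critical work is handed to the pool.
    String placeOrder(String orderId) {
        savedOrders.add(orderId);                      // save immediately
        background.submit(() -> sendEmail(orderId));   // send later
        return "accepted:" + orderId;                  // respond without waiting
    }

    private void sendEmail(String orderId) {
        try {
            Thread.sleep(100);                         // pretend the mail server is slow
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        sentEmails.add(orderId);
    }

    // Stop accepting work and wait for queued jobs; returns false on timeout.
    boolean shutdownAndWait(long timeoutMs) {
        background.shutdown();
        try {
            return background.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

placeOrder returns as soon as the order is stored; the email shows up in sentEmails only after the background job has run, which is exactly why failures in that part are no longer visible to the caller.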
Shivshankar Jha
3w
An API issue that wasn’t actually an API problem…

Recently, I faced a production issue where an API response was taking much longer than expected.

At first, everything pointed towards the API layer:
- Code looked clean
- No exceptions
- No obvious bottlenecks

But users were still experiencing delays…

So I started digging deeper 👇
✔️ Checked API logs → all good
✔️ Verified business logic → no issue
✔️ Then analyzed database queries

And that’s where the real problem was. A poorly optimized SQL query was slowing down the entire response. Even though the API was working perfectly, it depended on inefficient data retrieval.

🔍 Fix:
- Optimized the SQL query
- Added proper indexing
- Removed unnecessary joins

⚡ Result: Response time improved drastically 🚀

💡 Lesson: In backend systems, API performance is only as good as your database performance.

Working on real systems taught me one thing: issues are rarely where they appear.

Still learning and improving every day as a backend engineer 💻

👉 Have you faced a situation where the issue was somewhere other than expected?
#BackendDevelopment #DotNet #SQL #API #SoftwareEngineering #ProductionIssues #TechLearning
Umar Ashraf Lone
1w Edited
Timeouts (The Small Setting That Saves Your System)
---
Built:-
A service calling multiple downstream APIs to fetch and aggregate data.
---
Problem I faced:-
Everything worked fine… until one dependency slowed down.
Then suddenly:
Requests started hanging
Thread pool got exhausted
API response time shot up
Entire service became slow
All because one service was taking too long.
---
How I fixed it:-
The issue was missing timeouts. Requests were waiting indefinitely.
Fixes applied:
Added strict timeouts for all external calls
Used fallback responses where possible
Combined with circuit breaker for failing services
Monitored slow calls with proper logging
Now:
Slow services don’t block everything
System fails fast instead of hanging
Overall stability improved
---
What I learned:-
A slow dependency is sometimes worse than a failed one.
At least failures are quick. Slow calls quietly kill your system.
---
Question:-
Do your API calls have proper timeouts… or are they waiting forever without you noticing?
#Java #SpringBoot #Programming #SoftwareDevelopment #Cloud #AI #Coding #Learning #Tech #Technology #WebDevelopment #Microservices #API #Database #SpringFramework #Hibernate #MySQL #BackendDevelopment #CareerGrowth #ProfessionalDevelopment #RDBMS #PostgreSQL #backend
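As one possible way to apply "strict timeouts for all external calls", here is a sketch using the JDK's built-in java.net.http client. The post does not say which HTTP client was used, so the client choice, the 2 and 3 second values, and the names TimeoutConfig and fetchOrFallback are all assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class TimeoutConfig {
    // Limit on how long establishing the connection may take (assumed value).
    static HttpClient client() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
    }

    // Limit on how long the whole request may wait for a response (assumed value).
    static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();
    }

    // Fail fast: a timeout surfaces as an IOException subclass and becomes the fallback.
    static String fetchOrFallback(HttpClient client, HttpRequest request, String fallback) {
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```

Without the two Duration settings the client has no upper bound on waiting, which is the "requests were waiting indefinitely" situation described above. A production version would also log the slow call before returning the fallback.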
Ravindranath Porandla
2w
“Backend is dead.” But that statement is a bit misleading.

In many modern apps, especially CRUD-heavy ones, some responsibilities traditionally handled by the backend can actually be moved into the database.

One interesting approach is using Row-Level Security (RLS). Instead of writing authorization logic in every API endpoint, the database itself can enforce rules like:
• Users can only read their own data
• Users can only update rows they own
• Unauthorized rows never get returned

This shifts part of the security logic closer to the data itself.

In this article, I explored:
• Why direct frontend → database access is normally dangerous
• How Row-Level Security (RLS) changes that equation
• How to design policies for real applications (SELECT, INSERT, UPDATE, DELETE)
• What people often miss in production setups
• And when you still absolutely need a backend

The backend isn’t dead, but the way we design it is evolving.

📖 Read the article: https://lnkd.in/giwefsnH

I’m also documenting what I’m learning along the way.
Backend Engineering from First Principles: https://lnkd.in/g7MJ6TnP
System Design from First Principles (HLD + LLD): https://lnkd.in/gateH6Jz
#BackendEngineering #SystemDesign #DatabaseDesign #PostgreSQL #SoftwareEngineering
Aditya Prasad
4w
Ever had a system challenge you not because of faulty logic, but because of the way it was built to handle scale?

I recently ran into a few interesting issues while developing a new data sourcing job. I was working with a cursor-based paginated API that returned data in batches of 500 records per request, and the final response was huge. The learnings were too valuable not to share.

🔴𝗣𝗿𝗼𝗯𝗹𝗲𝗺 𝟭: 𝗢𝘂𝘁𝗢𝗳𝗠𝗲𝗺𝗼𝗿𝘆𝗘𝗿𝗿𝗼𝗿 𝗳𝗿𝗼𝗺 𝗦𝘁𝗿𝗶𝗻𝗴𝗕𝘂𝗶𝗹𝗱𝗲𝗿
The application crashed with: java.lang.OutOfMemoryError at AbstractStringBuilder.hugeCapacity()
👉 Root cause: After fetching the final response, the JSON had to be transformed by appending "\n" after every object, which built a string of several GB in memory.
✅ Fix: Switched to chunk-based processing (~1 GB chunks), writing each chunk to the data lake through the Azure process instead of accumulating everything in memory first.

🟠 𝗣𝗿𝗼𝗯𝗹𝗲𝗺 𝟮: 𝗚𝗖 𝗢𝘃𝗲𝗿𝗵𝗲𝗮𝗱 𝗟𝗶𝗺𝗶𝘁 𝗘𝘅𝗰𝗲𝗲𝗱𝗲𝗱
Frequent GC pauses due to large in-memory lists holding paginated API data.
👉 Root cause: Storing the entire dataset in lists before DB insertion.
✅ Fix: Inserted data in batches and cleared the lists after each batch, reducing memory pressure significantly.

🔵 𝗣𝗿𝗼𝗯𝗹𝗲𝗺 𝟯: 𝗣𝗿𝗲𝗺𝗮𝘁𝘂𝗿𝗲 𝗔𝗣𝗜 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻 𝗖𝗹𝗼𝘀𝘂𝗿𝗲
Connection drops produced incomplete responses, which then failed JSON mapping into our data model.
👉 Root cause: Default HTTP socket timeout too low for large payloads.
✅ Fix: Increased the socket timeout to match the requirement, ensuring full data retrieval.

💡 Key Takeaways:
Never build massive objects in memory; stream or chunk your data.
Always release memory proactively in long-running processes.
Tune timeouts and resource configs based on real workload, not defaults.

These issues reinforced an important lesson:
👉 Efficient memory and resource management is as critical as writing correct logic, and even the basics of CSE101 come into the picture.
#systemdesign #scaling #java #api #development #cloud #datalake
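The fix for Problem 2 (insert in batches, clear the list after each batch) can be sketched as below. BatchWriter, writeInBatches and the sink callback are hypothetical names; in the real job the sink would be the DB insert or the data lake write, and the pages would come from the paginated API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class BatchWriter {
    // Consume pages one at a time and flush every batchSize items, so at most
    // one batch (plus the current page) is held in memory at any moment.
    static <T> int writeInBatches(Iterator<List<T>> pages, int batchSize, Consumer<List<T>> sink) {
        List<T> buffer = new ArrayList<>(batchSize);
        int batches = 0;
        while (pages.hasNext()) {
            for (T item : pages.next()) {
                buffer.add(item);
                if (buffer.size() == batchSize) {
                    sink.accept(List.copyOf(buffer)); // e.g. one batched DB insert
                    buffer.clear();                   // release references right away
                    batches++;
                }
            }
        }
        if (!buffer.isEmpty()) {                      // flush the final partial batch
            sink.accept(List.copyOf(buffer));
            buffer.clear();
            batches++;
        }
        return batches;
    }
}
```

The same shape covers Problem 1 as well: the sink can append the newline-delimited JSON for one batch to the output instead of growing a single StringBuilder.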

Sk Nazibul Hossain
1mo
🚀 𝐁𝐮𝐢𝐥𝐭 𝐚 𝐏𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧-𝐒𝐭𝐲𝐥𝐞 𝐁𝐚𝐜𝐤𝐞𝐧𝐝 𝐢𝐧 𝐆𝐨 𝐰𝐢𝐭𝐡 𝐏𝐨𝐬𝐭𝐠𝐫𝐞𝐒𝐐𝐋 (𝐃𝐨𝐜𝐤𝐞𝐫𝐢𝐳𝐞𝐝)

Recently I moved from an in-memory store to a real DB and went beyond basic DB connectivity to build a near production-style backend service in Go.

🔧 𝐓𝐞𝐜𝐡 𝐒𝐭𝐚𝐜𝐤:
Go (net/http, database/sql)
PostgreSQL (running via Docker)
REST API

🧱 𝐖𝐡𝐚𝐭 𝐈 𝐢𝐦𝐩𝐥𝐞𝐦𝐞𝐧𝐭𝐞𝐝:
🔹 Dockerized Database
Ran PostgreSQL using Docker
🔹 Clean Architecture
Structured project as: main → handler → store → database
Separated HTTP logic from database layer
🔹 Database Layer
Used INSERT with RETURNING id for efficient writes
Implemented QueryRow for single-row queries and Query for multi-row queries
🔹 Production Practices
Context-aware DB calls (context.WithTimeout)
Connection pooling (SetMaxOpenConns, etc.)
Proper error handling (avoiding log.Fatal in business logic)
🔹 API Endpoints
POST /products → create product
GET /products → fetch all products

💡 𝐊𝐞𝐲 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠𝐬:
Difference between driver vs database/sql
Why RETURNING matters in PostgreSQL
How real backend services are structured

📈 𝐖𝐡𝐚𝐭’𝐬 𝐧𝐞𝐱𝐭:
Transactions (for real-world scenarios like payments)
Exploring pgx for high-performance database access

This project helped me bridge the gap between “it works” and “it’s production-ready.”
#golang #postgresql #docker #backend #softwareengineering #learninginpublic
Mohitt Chopra
1mo
Day 14.

My API worked perfectly. Until it hit 100 users. Because of queries I didn’t even know existed.

I had this:

    List<Order> orders = orderRepository.findAll();
    for (Order order : orders) {
        // Each getUser() on a lazily loaded association triggers its own query
        System.out.println(order.getUser().getName());
    }

Looks clean. It’s not.

Here’s what was actually happening:
→ 1 query to fetch all orders
→ 1 query per order to fetch the user
→ 100 orders = 101 queries hitting your database

That’s the N+1 problem. And it hides in plain sight. Your code looks clean. Your database is suffering. And you probably have this in your codebase right now.

The fix is simple: fetch what you need, in one query.

    @Query("SELECT o FROM Order o JOIN FETCH o.user")
    List<Order> findAllWithUser();

One query. One JOIN. Everything loaded together.

What actually changes:
→ Performance: 101 queries become 1
→ Scalability: the lazy version works at 100 rows and breaks at 100,000
→ Visibility: you won’t notice the problem until production

The hard truth:
→ ORMs make it easy to write slow code
→ Lazy loading is convenient until it isn’t
→ You need to know what SQL your code is generating

Writing code that works is easy. Writing code that doesn’t silently destroy your database is the real skill.

Are you logging your SQL queries in development? If not, you should be. 👇 Drop it below
#SpringBoot #Java #Hibernate #BackendDevelopment #JavaDeveloper

Dylan Bitencourt Gonçalves
2w
Stop exposing your Database Entities! 🛑

Why the DTO Pattern is non-negotiable for Spring Boot developers. Most developers learn this the hard way.

Your database schema and your API contract are two completely different things. If you're returning @Document or @Entity classes directly from your REST controllers, you're opening a Pandora's box of security leaks, tight coupling, and maintenance pain.

The fix is the DTO (Data Transfer Object) Pattern. 🛡️ Think of a DTO as a bouncer: it decides exactly what data gets in and what gets out.

🚀 4 reasons you should care:
1️⃣ Security: Your User entity holds a passwordHash and sensitive internal fields. A DTO ensures you never accidentally expose them.
2️⃣ Decoupling: Renamed a MongoDB field? Update the mapper and your API contract stays untouched. No breaking changes for your clients.
3️⃣ Validation: Use @Email, @NotBlank, @Size on your DTOs to stop bad data before it ever reaches your service layer.
4️⃣ Performance: Stop serializing 50-field documents when the client only needs 3. Shape your payload. Ship less data.

🔄 Map with MapStruct: compile-time, type-safe, reflection-free, and widely used.

The DTO pattern isn't overhead. It's discipline, and in production it pays for itself fast.

Still returning Entities directly? Or already using DTOs? Tell me in the comments 👇
#Java #SpringBoot #MongoDB #BackendDevelopment #SoftwareArchitecture #DTO #CleanCode #APIDesign #CodingTips
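A minimal sketch of the security and decoupling points, using Java records and a hand-written mapper so that it runs without any dependencies. UserEntity, UserResponse and UserMapper are hypothetical names, and the manual mapper is swapped in for MapStruct here purely for self-containment; with MapStruct the mapping method would be generated from an interface instead.

```java
// Persistence-side shape (assumed fields); never returned from a controller.
record UserEntity(String id, String email, String displayName, String passwordHash) {}

// API-side shape: only the fields the client is allowed to see.
record UserResponse(String id, String displayName) {}

class UserMapper {
    // The single place that decides what leaves the service.
    static UserResponse toResponse(UserEntity entity) {
        return new UserResponse(entity.id(), entity.displayName());
    }
}
```

Because passwordHash is not a component of UserResponse, it cannot leak through serialization by accident, and renaming a field on UserEntity only requires touching toResponse.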