Bad Architecture Causes More Production Pain Than High Traffic

Hot take for backend engineers: Most teams do not have a scaling problem. They have a design problem.

When a system slows down, the first reaction is usually:
- add retries
- add more pods
- add caching
- add a queue
- split another service

That feels like engineering. But a lot of the time, the real issue is simpler:
- chatty service-to-service calls
- bad timeout values
- no backpressure
- weak DB access patterns
- too many synchronous dependencies in one request path

I’ve seen systems with moderate traffic behave like they were under massive load. Not because traffic was insane. Because the architecture was burning resources on every request.

That’s why “we need to scale” is often the wrong diagnosis. Sometimes the system does not need more infrastructure. It needs fewer moving parts.

Debate: What causes more production pain in real systems?
A) high traffic
B) bad architecture
C) poor database design
D) weak observability

My vote: B first, C second. What’s yours?

#Java #SpringBoot #Microservices #DistributedSystems #BackendEngineering
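
To put a number on "too many synchronous dependencies in one request path", here is a minimal sketch of the worst-case arithmetic. The class name, the five dependencies, and the 2-second timeouts are assumptions for illustration, not figures from the post, and backoff delays between retries are ignored.

```java
import java.util.List;

// Hypothetical model of one request path: each dependency is called
// synchronously, one after another, with its own timeout.
class RequestPathBudget {

    // Worst case for a sequential path: every attempt runs until its timeout,
    // and every retry does the same again.
    static long worstCaseMillis(List<Long> timeoutsMillis, int retriesPerCall) {
        long total = 0;
        for (long timeout : timeoutsMillis) {
            total += timeout * (1 + retriesPerCall);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed example: five dependencies, 2 s timeout each.
        List<Long> timeouts = List.of(2000L, 2000L, 2000L, 2000L, 2000L);
        System.out.println(worstCaseMillis(timeouts, 0)); // prints 10000
        System.out.println(worstCaseMillis(timeouts, 2)); // prints 30000
    }
}
```

Under these assumptions, adding two retries per call triples the time a single slow request can hold a thread, which is one way moderate traffic ends up looking like massive load.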

2 Comments

Dhushyanth Reddy 3w

My rule now: Before adding retries, queues, or more services, I ask one question: Did we actually fix the bottleneck, or did we just spread it around?

Izabel Gershanik 3d

There are two principles that people often forget:
A - good code is code in which no line can be removed and none needs to be added.
B - you are evaluated by the impact of your code, not the amount of it.

More Relevant Posts

Ezhil Karthikeyan
3w
While working through backend scalability issues, I’ve been spending more time thinking about how systems behave once things stop going as expected.

A lot of designs look fine until:
• consumers start lagging
• retries pile up
• duplicate events show up
• downstream services slow down
• one bad message starts affecting the pipeline

That is usually where the real backend work begins. Concepts like backpressure, idempotency, and DLQ sound simple on paper, but they become very real once systems are under load or dependencies start failing.

Over time, one thing has become clearer to me: A reliable system is not one that avoids failure. It is one that can absorb failure without losing correctness.

That is where a lot of backend engineering really lives - not just in building features, but in designing systems that can safely handle retry, delay, duplication, and partial failure.

Still learning, but spending more time appreciating the engineering behind resilient distributed systems.

#Java #SpringBoot #BackendEngineering #DistributedSystems #SystemDesign #Microservices #Kafka #RabbitMQ #ScalableSystems #SoftwareEngineering #EventDrivenArchitecture
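
Idempotency and a dead-letter queue (DLQ) can be sketched in a few lines. This is an in-memory illustration only: the class and method names are hypothetical, and a real consumer on Kafka or RabbitMQ would keep processed IDs and dead letters in durable storage.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// In-memory stand-in for a message consumer (all names are hypothetical).
class IdempotentConsumer {
    private final Set<String> processedIds = new HashSet<>();
    private final List<String> deadLetters = new ArrayList<>();
    private final int maxAttempts;
    private final Consumer<String> handler;

    IdempotentConsumer(int maxAttempts, Consumer<String> handler) {
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    // Returns true only if the handler ran successfully for this delivery.
    boolean deliver(String messageId, String payload) {
        if (processedIds.contains(messageId)) {
            return false; // duplicate delivery: the effect was already applied
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(payload);
                processedIds.add(messageId);
                return true;
            } catch (RuntimeException e) {
                // Swallow and retry until the attempts run out.
            }
        }
        deadLetters.add(messageId); // park the bad message so the pipeline keeps moving
        return false;
    }

    List<String> deadLetters() {
        return deadLetters;
    }

    public static void main(String[] args) {
        List<String> applied = new ArrayList<>();
        IdempotentConsumer consumer = new IdempotentConsumer(3, payload -> {
            if (payload.equals("bad")) {
                throw new IllegalStateException("poison message");
            }
            applied.add(payload);
        });
        consumer.deliver("m1", "ok");
        consumer.deliver("m1", "ok");  // duplicate, skipped
        consumer.deliver("m2", "bad"); // fails three times, then dead-lettered
        System.out.println(applied);                // prints [ok]
        System.out.println(consumer.deadLetters()); // prints [m2]
    }
}
```

The duplicate check keeps a redelivered message from being applied twice, and the dead-letter list stops one bad message from blocking everything behind it.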
Harsh P Parnerkar
1w
Lessons from Real Backend Systems

Short reflections from building and maintaining real backend systems, focusing on Java, distributed systems, and the tradeoffs we don’t talk about enough.

⸻

We had logs everywhere. Still couldn’t explain the outage.

At first, it didn’t make sense. Every service was logging. Errors were captured. Dashboards were green just minutes before the failure. But when the system broke, the answers weren’t there.

What we had:
[Service A Logs] [Service B Logs] [Service C Logs]

What we needed:
End-to-end understanding of a single request

The issue wasn’t lack of data. It was lack of context. Logs told us what happened inside each service. They didn’t tell us how a request moved across the system.

That’s when we realized: Observability is not about collecting signals. It’s about connecting them.

At scale, debugging requires three perspectives working together:
Logs → What happened?
Metrics → When and how often?
Traces → Where did it happen across services?

Without correlation, each signal is incomplete.

The turning point was introducing trace context propagation.

[Request ID / Trace ID]
↓
Flows across all services
↓
Reconstruct full execution path

Now, instead of guessing:
* We could trace a failing request across services
* Identify latency bottlenecks precisely
* Understand failure propagation

Architectural insight: Observability should be designed alongside the system, not added after incidents. If you cannot explain how a request flows through your system, you cannot reliably debug it.

Takeaway: Logs help you inspect components. Observability helps you understand systems.

Which signal do you rely on most during incidents: logs, metrics, or traces?

Writing weekly about backend systems, architectural tradeoffs, and lessons learned through production systems.

Keywords: #Observability #DistributedSystems #SystemDesign #BackendEngineering #SoftwareArchitecture #Microservices #Tracing #Monitoring #ScalableSystems
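
The trace ID flow described above can be sketched without any tracing library. The three services and the shared log list below are hypothetical stand-ins. In a real system the ID would usually travel in a request header (the W3C Trace Context standard defines a `traceparent` header for this) and the logs would sit in an aggregator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical three-service call chain. The point is only that every log
// line carries the same trace ID, so one request can be followed end to end.
class TraceDemo {
    static final List<String> LOGS = new ArrayList<>(); // stand-in for a log aggregator

    static void log(String traceId, String service, String message) {
        LOGS.add(traceId + " [" + service + "] " + message);
    }

    static void serviceA(String traceId) {
        log(traceId, "service-a", "received request");
        serviceB(traceId); // the ID is passed along with the call
    }

    static void serviceB(String traceId) {
        log(traceId, "service-b", "calling service-c");
        serviceC(traceId);
    }

    static void serviceC(String traceId) {
        log(traceId, "service-c", "query finished");
    }

    // Reconstructs the path of one request by filtering on its trace ID.
    static List<String> pathOf(String traceId) {
        List<String> path = new ArrayList<>();
        for (String line : LOGS) {
            if (line.startsWith(traceId + " ")) {
                path.add(line);
            }
        }
        return path;
    }

    public static void main(String[] args) {
        String first = UUID.randomUUID().toString();
        String second = UUID.randomUUID().toString();
        serviceA(first);
        serviceA(second); // a second request whose lines share the same log store
        pathOf(first).forEach(System.out::println); // three lines, A then B then C
    }
}
```

Filtering by one ID turns three separate per-service logs into a single ordered story for that request.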
Pradip Nair
1w Edited
Over the past few months, our team has been facing a reality many engineering teams know well: frequent performance incidents, daily escalations, and growing technical debt in a backend that was never designed to handle heavy load. #Java #BackendDevelopment #SystemDesign #PerformanceEngineering #SQL #Scalability #SoftwareArchitecture #TechLeadership

Vaddi Manoj Kumar
4w Edited
🚀 Day 10/45 – Backend Engineering (Concurrency)

Today I focused on how concurrent requests impact backend systems.

💡 What I learned:

🔹 Problem: Multiple requests accessing/modifying shared data can lead to:
* Inconsistent data
* Race conditions
* Hard-to-debug issues

---

🔹 Example: Two users updating the same record at the same time
👉 Final state becomes unpredictable ❌

---

🔹 Solutions:
* Synchronization (use carefully)
* Locks (ReentrantLock)
* Optimistic locking (versioning in DB)
* Avoid shared mutable state

---

🔹 In real backend systems:
* APIs are hit concurrently
* Thread safety is critical
* Poor handling = production bugs

---

🛠 Practical: Explored how concurrent updates affect data consistency and how locking strategies help maintain integrity.

---

📌 Real-world impact: Proper concurrency handling:
* Prevents data corruption
* Ensures consistency
* Makes systems reliable under load

https://lnkd.in/gJqEuQQs

#Java #BackendDevelopment #Concurrency #Multithreading #SystemDesign
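
Optimistic locking with a version number can be sketched in memory as follows. This is an illustration under assumptions: the class, the field names, and the SQL in the comment are hypothetical, and an `AtomicReference` stands in for a database row with a version column.

```java
import java.util.concurrent.atomic.AtomicReference;

// In-memory stand-in for a row with a version column (hypothetical names).
class VersionedAccount {
    // Immutable snapshot of the row: the value plus the version it was read at.
    static final class Row {
        final long balance;
        final int version;

        Row(long balance, int version) {
            this.balance = balance;
            this.version = version;
        }
    }

    private final AtomicReference<Row> current = new AtomicReference<>(new Row(100, 0));

    Row read() {
        return current.get();
    }

    // Plays the role of a statement such as
    //   UPDATE account SET balance = ?, version = version + 1 WHERE id = ? AND version = ?
    // The write succeeds only if nobody replaced the row since `snapshot` was read.
    // compareAndSet compares references, so pass the exact object returned by read().
    boolean update(Row snapshot, long newBalance) {
        Row next = new Row(newBalance, snapshot.version + 1);
        return current.compareAndSet(snapshot, next);
    }

    public static void main(String[] args) {
        VersionedAccount account = new VersionedAccount();
        Row seenByAlice = account.read();
        Row seenByBob = account.read(); // both users read version 0

        System.out.println(account.update(seenByAlice, 150)); // prints true
        System.out.println(account.update(seenByBob, 80));    // prints false (stale read)

        Row fresh = account.read(); // Bob re-reads and retries on current data
        System.out.println(account.update(fresh, fresh.balance - 20)); // prints true
        System.out.println(account.read().balance + " v" + account.read().version); // prints 130 v2
    }
}
```

The second writer is told its read was stale instead of silently overwriting the first update, which is what makes the final state predictable again.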
Kishan Kumar
2w
🚨 Real-Time Production Questions Every Backend Developer Should Be Ready For

Building APIs is one thing… Handling production issues at 2 AM? That’s where real backend engineering begins 💥

If you’re working with technologies like Spring Boot, here are some real-world production scenarios you must be prepared for:

⸻

🔥 1. “Why is my API suddenly slow?”
• Check DB queries (slow queries, missing indexes)
• Thread pool exhaustion
• External service latency
• Enable logs + monitoring (Actuator, APM tools)

⸻

🔥 2. “Why are users getting 500 errors?”
• Unhandled exceptions
• Null pointer issues
• Downstream service failure
👉 Always implement global exception handling

⸻

🔥 3. “Why is the system crashing under load?”
• Memory leaks (heap dump analysis)
• High CPU usage
• Improper connection pooling
👉 Load testing is not optional!

⸻

🔥 4. “Data inconsistency in production?”
• Missing transactions
• Concurrent updates
• Race conditions
👉 Use proper isolation levels & locking mechanisms

⸻

🔥 5. “Why are messages not being processed?”
• Kafka/RabbitMQ consumer lag
• Offset mismanagement
• Dead letter queues ignored

⸻

💡 What I learned from production:
✔️ Logs are your best friend
✔️ Monitoring > Debugging
✔️ Always design for failure
✔️ Never assume “it won’t happen”
✔️ Write code like you’ll support it in production

⸻

🎯 Final Thought: Anyone can write code that works… But a true backend developer writes systems that survive production 🚀

⸻

💬 What’s the toughest production issue you’ve faced?

#BackendDevelopment #SpringBoot #Java #Microservices #ProductionIssues #SoftwareEngineering #Developers
Wajahat Arshad
5d
One of the most fragile parts of any backend system is depending on external APIs. We learned this the hard way.

We were integrating 3 third-party services into our platform. Payments. Notifications. Data providers. All called synchronously, one after another.

The result? If any of those providers lagged even slightly, our entire API froze. Users waited. Requests piled up. The server choked.

So we rethought the architecture completely. Here is what changed:
- Instead of calling third-party APIs directly from the request cycle, we offloaded those calls to background jobs using BullMQ
- The main server now just queues the job and immediately returns a response to the client
- A background worker handles the actual API call separately
- If the external service fails or times out, the job does not disappear. It gets pushed into a retry queue with exponential backoff and tries again automatically

The result? A 70% drop in failure rates.

The biggest mindset shift for me was this: Stop assuming your code will not fail. Start assuming the network will always fail at some point, and design your system to handle it gracefully.

Synchronous = tightly coupled = one failure breaks everything
Async + queues = decoupled = failures become recoverable events

This is not premature optimization. This is just building systems that survive the real world.

#Backend #SystemDesign #SoftwareEngineering #WebDevelopment #Architecture #NodeJS #BullMQ #API #DistributedSystems #Engineering #Tech #Programming #SoftwareDevelopment #BackendDevelopment #DevOps #Resilience #CloudComputing #Microservices #CodingLife #BuildInPublic
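
Exponential backoff comes down to a small delay calculation. The sketch below is a generic version with a cap, not BullMQ's own implementation. The class name, the 1-second base, and the 10-second cap are assumptions for illustration.

```java
// Generic capped exponential backoff (hypothetical helper, not a library API).
class Backoff {

    // Delay before retry number `attempt` (1-based): the base delay doubles on
    // each attempt and is never allowed to exceed capMillis.
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        long delay = baseMillis;
        // Stop doubling once the cap is reached so large attempt numbers stay cheap.
        for (int i = 1; i < attempt && delay < capMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println(delayMillis(attempt, 1000, 10000));
        }
        // prints 1000, 2000, 4000, 8000, 10000 (one value per line)
    }
}
```

Growing the delay gives a struggling provider time to recover instead of hitting it at a constant rate. Production setups often add random jitter on top so that many failed jobs do not retry at the same instant.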
Vasyl Oliinyk
1mo
Bulk inserting 800,000 rows is easy. Doing it without breaking everything isn’t.

I’m a frontend engineer. But large datasets force you to think like a systems engineer.

I was building a project to experiment with large datasets and pulled 800k+ earthquake records from CSV. The naive solution? Insert everything at once. But I knew what would happen:
❌ DB locks
❌ Memory spikes
❌ Blocked event loop
❌ Unstable server

So I designed it differently from the start. Instead of 800,000 inserts, I split the data into chunks of 5,000. Each chunk became a job in BullMQ.

Queue → process → insert → done.

Controlled load. Retries. Failure tracking. No silent errors. Nothing crashed. Because it was never allowed to.

The real lesson wasn’t about queues. It was this: Engineering maturity is thinking about failure before failure happens. And being “frontend” doesn’t mean you stop thinking about throughput, batching, and backpressure.

Scale isn’t a backend problem. It’s a systems problem.

How do you process large datasets without overwhelming your system?
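
The chunking step is simple to sketch. The helper below is hypothetical and uses plain integers as stand-in rows. It takes exactly 800,000 rows for the arithmetic, whereas the post's dataset was "800k+". In the post each chunk became a queue job, which is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batching helper: one chunk would become one insert job.
class Chunker {

    // Splits rows into consecutive chunks of at most chunkSize elements.
    static <T> List<List<T>> chunk(List<T> rows, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            chunks.add(rows.subList(start, end)); // a view over the original list, not a copy
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 800_000; i++) {
            rows.add(i); // stand-in for a parsed CSV record
        }
        System.out.println(chunk(rows, 5_000).size()); // prints 160
    }
}
```

With these numbers the database sees 160 bounded batches instead of one unbounded write, and a failed batch can be retried on its own.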
Krishna Koushik Unnam
2w
One of the biggest transitions from mid-level backend engineering to senior backend engineering is realizing that Garbage Collection is not just a JVM internals topic: it is a latency, scalability, and reliability topic.

In modern backend systems, every API request creates temporary objects:
• Request/response DTOs
• JSON serialization objects
• Hibernate entities and proxies
• Validation objects
• Thread-local allocations
• Logging and tracing metadata

At scale, this means millions of short-lived objects are created every minute. The JVM handles cleanup automatically, but the way it performs that cleanup can directly affect production behavior.

In distributed systems, even a 200ms GC pause can amplify into:
• Elevated API latency
• Timeout failures
• Retry storms from load balancers
• Thread pool saturation
• Cascading downstream failures

This is why GC selection is not just a JVM decision; it is an architecture decision.

My practical view of the three most important modern collectors:
• G1 GC → Best starting point for most Spring Boot and microservice workloads. Strong balance between throughput and predictable pause times.
• ZGC → Ideal for ultra-low latency systems where pause times need to stay consistently low, even with very large heaps.
• Shenandoah → Valuable for Kubernetes and cloud-native environments where workload patterns and heap pressure change rapidly.

One of the biggest mistakes engineers make is assuming lower pause times always mean better performance. That is not always true. Choosing ZGC for a smaller service can increase CPU usage without delivering meaningful latency improvements. In many cases, G1 gives better overall efficiency because the workload does not justify the extra GC overhead.

On one service I worked on, moving from Parallel GC to G1 reduced p99 latency spikes during peak traffic by more than 40%.

Garbage Collection also cannot solve poor memory hygiene. Static cache growth, ThreadLocal misuse, unbounded collections, and object retention issues will still create memory pressure regardless of the collector you choose.

Senior engineers do not just write code that works. They understand how the JVM behaves under real production traffic, how memory pressure affects latency, and how infrastructure decisions shape end-user experience.

#Java #JVM #GarbageCollection #SpringBoot #Microservices #BackendEngineering #DistributedSystems #PerformanceEngineering #Scalability #SystemDesign
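
Before comparing collectors, it helps to see which one a service is actually running and how much work it has done. The sketch below uses the standard `java.lang.management` API, and the class name is hypothetical. The collector itself is normally chosen with HotSpot flags such as `-XX:+UseG1GC`, `-XX:+UseZGC`, or `-XX:+UseShenandoahGC`, whose availability depends on the JDK build and version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that summarises the collectors registered in this JVM.
class GcReport {

    // One line per collector bean: its name, how many collections it has run,
    // and the approximate accumulated collection time in milliseconds.
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", totalMillis=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Output depends on the JVM and the flags it was started with.
        describe().forEach(System.out::println);
    }
}
```

The accumulated time is a coarse signal, and for concurrent collectors it is not the same thing as pause time. GC logs or an APM tool are the better source for pause and p99 analysis.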
Ashwina Mathur
1mo
API / BACKEND ENGINEERING (REAL-WORLD)

Top 20 API & Backend Interview Questions
→ What is REST API?
→ Difference between REST and SOAP?
→ What is API authentication?
→ What is JWT (JSON Web Token)?
→ What are HTTP methods (GET, POST, PUT, DELETE)?
→ What is rate limiting?
→ What is API versioning?
→ What is idempotency?
→ What is middleware in backend?
→ How do you design scalable APIs?
→ What is database indexing?
→ What is caching in backend?
→ What is microservices architecture?
→ What is load balancing?
→ How do you handle concurrency?
→ What is event-driven architecture?
→ How do you secure APIs?
→ What is GraphQL?
→ How do you handle errors in APIs?
→ Best practices for backend development?

Follow : Ashwina Mathur
Dhushyanth Reddy
1w
The service was up. The API was up. The database was up. Users were still complaining.

That’s why I think a lot of engineers misunderstand reliability. They think reliability means keeping components alive. It doesn’t. It means keeping the full user path stable when systems start interacting badly.

A service can be healthy. A dashboard can be green. And the product can still feel broken. Because real pain usually comes from:
- slow dependencies
- retry amplification
- fragile request paths
- components that are “up” but harmful together

Debate: What do engineers misunderstand more often?
A) scalability
B) reliability

My vote: B. What’s yours?

#Java #SpringBoot #DistributedSystems #Microservices #BackendEngineering
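
Retry amplification is easy to underestimate, so here is the worst-case arithmetic as a small sketch. The class name and the three-layer, three-attempt example are assumptions for illustration. The model assumes every layer in a call chain retries independently while the deepest dependency keeps failing.

```java
// Hypothetical worst-case model of retries stacked across a call chain.
class RetryAmplification {

    // Calls that can reach the deepest dependency when each of `layers`
    // callers makes up to `attemptsPerLayer` attempts (1 original + retries).
    static long worstCaseCalls(int attemptsPerLayer, int layers) {
        long calls = 1;
        for (int i = 0; i < layers; i++) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            calls = Math.multiplyExact(calls, (long) attemptsPerLayer);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseCalls(3, 1)); // prints 3
        System.out.println(worstCaseCalls(3, 3)); // prints 27
    }
}
```

Under these assumptions, one user request can become 27 calls against a dependency that is already slow, while every component in the chain still reports itself as "up".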

