Why I trust Observability more than Assumptions

One lesson that became more obvious to me over time is this: the system we describe in design discussions is usually much cleaner than the system that actually exists in production.

As engineers, we naturally make assumptions:
• We assume a query will be fast.
• We assume a dependency is not the bottleneck.
• We assume users will follow the expected flow.
• We assume a timeout value is probably enough.
• We assume the issue is where we first noticed the symptom.

Assumptions help us start. But in production, they are not enough.

What changed my thinking was realizing how often production tells a different story:
• Sometimes we assume the database is slow, but the real latency is coming from a downstream service.
• Sometimes we think a feature issue is a frontend problem, but the actual cause is inconsistent data or a failing dependency.
• Sometimes we believe a rollout is safe, until real traffic, retries, and edge cases prove otherwise.

That is why I trust observability more. Not because assumptions have no value, but because observability gives us something better: evidence.

Good observability is not just "having logs." It is being able to understand what is happening from the outside, through meaningful logs, useful metrics, traces across flows, clear alerts, and enough context to debug without guessing.

And the more I worked on production systems, the more I realized this: observability does not just help during incidents. It changes how you build.
• You design better error handling.
• You add better context to logs (see the sketch at the end of this post).
• You think more carefully about failure paths.
• You make rollouts safer.
• You reduce the time between "something is wrong" and "we know why."

For me, that is one of the biggest shifts in engineering maturity.

Assumptions help us move fast. Observability helps us move correctly.

In production, confidence is useful. Evidence is better.

#SoftwareEngineering #Observability #BackendEngineering #SystemDesign #ProductionEngineering #Java #SpringBoot #SRE
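What "better context in logs" can look like in practice: a minimal sketch using SLF4J's MDC (Mapped Diagnostic Context). The class, method, and key names here are illustrative, not from any specific codebase:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);

    public void handle(String orderId, String userId) {
        // Attach request-scoped context so every log line can be correlated later
        MDC.put("orderId", orderId);
        MDC.put("userId", userId);
        try {
            log.info("checkout started");
            // ... business logic ...
            log.info("checkout completed");
        } finally {
            MDC.clear(); // avoid leaking context across pooled threads
        }
    }
}
```

With a log pattern that prints MDC values (for example `%X{orderId}`), every line from one request carries the same identifiers, which is exactly the "enough context to debug without guessing" described above.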
## More Relevant Posts
Everyone is building fast with Spring Boot… but silently killing their architecture with one keyword 👇

👉 `new`

Looks harmless, right? But this one line can destroy **scalability, testability, and flexibility**.

---

## ⚠️ Real Problem

You write this:

```java
@Service
public class OrderService {
    private PaymentService paymentService = new PaymentService(); // ❌
}
```

Works fine. No errors. Ship it 🚀

But now ask yourself:
* Can I mock this in testing? ❌
* Can I switch implementations easily? ❌
* Is Spring managing this object? ❌

You just bypassed **Dependency Injection** completely.

---

## 💥 What actually went wrong?

You created **tight coupling**. Now your code is:
* Hard to test
* Hard to extend
* Painful to maintain

---

## ✅ Correct Way (Production Mindset)

```java
@Service
public class OrderService {
    private final PaymentService paymentService;

    public OrderService(PaymentService paymentService) {
        this.paymentService = paymentService;
    }
}
```

Now:
✔ Loose coupling
✔ Easy mocking (see the test sketch after this post)
✔ Fully Spring-managed lifecycle

---

## 🚨 Sneaky Mistake (Most devs miss this)

```java
public void process() {
    PaymentService ps = new PaymentService(); // ❌ hidden violation
}
```

Even inside methods, it's still breaking DI.

---

## 🧠 Where `new` is ACTUALLY OK

✔ DTOs / POJOs
✔ Utility classes
✔ Builders / Factory pattern

---

## ❌ Where it's NOT OK

✖ Services
✖ Repositories
✖ External API clients
✖ Anything with business logic

---

## ⚡ Reality Check

In the AI era, anyone can generate working code. But production-ready engineers ask:
👉 "Who is managing this object?"
👉 "Can I replace this tomorrow?"
👉 "Can I test this in isolation?"

---

## 🔥 One Line to Remember

> If you are using `new` inside a Spring-managed class…
> you are probably breaking Dependency Injection.

---

Stop writing code that just works. Start writing code that survives.

#Java #SpringBoot #CleanCode #SystemDesign #Backend #SoftwareEngineering
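To make the "easy mocking" claim concrete, here is a minimal test sketch assuming JUnit 5 and Mockito. The `process()` and `charge()` methods are hypothetical, used only to show the seam that constructor injection creates:

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.jupiter.api.Test;

class OrderServiceTest {

    @Test
    void processChargesThroughPaymentService() {
        // No Spring context needed: constructor injection lets us hand in a test double
        PaymentService payment = mock(PaymentService.class);
        OrderService service = new OrderService(payment);

        service.process(); // hypothetical method that delegates to PaymentService

        verify(payment).charge(); // charge() is illustrative, not a real API
    }
}
```

With the `new PaymentService()` field initializer, there is no seam to inject this mock at all; that is the testability cost the post warns about.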
---
"Understanding the definition ≠ the ability to implement it cleanly in real code."

This line hit me hard today. We often feel confident after:
✔️ Reading docs
✔️ Watching tutorials
✔️ Understanding concepts

But the real test begins when you **write actual code**.

🔍 What I realized

I knew `Optional`, `map`, and `flatMap`; conceptually, all clear. But during a simple login implementation:
→ Fetch user
→ Extract email

I ended up with:

```java
Optional<Optional<String>>
```

That moment exposed the gap:
👉 I understood the *definition*
👉 But not the *application*

---

### ⚠️ The Hidden Trap

```java
findUserById(id)
    .map(user -> Optional.ofNullable(user.getEmail()))
```

This creates nesting. The code works… but the design becomes messy.

✅ The Clean Way

```java
findUserById(id)
    .flatMap(user -> Optional.ofNullable(user.getEmail()))
```

Now it's:
✔️ Clean
✔️ Composable
✔️ Readable

(A fully runnable comparison is sketched after this post.)

🧠 The Real Learning

Concepts are just the starting point. Real growth happens when:
* You hit confusion
* You debug deeply
* You refine your thinking

🚀 Takeaway

Don't stop at *"I understand this"*. Push until you can say:
👉 *"I can apply this cleanly in production code."*

Because in engineering: clarity in thinking → simplicity in code.

#Java #SpringBoot #CleanCode #BackendDevelopment #LearningJourney
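A self-contained sketch of the difference. `findUserById`, the `User` record, and the sample data are illustrative stand-ins for the post's login scenario:

```java
import java.util.Map;
import java.util.Optional;

public class OptionalNestingDemo {

    record User(String id, String email) {}

    // Sample data: a user whose email is missing
    static final Map<String, User> USERS = Map.of("u1", new User("u1", null));

    static Optional<User> findUserById(String id) {
        return Optional.ofNullable(USERS.get(id));
    }

    public static void main(String[] args) {
        // map wraps the already-Optional result one level too deep
        Optional<Optional<String>> nested =
                findUserById("u1").map(u -> Optional.ofNullable(u.email()));

        // flatMap flattens, giving the type you actually wanted
        Optional<String> email =
                findUserById("u1").flatMap(u -> Optional.ofNullable(u.email()));

        System.out.println(nested); // Optional[Optional.empty]
        System.out.println(email);  // Optional.empty
    }
}
```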
---
Most systems don't fail because of complexity; they fail because of inconsistency. When every API speaks a different language, debugging becomes guesswork and scaling becomes chaos.

In the microservices architectures used by Netflix, Amazon, and many others, response standardization is a foundational design decision, not just a coding preference. As shown in the architecture, each endpoint returns a common base response while extending it for specific needs. This ensures uniform communication across layers without sacrificing flexibility.

Here's how standardization is achieved and why it matters:
• Define a base response model (e.g., success flag, message) shared across all endpoints
• Extend it using inheritance or composition to include endpoint-specific data (userID, conversationID, lists) — a minimal sketch follows after this post
• Enforce a consistent response structure at the endpoint layer, regardless of internal logic
• Separate concerns by keeping response shaping independent from business logic

It's not just about consistent responses and SOLID principles; the benefits are substantial, making complex systems simple for end users (abstraction at scale):
• Predictable API contracts → easier frontend integration
• Faster debugging → uniform error handling and logs
• Reduced duplication → centralized response structure
• Scalability → new features plug into an existing contract seamlessly

In essence, standardized responses act as a contract of trust between services and consumers, enabling systems to evolve without breaking.

How do you ensure consistency in your APIs as systems grow in complexity? Let's talk about your way to standardize API designs.

Follow Vishu Kalier for more such architectural deep dives about System Design and real-world systems.

#SystemDesign #Microservices #BackendEngineering #APIDesign #SpringBoot #SoftwareArchitecture #ScalableSystems #Java #DesignPatterns
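A minimal Java sketch of the base-response pattern described above. The class and field names (`ApiResponse`, `UserResponse`) are illustrative, not from any specific system:

```java
// Shared base contract: every endpoint returns at least these fields.
// (Classes are package-private so the sketch compiles as a single file.)
class ApiResponse {
    private final boolean success;
    private final String message;

    ApiResponse(boolean success, String message) {
        this.success = success;
        this.message = message;
    }

    public boolean isSuccess() { return success; }
    public String getMessage() { return message; }
}

// An endpoint-specific response extends the base instead of reinventing it
class UserResponse extends ApiResponse {
    private final String userId;

    UserResponse(boolean success, String message, String userId) {
        super(success, message);
        this.userId = userId;
    }

    public String getUserId() { return userId; }
}
```

Every controller then returns some subtype of `ApiResponse`, so consumers can always rely on the success/message envelope regardless of which endpoint they call.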
---
100+ developers. 3+ coding agents per developer running in parallel with tools like Cursor and Claude. Each agent submitting multiple PRs per day. Now add a microservices dependency graph of 50+ services. Try to give each of those PRs a full-stack copy of your environment for validation. Duplicating 50 services per PR isn't a cost problem. It's a physics problem. Spin-up time alone kills the feedback loop agents need to iterate. The answer isn't more staging environments or longer queues. It's deploying only the changed services into lightweight isolated environments that share baseline dependencies. Fast. Parallel. On demand. Full-stack replication can't survive the collision of enterprise microservices and agent-scale concurrency. The math doesn't work. It never will.
---
Week 2 Recap — 7 Concepts That Actually Matter in Real-World Systems

Two weeks in. 7 concepts. And every single one solves a real production problem 👇

Let's break it down:

🔹 1. Backend internals most devs misunderstand

@Transactional is a proxy, not magic. Internal method calls bypass it, and private methods don't trigger it. That "random" data inconsistency bug? This is often why. (See the self-invocation sketch after this post.)

Angular Change Detection (OnPush): the default strategy checks everything on every interaction. Switch to OnPush + immutability + the async pipe → ~94% fewer checks.

👉 This is the difference between "it works" and "it scales."

🔹 2. Data & security fundamentals at scale

Database Indexing
Without an index → full table scan (millions of rows)
With an index → milliseconds
Same query. Completely different system behavior.

JWT Reality Check
JWT ≠ encryption. It's just Base64-encoded → anyone can read it.
Use httpOnly cookies, short expiry, and refresh tokens. And never put sensitive data inside.

👉 Most performance issues and auth bugs come from ignoring these basics.

🔹 3. Distributed systems patterns that save you in production

Node.js Streams
Loading a 2GB file into memory = server crash. Streams process chunk by chunk (~64KB), with built-in backpressure handling as a bonus.

SAGA Pattern
You can't roll back across microservices, so you design compensating actions instead. Every service knows how to undo itself.

👉 Distributed systems don't fail "if"; they fail "how". These patterns handle that.

🔹 4. Architecture that simplifies everything

API Gateway
One entry point for all clients. Centralized auth, logging, rate limiting. Aggregates multiple calls into one.

👉 Cleaner clients. Safer backend. More control.

📊 What this looks like in the real world:
8s → 12ms query time
~94% fewer unnecessary UI checks
~64KB RAM for huge file processing
0 DB lookups for JWT validation
1 client call instead of many

14 days. 14 posts. 7 concepts. No theory. Just things that break (or save) real systems.

Which one changed how you think about building systems? 👇

#BackendDevelopment #SoftwareDeveloper #Programming #Coding #DevCommunity #Tech #TechLearning #LearnToCode
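A minimal sketch of the @Transactional self-invocation pitfall mentioned above. `AccountService` and its methods are illustrative:

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class AccountService {

    public void outer() {
        // This call goes through 'this', not through the Spring proxy,
        // so the @Transactional on inner() is silently ignored.
        inner();
    }

    @Transactional
    public void inner() {
        // When reached via outer(), this runs WITHOUT a transaction.
        // When called from another bean (through the proxy), it runs WITH one.
    }
}
```

Common fixes include moving `inner()` to a separate bean or restructuring so the call crosses a bean boundary; the key point is that the transaction boundary lives in the proxy, not in the method itself.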
---
Last night was a reminder.

Everything was perfect.
✔️ APIs returning 200
✔️ Queries optimized
✔️ Latency within limits

Then we deployed to production… and suddenly, everything broke. No obvious errors. No clear failures. Just a system that *looked healthy* but wasn't.

In backend engineering, I've learned this the hard way:
👉 If it works locally, it means nothing.
👉 If it works in production, it means everything.

---

**What actually goes wrong in production?**

It's rarely your business logic. It's the things we underestimate:
• Configuration drift between environments
• Connection pool exhaustion under real traffic
• Hidden race conditions in singleton beans
• Network latency & downstream dependencies
• Data scale (100 rows vs 10 million rows)
• Missing timeouts → threads silently stuck (see the timeout sketch after this post)

**The biggest pitfall?**

We validate systems in isolation… but production runs them in **chaos**. Multiple instances. Concurrent users. Unpredictable load. That's where real engineering begins.

**My debugging mindset (war-room mode):**
1. What changed between local and prod?
2. Are we observing the system, or assuming?
3. Can I trace one request end-to-end?
4. Is the system stateful where it shouldn't be?
5. Are failures silent (timeouts, retries, thread blocks)?

---

**Key takeaway:**

Don't just write code that works. Design systems that survive.

---

Production doesn't test your code. It tests your assumptions.

#BackendEngineering #SpringBoot #DistributedSystems #ProductionIssues #SoftwareArchitecture #Learning #EngineeringMindset
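On the "missing timeouts" point: a minimal sketch of explicit HTTP client timeouts using Spring Boot's RestTemplateBuilder. The durations are illustrative, and the setter names follow Boot 2.x/3.x (newer versions rename them to `connectTimeout`/`readTimeout`):

```java
import java.time.Duration;

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
class HttpClientConfig {

    @Bean
    RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder
                // Fail fast on unreachable hosts instead of parking the thread
                .setConnectTimeout(Duration.ofSeconds(2))
                // Bound how long a slow downstream can hold a request thread
                .setReadTimeout(Duration.ofSeconds(5))
                .build();
    }
}
```

Depending on the underlying client, the default can mean waiting indefinitely, which is exactly the "threads silently stuck" failure mode above.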
---
New piece from Ernesto Spruyt on how the work of software developers is changing in 2026. The production of code has gotten cheap. The architectural, specification, and review work around it has become more expensive. That shift has implications for how teams are built and where value accumulates. https://lnkd.in/eF2NRm7h
---
Logging is one of those things we all use… but rarely design properly.

A question I keep seeing:
👉 When should logging be async?

---

🔷 When async logging makes sense

Use async logging when:
- High-throughput systems (trading, APIs under heavy load)
- Logging is frequent (INFO/DEBUG in hot paths)
- You want to reduce request latency

👉 Idea: don't block the main thread just to write logs

---

🔷 When NOT to use it

- Low-traffic systems → adds unnecessary complexity
- Critical logs (failures, audits) → risk of losing logs on crash
- Debugging issues where ordering matters strictly

---

🔷 Log Levels & Performance Impact

Not all logs are equal:
- DEBUG / TRACE 🚨 Very expensive if enabled in production → string building + high frequency
- INFO ⚖️ Moderate cost → safe if meaningful (not noisy)
- WARN / ERROR ✅ Low frequency, high value → always keep

👉 Biggest mistake: logging too much, not logging smart

---

🔷 Async Logging Trade-offs

Pros:
- Lower latency
- Better throughput
- Non-blocking I/O

Cons:
- Possible log loss on crash
- More complex configuration
- Harder debugging (timing/order)

---

🔷 Key Config Concepts (Async)

1. Queue Size (Buffer)
- Holds logs before writing
- Small → safer, but may block
- Large → faster, but risks memory pressure

2. Discard Threshold
- When the queue is almost full → drop low-level logs (DEBUG/INFO)
- Protects the system under load

3. Blocking vs Non-blocking
- Blocking → safer (no loss), but impacts performance
- Non-blocking → faster, but may drop logs

👉 This is always a trade-off between performance and reliability (a configuration sketch follows after this post)

---

🔷 Real-world mindset

- For critical systems → keep ERROR/WARN reliable (sync or protected async)
- For high-volume logs → use async + a discard strategy
- For debugging → enable selectively, not globally

---

🔷 Final Thought

Logging is not just "print statements". It's part of your system design.

Good engineers log. Great engineers design logging with intent and trade-offs.

#Backend #Java #SpringBoot #SystemDesign #Performance #TechLeadership
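In Logback, these knobs map to AsyncAppender settings, normally set in logback.xml. Here is a programmatic sketch so the queue size, discard threshold, and blocking behavior are visible in one place; the appender name "console" assumes Logback's default setup and will differ per configuration:

```java
import ch.qos.logback.classic.AsyncAppender;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.Appender;
import org.slf4j.LoggerFactory;

public class AsyncLoggingSetup {

    public static void main(String[] args) {
        LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();
        Logger root = ctx.getLogger(Logger.ROOT_LOGGER_NAME);

        // Wrap whatever appender is currently doing the (blocking) writes
        Appender<ILoggingEvent> console = root.getAppender("console");

        AsyncAppender async = new AsyncAppender();
        async.setContext(ctx);
        async.setQueueSize(8192);           // queue size: bigger = more buffering, more memory
        async.setDiscardingThreshold(819);  // below ~10% free capacity, drop TRACE/DEBUG/INFO
        async.setNeverBlock(true);          // non-blocking: may lose logs rather than stall callers
        async.addAppender(console);
        async.start();

        root.detachAppender(console);
        root.addAppender(async);

        root.info("this line now goes through the async queue");
    }
}
```

Setting `neverBlock(false)` instead flips the trade-off: no loss, but callers can stall when the queue fills, which is exactly the performance-vs-reliability decision described above.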
---
One small mistake in code can teach you more about software architecture than reading ten design pattern books.

Early in my career, I wrote a simple check like this:

`if (userId.equals("admin")) { ... }`

It worked perfectly in testing. It worked in staging. Then one day in production: boom, NullPointerException.

The reason? `userId` was null for one edge-case request.

That day I learned a lesson I never forgot: `"admin".equals(userId)` is not just a syntax change. It is defensive programming. It is thinking about failure before it happens. It is an architecture mindset, not just coding. (A short before/after sketch follows after this post.)

Good developers write code that works. Experienced developers write code that still works when things go wrong. Architects design systems assuming everything will eventually go wrong.

This applies everywhere:
* Null checks
* Retry mechanisms
* Idempotency
* Circuit breakers
* Caching
* Database indexing
* Distributed systems
* Concurrency

Architecture is not only about microservices, Kafka, Kubernetes, or system diagrams. Architecture is about anticipating failure, edge cases, scale, and human mistakes.

Most production issues don't happen because of complex algorithms. They happen because of small assumptions like:
* This value will never be null
* This API will always respond
* This query will always be fast
* This service will never fail
* This user will never send wrong data

Real engineering starts when you stop assuming and start defending.

Write code like production will try to break it. Because one day, it will.

#connection #learn
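A minimal before/after sketch of the fix. `grantAdminAccess()` is a hypothetical placeholder:

```java
import java.util.Objects;

public class AdminCheck {

    static void grantAdminAccess() { /* hypothetical placeholder */ }

    static void check(String userId) {
        // ❌ NPE-prone: throws NullPointerException when userId is null
        // if (userId.equals("admin")) { grantAdminAccess(); }

        // ✅ Defensive: "admin".equals(null) is simply false
        if ("admin".equals(userId)) {
            grantAdminAccess();
        }

        // ✅ Equally safe and more explicit (java.util.Objects, JDK 7+)
        if (Objects.equals(userId, "admin")) {
            grantAdminAccess();
        }
    }

    public static void main(String[] args) {
        check(null);    // no exception: access is simply not granted
        check("admin"); // both null-safe checks match
    }
}
```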
---
After optimizing 50+ codebases, I've noticed the pattern: most developers default to adding complexity when the answer is subtraction.

The counterintuitive fix? Write less code.

---

Here's what I've seen repeatedly:

A typical "optimization" starts with adding a caching layer, then a queue system, then monitoring. The codebase grows 40% before performance improves 5%. Meanwhile, the real culprit was a single N+1 query or an inefficient loop that took 2 hours to fix.

When code feels slow or brittle, developers instinctively reach for frameworks, libraries, or architectural patterns. It feels productive. But I've watched teams spend weeks building abstractions for problems that didn't exist yet.

One client had 12,000 lines of utility functions. We deleted 8,000 of them. Performance improved. Bugs decreased. New developers onboarded faster.

The math is simple: fewer lines mean fewer bugs, faster debugging, easier maintenance. But it requires discipline. Deletion feels like admitting the original approach was wrong.

The teams that ship faster aren't writing more; they're writing less and making every line count.

What's the most unnecessary complexity you've removed from a codebase that actually made things better?