Observability & Debugging in Distributed Systems
Monitoring vs. Observability: From Dashboards to Deep Insights
Traditional monitoring relies on predefined dashboards and alerts to signal when something is wrong. It’s effective for known issues (“known unknowns”) but often limited to surface-level symptoms. Observability, on the other hand, is about being able to ask new questions of your system and uncover the why behind problems - even those you didn’t anticipate (“unknown unknowns”). In short, monitoring tells you when something is broken, while observability helps you understand why. Observability achieves this by aggregating all the data (logs, metrics, traces) from across your stack to provide real-time, holistic insight and pinpoint root causes. An analogy: if monitoring gives you warning lights on a cockpit dashboard, observability is like having a flight recorder and radar - you get the full story of what’s happening inside the system, not just an alarm bell.
The Three Pillars of Observability are often cited as logs, metrics, and traces. Each pillar offers a different view into system behaviour: logs are detailed event records, metrics are numeric measures tracked over time (e.g. CPU usage, request rate), and traces capture end-to-end request flows through distributed services. Modern observability emphasises using these in unison and in context, rather than siloed. For example, metrics might show a spike in error rate, and tracing can then reveal which service and call caused it. Unlike old-school monitoring that might just check if a server is up or a CPU threshold is crossed, observability lets you dig deeper into the “what happened and why” of your system’s state at any moment. It’s the difference between simply knowing a car’s check-engine light is on versus hooking it up to a diagnostic tool to read the exact fault code.
A key practice is structured logging with correlation IDs. In a monolith, you might have tailed one big log file; in microservices, logs are scattered across many services. To make sense of them, engineers move to structured logs (e.g. JSON format) that include standardised fields like timestamp, service name, severity, and crucially a request or correlation ID. Every incoming request gets a unique ID that is passed along to downstream calls. All logs emitted in processing that request then carry the same ID. This way, when debugging, you can grep or query logs by the ID and reconstruct the exact journey of that transaction across services. It’s like putting a “tracking number” on each user request. Instead of wading through a wall of unstructured log text, you can instantly correlate events and see the story unfold. Adopting consistent, machine-parsable log formats and propagating trace/context IDs makes your logging far more useful for distributed debugging than old free-form print statements ever could. (Pro tip: also log contextual data like user IDs or session IDs where relevant - these enrich your logs with meaning.)
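As a minimal sketch of what this can look like in practice (using Python's standard logging module; the service name, field names, and the X-Correlation-ID header here are illustrative conventions, not a standard):

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Correlation ID for the request currently being handled.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so log aggregators can index every field."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "service": "cart-service",                 # illustrative service name
            "severity": record.levelname,
            "correlation_id": correlation_id.get(),    # the "tracking number"
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("cart-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(headers: dict) -> None:
    # Reuse the caller's ID if present; mint one at the edge otherwise.
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    logger.info("add_to_cart started")
    # ... call downstream services here, forwarding {"X-Correlation-ID": cid} ...
    logger.info("add_to_cart finished")

handle_request({"X-Correlation-ID": "req-42"})
```

Every line emitted while handling the request carries the same correlation_id, so a single query on that value reconstructs the request's journey across services.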
Why Observability Matters in Distributed Systems
In a simple monolithic application, monitoring was relatively straightforward - the app was either up or down, and you could often SSH in to inspect it. But in modern microservices, there are dozens or hundreds of moving parts. The system is no longer just “on” or “off” - it can be in one of countless states due to partial failures and complex interactions. In practice, system behaviour becomes emergent: one service might be slow or failing in isolation, yet the overall application still limps along, just with degraded performance. It’s much harder to correlate what a user experiences (“the site is slow”) to which backend service is misbehaving. As one engineer described it, in a microservice world “uptime” isn’t a single number anymore - if the system is slow, which part is slow? The frontend? An upstream service? The database? Or some combination? Partial failures are expected - portions of the system can fail while others continue working - and these failures can be non-deterministic and hard to detect with simplistic monitors. A microservice architecture is often operating in a state of graceful degradation, which makes root cause analysis a real challenge without deep visibility.
A small issue in one component can ripple into a major incident. For example, imagine an e-commerce site with separate services for the cart, orders, payments, etc. One day, a database query in the “add-to-cart” service suddenly slows down by 200ms. That small latency spike causes requests to that service to queue up. Soon the upstream Cart API (which calls the service) hits its thread pool limits, and its responses to users become slow or time out. This propagates upward - users experience sluggishness or errors on the site, and other services retry calls, further amplifying the load. A seemingly minor slow query in one microservice ends up cascading through the system and degrading the entire app’s performance. Traditional monitoring might catch symptoms - e.g. an alert on high latency in the Cart API or a surge in HTTP 500 errors - but it won’t tell you the underlying cause. You’d see a bunch of red lights on a dashboard but not know which of the dozens of internal calls is the culprit. In our example, a basic dashboard might show that the “Add to Cart” page is slow and that CPU is high on one service, but it doesn’t automatically reveal why. This is where observability shines: because you have instrumented the system to collect granular traces and logs, you could pull up a distributed trace of a slow request and see that the AddToCartService -> DB query span took 500ms instead of 50ms. Indeed, observability tools can trace the entire transaction across services and pinpoint “service A3’s database call is the bottleneck”. With that context, engineers can immediately zero in on the root cause (maybe a missing index in the database or a code regression in that service) instead of guessing across myriad components. In complex distributed environments, observability is crucial because failures are often partial and hidden - the system might be up, but unwell - and only by correlating signals (metrics, logs, traces) can you unravel the chain of events that led to an incident.
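To show the kind of instrumentation that produces such a trace, here is a rough sketch using the OpenTelemetry Python SDK (the opentelemetry-api and opentelemetry-sdk packages); the service name, span names, and the simulated query are invented for this example:

```python
import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print finished spans to stdout; a real setup would export them to a tracing backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("add-to-cart-service")

def add_to_cart(user_id: str, item_id: str) -> None:
    with tracer.start_as_current_span("AddToCart") as span:
        span.set_attribute("user.id", user_id)
        # The child span's duration is what reveals a 50ms query that has become 500ms.
        with tracer.start_as_current_span("db.insert_cart_item") as db_span:
            db_span.set_attribute("db.statement", "INSERT INTO cart_items ...")
            time.sleep(0.5)  # stand-in for the slow database call

add_to_cart("user-123", "sku-42")
```

In a trace view, the parent AddToCart span and its child db.insert_cart_item span make it obvious where the time went - exactly the question a dashboard alone could not answer.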
Tools and Patterns for Observability at Scale
Building robust observability involves a combination of tools and best practices. Seasoned engineers don’t just rely on one tool - they assemble a stack that covers metrics, logs, and tracing, often using open standards to avoid reinventing the wheel.
Example: A distributed trace for an “Add to Cart” operation. Each service (Cart App, Cart Service, Cart Commit) records a span with a shared Trace ID (611886bf5382723a). The trace is composed of spans linked by parent-child relationships (note how each span shows its parent ID). This end-to-end trace allows engineers to see the entire request path and timing across microservices.
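For context, here is a minimal sketch (using the OpenTelemetry propagation API) of how that shared Trace ID and the parent-child links come about: the caller injects its span context into outgoing headers, and the callee extracts it so its span joins the same trace. The function and service names are hypothetical:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("cart-app")

def call_cart_service(send_http_request) -> None:
    # Caller side: start a span, then inject its context into the outgoing headers.
    with tracer.start_as_current_span("cart-app.add_to_cart"):
        headers: dict = {}
        inject(headers)              # adds the W3C 'traceparent' header (trace ID + span ID)
        send_http_request(headers)   # hypothetical HTTP call to the Cart Service

def handle_add_to_cart(incoming_headers: dict) -> None:
    # Callee side: extract the context so this span shares the Trace ID
    # and records the caller's span as its parent.
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("cart-service.add_to_cart", context=ctx):
        ...  # do the work; downstream calls repeat the same inject/extract step
```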
Debugging Distributed Systems Like a Senior Engineer
When production issues strike, a junior engineer might feel like they’re flying a plane through fog - alarms blaring and no clear idea where the problem lies. A senior engineer, by contrast, approaches debugging with a calm, methodical mindset, leveraging observability to systematically narrow down the issue. It’s very much like a detective solving a mystery: gather clues, form hypotheses, and eliminate possibilities, all while keeping a cool head.
Start broad, then narrow the search space. In a complex outage, there may be dozens of signals screaming for attention. A seasoned troubleshooter will resist the urge to panic or jump to conclusions. Instead, they check the key known unknowns first: the high-level dashboards (e.g. are error rates spiking globally or just in one service? Is it all users or just a region? Is the database CPU maxed out?). This helps scope the problem - is it front-end, back-end, a specific dependency? Then they methodically drill down. A useful strategy is binary search: keep splitting the problem space to isolate the fault. For example, if you suspect a latency issue, test half the system by calling a downstream service directly - is it fast or slow? If slow, go deeper into that half; if not, focus elsewhere. By iteratively halving the “search space” of possible causes, you converge on the culprit quickly. Throughout this, a senior engineer is forming hypotheses (“Could it be the recent deployment on service X? Or a spike in traffic? Maybe a memory leak?”) and then using data to confirm or refute each. They leverage those rich observability tools: check trace logs for a common thread (maybe all slow requests share a specific userId or all error traces point to one dependency), pull up logs for error IDs, and use dashboards as a guide, not the gospel. Importantly, they know when to stop and gather more data. If something doesn’t add up, they might add custom instrumentation in real-time (in some cases using dynamic tracing tools or flipping on more verbose logging via a feature flag) to illuminate the dark corners. This is the “known unknowns vs unknown unknowns” balance - you investigate the known suspects, but you’re prepared to explore new angles when the usual checks don’t pan out.
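One way to make that binary-search step concrete is to probe a suspected dependency directly and compare its latency against what you consider normal. A rough sketch with Python's standard library - the endpoint and the 100ms baseline are made up for illustration:

```python
import time
import urllib.request

def probe(url: str, baseline_ms: float, samples: int = 5) -> None:
    """Time a few direct calls to one dependency to decide which half of the system to dig into."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        urllib.request.urlopen(url, timeout=5).read()
        timings.append((time.perf_counter() - start) * 1000)
    median = sorted(timings)[len(timings) // 2]
    verdict = "slow - dig into this half" if median > 2 * baseline_ms else "normal - look elsewhere"
    print(f"{url}: median {median:.0f}ms (baseline {baseline_ms:.0f}ms) -> {verdict}")

# Hypothetical internal endpoint and its usual latency.
probe("http://cart-service.internal/cart/health-db", baseline_ms=100)
```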
Stay calm under fire. It sounds cliché, but maintaining poise is a skill. Rattled engineers can thrash - changing too many things at once, or missing obvious clues. Experienced engineers treat incidents like a doctor treating an ER patient: follow the ABCs (in our context, check the basics like CPU, memory, network first), stabilise the patient (mitigate immediate impact), and then diagnose deeper. They use runbooks and prior lessons - perhaps this failure mode has happened before. They don’t make random changes out of desperation; any mitigation (like rolling back a release or diverting traffic) is deliberate and measured. A calm demeanour also helps the team avoid the “blame game” and focus on facts, which shortens the time to resolution. Culturally, seniors often foster a blameless approach: the system is broken, not “Person X’s fault,” so everyone can freely point out clues without fear. This creates an environment where all data is considered.
Use progressive delivery techniques to your advantage. Many organisations today employ canary releases and feature flags, which are as much operational tools as development practices. A canary release means when you deploy a new version of a service, you first roll it out to a small subset of users or servers and watch it carefully. The term comes from coal miners using canary birds to detect toxic gas - here, a small user subset (“the canary”) experiences the change, and if something’s off (errors, latency) you detect it before all users do. If the canary version has issues, you can immediately roll it back or fix forward, limiting the blast radius of a bad change. Feature flags (toggles) allow you to deploy code into production but hide the new features behind a flag. You can then turn the feature on for a small group of users or turn it off instantly if it misbehaves - without redeploying code. This decouples deployment from release. In practice, during an incident, if you suspect a particular new feature is causing trouble, you can flip it off via the flag config and verify if the system stabilises. Dark launches are a related concept: you release a new service or feature to production but route only a trickle of traffic to it (or none to end-users at all) just to observe its behaviour under real load. It’s “dark” because the end-user isn’t aware of it - the service might be processing real requests in shadow mode, not impacting the user, and you’re gathering metrics on it. This is fantastic for testing things like a new algorithm’s performance in production conditions without risking customer experience. Senior engineers leverage these patterns to mitigate and debug issues. For instance, if a new service is suspected of causing cascading failures, they might dark-launch it (sending a copy of production traffic to it but not letting its responses go live) to reproduce the issue safely and gather observability data, all while the flag remains off to users. They also plan canary deployments such that any significant change can be quickly compared against the baseline - if metrics regress, the canary is killed fast. All these techniques help de-risk changes and provide valuable data for debugging. They give you controlled experiments in the live system.
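As an illustration of the feature-flag half of this, here is a minimal in-process sketch in Python; real deployments typically read flags from a flag service or shared config store, and the flag name, file path, and percentage-rollout scheme below are all assumptions for the example:

```python
import hashlib
import json
import pathlib

FLAG_FILE = pathlib.Path("/etc/myapp/flags.json")    # hypothetical config source, re-read per request

def flag_enabled(name: str, user_id: str, default: bool = False) -> bool:
    """Look up a flag each time so it can be flipped off instantly, without a redeploy."""
    try:
        flags = json.loads(FLAG_FILE.read_text())
    except (OSError, ValueError):
        return default                                # fail closed if the store is unreachable
    flag = flags.get(name)
    if flag is None:
        return default
    # Deterministic percentage rollout: the same user always lands in the same bucket.
    bucket = int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % 100
    return flag.get("enabled", False) and bucket < flag.get("rollout_percent", 100)

def price_order(user_id: str) -> str:
    if flag_enabled("new_pricing_engine", user_id):
        return "new pricing path"                     # the canary cohort
    return "stable pricing path"                      # everyone else; flip the flag to revert instantly
```

During an incident, setting the flag's enabled field to false (or dropping rollout_percent to 0) in the store takes the suspect code path out of the request flow immediately - exactly the lever described above.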
Don’t drown in data - prioritise signal. With great observability comes great volumes of telemetry. A savvy debugger knows how to filter out noise. This might mean temporarily raising log levels for just one component or using sampling. For example, if you’re inundated with thousands of error logs per second, you might sample (record) only 1% of them - but ensure those contain representative context - so you can actually inspect one without timing out your log viewer. Many observability stacks allow dynamic log level changes or trace sampling, which you can use during an incident to get the info you need without overwhelming the system or yourself. It’s a bit like a detective deciding which clues are worth following; you can’t interview every single witness in town, so you focus on the ones likely to have the info you need. Over time, experienced engineers develop an intuition for which metrics or logs are most relevant for each class of problem (e.g. database-related issue vs network latency vs code regression). They also keep an eye on the unknowns: if a hypothesis isn’t panning out, they loop back and reconsider, rather than tunnel-visioning on one dashboard. This adaptability is key.
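For example, a sampling filter on Python's standard logging module might look like the following sketch; the 1% rate and logger name are placeholders, and in a real setup the rate would come from configuration so it can be changed during an incident:

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep only a fraction of records so a flood of identical errors doesn't drown the pipeline."""
    def __init__(self, sample_rate: float = 0.01) -> None:
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        keep = random.random() < self.sample_rate
        if keep:
            # Mark kept records so readers know they are looking at a sample.
            record.msg = f"[sampled at rate {self.sample_rate}] {record.msg}"
        return keep

logger = logging.getLogger("payments")
logger.addHandler(logging.StreamHandler())
logger.addFilter(SamplingFilter(0.01))   # record roughly 1 in 100 of these errors

for order_id in range(10_000):
    logger.error("payment provider timeout for order %s", order_id)
```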
Finally, keep your cool. In the middle of a midnight outage, the best engineers project calm. They communicate what they’re checking, keep colleagues in the loop, and avoid making things worse by rushing. They use the scientific method: change one thing at a time, see if it helps. They understand that incidents are inevitable in complex systems (you can’t prevent every failure), so they treat each one as a learning opportunity rather than a personal failure. Post-incident, they’ll often add new monitors or traces (or even write a postmortem) so that exact issue won’t fool them twice - this is how systems (and people) get more resilient over time. Debugging a live distributed system is challenging, but with an observability mindset and steady nerves, it becomes a detective game you can systematically win.
Common Observability Anti-Patterns (and How to Avoid Them)
Like any engineering practice, observability can be done well - or poorly. In pursuit of becoming 10x engineers, it’s important to recognise some anti-patterns that teams fall into: hoarding mountains of telemetry that nobody turns into insight, alerting on everything until every alert is ignored, and leaving the code paths that matter most uninstrumented - plenty of data overall, but none where it actually counts.
In summary, avoid the traps of too much data with too little insight, and not enough data where it counts. Observability should be seen as an ongoing commitment, much like testing. It’s not glamorous - setting up dashboards or writing trace instrumentation doesn’t directly ship new features - but it pays dividends the first time you save hours (or days) troubleshooting a hairy issue. As Martin Kleppmann notes in Designing Data-Intensive Applications, complex systems fail in complex ways, so we must arm ourselves with information to untangle those failures. The goal is clarity: to quickly understand your system’s behaviour. Every alert, log, or metric should have a reason to exist and a playbook attached. If you nurture your observability (and trim the excess), it becomes a superpower: you’ll debug faster, deploy with confidence, and maybe even sleep better when on-call.
Next time you’re working on a service, ask yourself - if this breaks at 2 AM, will I be able to quickly find out why? If the answer is “not sure,” take some time to add that log, that metric, or that trace span today. Your future self (and your teammates) will thank you when the inevitable hiccups happen. Observability is an investment in your system’s maintainability. And when it comes to distributed systems at scale, maintainability and debuggability are just as important as raw performance or throughput. By mastering observability, you’re not only solving today’s bug, you’re building a culture and architecture that can continually improve and handle the unknown unknowns of tomorrow.