Understanding System Observability

Explore top LinkedIn content from expert professionals.

  • View profile for Julia Furst Morgado

    Polyglot International Speaker | AWS Container Hero | CNCF Ambassador | Docker Captain | KCD NY Organizer

    23,175 followers

    Imagine you’re driving a car with no dashboard — no speedometer, no fuel gauge, not even a warning light. In this scenario, you’re blind to essential information that indicates the car’s performance and health. You wouldn’t know if you’re speeding, running out of fuel, or if your engine is overheating until it’s potentially too late to address the issue without significant inconvenience or danger.

    Now think about your infrastructure and applications, particularly when you’re dealing with a microservices architecture. That’s where monitoring comes into play. Monitoring serves as the dashboard for your applications. It helps you keep track of metrics such as response times, error rates, and system uptime across your microservices. This information is crucial for detecting problems early and ensuring smooth operation. Monitoring tools can alert you when a service goes down or when performance degrades, much like a warning light or gauge on your car dashboard.

    Observability goes a step further: it allows you to understand why things are happening. If monitoring alerts you to an issue, like a warning light on your dashboard, observability tools help you diagnose the problem. They provide deep insights into your systems through logs (detailed records of events), metrics (quantitative data on performance), and traces (the path that requests take through your microservices).

    Just as you wouldn’t drive a car without a dashboard, you shouldn’t deploy and manage applications without monitoring and observability tools. They are essential for ensuring your applications run smoothly, efficiently, and without unexpected downtime. By keeping a close eye on the performance of your microservices, and understanding the root causes of any issues that arise, you can maintain the health and reliability of your services — keeping your “car” on the road and your users happy.
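    A minimal sketch of those three signals (logs, metrics, traces) using the OpenTelemetry Python API; the service and span names are illustrative, and without a configured exporter these calls are no-ops:

    ```python
    # Minimal sketch: emitting the three observability signals (logs, metrics, traces)
    # with the OpenTelemetry Python API. Service, span, and metric names are illustrative.
    import logging
    import time

    from opentelemetry import metrics, trace

    tracer = trace.get_tracer("checkout-service")            # traces: request path
    meter = metrics.get_meter("checkout-service")             # metrics: quantitative data
    request_latency = meter.create_histogram(
        "http.request.duration", unit="ms", description="Request latency"
    )
    logger = logging.getLogger("checkout-service")            # logs: detailed event records

    def handle_request(order_id: str) -> None:
        start = time.monotonic()
        with tracer.start_as_current_span("process-order"):   # one hop of the trace
            logger.info("processing order %s", order_id)
            # ... business logic would go here ...
        request_latency.record((time.monotonic() - start) * 1000)
    ```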

  • View profile for Shubham Srivastava

    Principal Data Engineer @ Amazon | Data Engineering

    63,959 followers

    Imagine you’re a data engineer. It’s 3 AM on a Friday. You’re home, asleep, but back in the office, your data pipeline is busy. And tonight, a bug sneaks into production. Just a tiny change: a single wrong script runs. Nobody notices at first (well, because they’re busy with the weekend). Suddenly, fake transactions start landing in your main tables. Customer data gets mixed up. Dashboards shift, and nobody knows why.

    Years ago, this would have been a nightmare. By Monday morning, you’d be scrambling to guess what happened and where the mess began. But tonight is different, because every step your data takes is recorded. Your system has data lineage. It’s like having security cameras for your entire pipeline. Every row knows where it came from, every script leaves a footprint, and every transformation is logged.

    So when you wake up and check the dashboard, you see the story:
    ↬ What script ran
    ↬ When it started
    ↬ Which tables it touched
    ↬ Where the wrong values spread

    You hit rewind, isolate the problem, and fix only what needs fixing. No mass panic, no engineers searching endlessly. You can get answers even at 3 AM!

    This is the power of data lineage and observability. That’s how you sleep well as a data engineer. That’s how you build pipelines you can trust.

    P.S.: Did you learn something new with this post? Would you want more posts like this?
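    A minimal sketch of the "every script leaves a footprint" idea, assuming a small helper that records one lineage event per transformation step; the script and table names are made up for illustration:

    ```python
    # Minimal data-lineage sketch: every transformation records what ran, when,
    # and which tables it read and wrote, so problems can be traced back later.
    # Script and table names below are illustrative only.
    import json
    from datetime import datetime, timezone

    LINEAGE_LOG = "lineage_events.jsonl"

    def record_lineage(script: str, inputs: list[str], outputs: list[str]) -> None:
        event = {
            "script": script,
            "started_at": datetime.now(timezone.utc).isoformat(),
            "inputs": inputs,      # tables the step read from
            "outputs": outputs,    # tables the step wrote to
        }
        with open(LINEAGE_LOG, "a") as f:
            f.write(json.dumps(event) + "\n")

    # Example: a transformation step leaves its footprint before writing results.
    record_lineage(
        script="enrich_transactions.py",
        inputs=["raw.transactions", "raw.customers"],
        outputs=["analytics.enriched_transactions"],
    )
    ```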

  • View profile for Arpit Bhayani
    Arpit Bhayani is an Influencer
    278,117 followers

    Most systems detect node or master failures using simple polling, and while this approach sounds straightforward, it has an interesting reliability issue... The typical approach is to observe a node directly. This usually means pinging it, checking if a port is open, or running a lightweight query to confirm it is alive. On paper, this seems fine, but all of these methods share the same weakness - what if the observer itself is wrong?

    In a distributed setup, network glitches are normal. Temporary packet loss, routing hiccups, or partial network partitions can easily make a healthy node appear unreachable to the observer. The usual way to deal with this is to retry multiple times and declare failure after the n-th consecutive failure.

    This creates a classic tradeoff. If n is small (or polling happens frequently), failure detection becomes fast, but false positives increase. A short-lived network blip can trigger an unnecessary failover, which can sometimes be more disruptive than the original issue. If n is large (or polling intervals are longer), false positives decrease, but real failures take longer to detect. That delay directly increases downtime.

    But there is a more reliable way to think about this problem when you already have a cluster of nodes available. Instead of relying on a single observer repeatedly polling a target node, you can allow multiple nodes in the cluster to independently perform health checks. The system then treats a node as failed only when a majority of observers agree that the node is unreachable.

    This consensus-based approach reduces the risk of false positives caused by network partitioning. Even if one observer loses connectivity, the rest of the cluster can still provide an accurate view of system health. Consensus is costly, so this approach is not the most cost-efficient. However, it can be very useful if your system is large enough and distributed across multiple geographies.
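    A rough sketch of the quorum idea in Python; the probe, node address, and majority threshold are assumptions for illustration, not a production failure detector:

    ```python
    # Sketch of quorum-based failure detection: a node is declared failed only when
    # a majority of independent observers report it unreachable. In a real cluster
    # each vote would come from a different node; here the votes are simulated.
    import socket

    def probe(host: str, port: int, timeout: float = 1.0) -> bool:
        """One observer's view: can we open a TCP connection to the node?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def node_is_failed(votes: list[bool]) -> bool:
        """Declare failure only if a majority of observers say the node is down."""
        down_votes = sum(1 for alive in votes if not alive)
        return down_votes > len(votes) // 2

    # Hypothetical target node; three observers probe it independently and vote.
    observer_votes = [probe("10.0.0.7", 5432) for _ in range(3)]
    print("node failed:", node_is_failed(observer_votes))
    ```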

  • View profile for Aurimas Griciūnas
    Aurimas Griciūnas is an Influencer

    Founder @ SwirlAI • Ex-CPO @ neptune.ai (Acquired by OpenAI) • UpSkilling the Next Generation of AI Talent • Author of SwirlAI Newsletter • Public Speaker

    183,367 followers

    I have been developing Agentic Systems for the past few years and the same patterns keep emerging. 👇

    𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗗𝗿𝗶𝘃𝗲𝗻 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁 is the most reliable way to be successful in building your 𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗦𝘆𝘀𝘁𝗲𝗺𝘀 - here is my template. Let’s zoom in:

    𝟭. Define a problem you want to solve: is GenAI even needed?
    𝟮. Build a Prototype: figure out if the solution is feasible.
    𝟯. Define Performance Metrics: you must have output metrics defined for how you will measure the success of your application.
    𝟰. Define Evals: split the above into smaller input metrics that can move the key metrics forward. Decompose them into tasks that could be automated and move the given input metrics. Define Evals for each. Store the Evals in your Observability Platform.

    ℹ️ Steps 𝟭. - 𝟰. are where AI Product Managers can help, but they can also be handled by AI Engineers.

    𝟱. Build a PoC: it can be simple (an Excel sheet) or more complex (a user-facing UI). Regardless of what it is, expose it to users for feedback as soon as possible.
    𝟲. Instrument your application: gather traces and human feedback and store them in an Observability Platform next to the previously stored Evals.
    𝟳. Run Evals on traced data: traces contain the inputs and outputs of your application; run evals on top of them.
    𝟴. Analyse failing Evals and negative user feedback: this data is gold as it specifically pinpoints where the Agentic System needs improvement.
    𝟵. Use data from the previous step to improve your application - prompt engineer, improve the AI system topology, finetune models, etc. Make sure the changes move Evals in the right direction.
    𝟭𝟬. Build and expose the improved application to the users.
    𝟭𝟭. Monitor the application in production: this comes out of the box - you have implemented evaluations and traces for development purposes, and they can be reused for monitoring. Configure specific alerting thresholds and enjoy the peace of mind.

    ✅ 𝗖𝗼𝗻𝘁𝗶𝗻𝘂𝗼𝘂𝘀 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁 𝗼𝗳 𝘆𝗼𝘂𝗿 𝗮𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻:
    ➡️ Run steps 𝟲. - 𝟭𝟬. to continuously improve and evolve your application.
    ➡️ As you build up in complexity, new requirements can be added to the same application; this includes running steps 𝟭. - 𝟱. and attaching the new logic as routes to your Agentic System.
    ➡️ You start off with a simple chatbot and add a route that can classify user intent to take action (e.g. add items to a shopping cart).

    What is your experience in evolving Agentic Systems? Let me know in the comments 👇
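    A toy sketch of step 7 (running evals over traced data); the trace fields and the single eval function are assumptions for illustration, not any particular observability platform's API:

    ```python
    # Toy sketch of running evals over traced data. Each trace holds the input and
    # output of one agent run; each eval is a function that scores it. The trace
    # fields and the eval below are illustrative, not a specific platform's API.
    from typing import Callable

    Trace = dict  # e.g. {"input": ..., "output": ..., "sources": ...}

    def eval_answer_is_grounded(trace: Trace) -> bool:
        """Hypothetical eval: the answer must mention at least one retrieved source."""
        return any(src in trace["output"] for src in trace.get("sources", []))

    def run_evals(traces: list[Trace], evals: list[Callable[[Trace], bool]]) -> dict:
        """Return the pass rate per eval so regressions are visible between releases."""
        return {
            e.__name__: sum(e(t) for t in traces) / max(len(traces), 1)
            for e in evals
        }

    traces = [
        {"input": "Where is my order?", "output": "Shipped via DHL (order #123)", "sources": ["DHL"]},
        {"input": "Cancel my plan", "output": "Done!", "sources": ["billing_api"]},
    ]
    print(run_evals(traces, [eval_answer_is_grounded]))  # -> pass rate per eval
    ```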

  • View profile for Pooja Jain

    Open to collaboration | Storyteller | Lead Data Engineer@Wavicle| Linkedin Top Voice 2025,2024 | Linkedin Learning Instructor | 2xGCP & AWS Certified | LICAP’2022

    194,429 followers

    It takes 10 minutes to fix a crash. It takes 3 days to find a silent data quality error.

    Most data architectures fail quietly. They don't break on launch day. They break on day 90, when nobody remembers the decision that caused it.

    Here’s what that looks like in practice:

    INGESTION
    ✕ Pull everything, filter later
    ✓ Validate at the edge
    Bad data is cheapest to kill at entry. Let it in and it travels everywhere.
    ✕ No schema contract with the source
    ✓ Agree on types and nullability upfront
    Upstream changes without a contract = your problem, not theirs.

    STORAGE
    ✕ One giant table, query it all
    ✓ Partition by how the data is actually read
    Wrong partitioning doesn’t error. It just costs you forever.
    ✕ Mix raw and transformed in the same layer
    ✓ Separate raw, cleaned, and serving
    You will always need to reprocess. Design for it.

    TRANSFORMATION
    ✕ Transform then validate
    ✓ Validate then transform
    You can’t trust output built on dirty input.
    ✕ Logic buried inside SQL joins
    ✓ Explicit, tested, documented
    If only one person understands it, it’s already a liability.

    ORCHESTRATION
    ✕ Trigger jobs on a schedule
    ✓ Trigger on data arrival and completeness
    Schedules don’t know if the data actually showed up.
    ✕ No dependency mapping
    ✓ Every pipeline knows what it needs before it runs
    Silent upstream failure + blind downstream trigger = corrupted output, zero alerts.

    OBSERVABILITY
    ✕ Alert only when the pipeline crashes
    ✓ Alert when data behaves unexpectedly
    A crash is obvious. Quietly wrong data isn’t.

    GOVERNANCE
    ✕ Give access on request, document once
    ✓ Define ownership, lineage, and living docs
    When something breaks, lineage is the difference between 10 minutes and 3 days.

    Most engineers optimize what’s visible. Great architects design for what breaks. Before your next diagram, ask: what hidden failure am I introducing today?

    💡 Save this for your next design review.
    🔖 Tag an engineer who needs to see it.

    #data #engineering #systemdesign #cloud #intelligence #business #growth
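    A small sketch of "validate at the edge" under an assumed schema contract; the field names and rules are made up for illustration:

    ```python
    # Sketch of "validate at the edge": check types and nullability against an agreed
    # schema contract before a record is allowed into the pipeline. The contract and
    # field names are hypothetical.
    EVENT_CONTRACT = {
        "order_id": {"type": str,   "nullable": False},
        "amount":   {"type": float, "nullable": False},
        "coupon":   {"type": str,   "nullable": True},
    }

    def validate_at_edge(record: dict) -> list[str]:
        """Return a list of violations; an empty list means the record may enter."""
        violations = []
        for field, rule in EVENT_CONTRACT.items():
            value = record.get(field)
            if value is None:
                if not rule["nullable"]:
                    violations.append(f"{field} is required")
            elif not isinstance(value, rule["type"]):
                violations.append(f"{field} expected {rule['type'].__name__}")
        return violations

    # Bad data is cheapest to kill here, at entry.
    print(validate_at_edge({"order_id": "A-42", "amount": "19.99"}))
    # -> ['amount expected float']
    ```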

  • View profile for Spiros Xanthos

    Founder and CEO at Resolve AI 🤖

    18,200 followers

    Roughly seven months ago there was a noticeable shift in how engineers at Resolve AI worked, and no, I'm not talking about AI coding. What I noticed is that they stopped querying observability tools directly. Instead of opening dashboards or crafting queries manually, they started routing investigations through Resolve. It wasn't a top-down decision; engineers just gradually stopped going direct once the agents were consistently getting them accurate answers faster. I talked about this recently with Tom Wilkie, Manoj Acharya, and Cyril TOVENA on the Grafana Labs Big Tent Podcast.

    Every major observability platform was designed for human operators, from the query languages to the interfaces. All of it assumes a human is on the other end, running a handful of queries during an incident. Agents don't work that way. They query constantly, pull from multiple systems simultaneously, and need API throughput most platforms weren't architected to handle. When they hit a bottleneck, they don't wait. They route around it.

    There's a practical question here for anyone running production systems at scale. What's the API throughput ceiling before you get rate-limited? What happens to your bill when query volume goes up by an order of magnitude? Can an agent traverse metrics, logs, and traces in a single investigation without hitting access gaps between tools? The organizations moving quickest on this are evaluating observability vendors not on dashboard quality or ingestion pricing, but on whether the platform is ready to be operated by agents as the primary interface.

    Lastly, the old argument in observability has always been that consolidation wins (i.e., all data on a single platform, reduced tool sprawl), but I think agents actually reverse that logic. If an agent can query five specialized systems and synthesize results faster than a human can navigate one general-purpose platform, the case for specialized tooling gets stronger, not weaker. The glue between systems isn't a human anymore. It's the agent. Listen to the full episode below in the comments.

  • View profile for Gurumoorthy Raghupathy

    Expert in Solutions and Services Delivery | SME in Architecture, DevOps, SRE, Service Engineering | 5X AWS, GCP Certs | Mentor

    14,141 followers

    🚀 Building Observable Infrastructure: Why Automation + Instrumentation = Production Excellence and Customer Success

    After building our platform's infrastructure and application automation pipeline, I wanted to share why combining Infrastructure as Code with deep observability isn't optional—it's foundational, as shown in the screenshots of our implementation on Google Cloud.

    The Challenge: Manual infrastructure provisioning and application onboarding creates consistency gaps, slow deployments, and zero visibility into what's actually happening in production. When something breaks at 3 AM, you're debugging blind.

    The Solution: Modular Terraform + OpenTelemetry from Day One. Our approach centered on three principles:

    1️⃣ Modular, well-architected Terraform modules as reusable building blocks. Each service (Argo CD, Rollouts, Sonar, Tempo) gets its own module. This means:
    1. Consistent deployment patterns across environments
    2. Version-controlled infrastructure state
    3. Self-service onboarding for dev teams

    2️⃣ OpenTelemetry instrumentation of every application during onboarding as a minimum specification. This allows capturing:
    1. Distributed traces across our apps / services / nodes (graph)
    2. Golden signals (latency, traffic, errors, saturation)
    3. Custom business metrics that matter

    3️⃣ Single Pane of Glass Observability. Our Grafana dashboards aggregate everything: service health, trace data, build pipelines, resource utilization. When an alert fires, we have context immediately—not 50 tabs of different tools.

    Real Impact:
    → Application onboarding dropped from days to hours
    → Mean time to resolution decreased by 60%+ (actual trace data > guessing)
    → Infrastructure drift: eliminated through automated state management
    → Dev teams can self-service without waiting on platform engineering

    Key Learnings:
    → Modular Terraform requires discipline up front but pays dividends at scale.
    → Keep OpenTelemetry context propagation consistent across your stack.
    → Dashboards should tell a story; organise them by user journey.
    → Automation without observability is just faster failure. You need both.

    The Technical Stack:
    → Terraform for infrastructure provisioning
    → ArgoCD for GitOps-based deployments
    → OpenTelemetry for distributed tracing and metrics
    → Tempo for trace storage
    → Grafana for unified visualisation

    The screenshot shows our command center:
    → Active services
    → Full trace visibility
    → Automated deployments with comprehensive health monitoring

    Bottom line: Modern platform engineering isn't about choosing between automation OR observability. It's about building systems where both are inherent to the architecture. When infrastructure is code and telemetry is built-in, you get reliability, velocity, and visibility in one package.

    Curious how others are approaching this? What does your observability strategy look like in automated environments?

    #DevOps #PlatformEngineering #Observability #InfrastructureAsCode #OpenTelemetry #SRE #CloudNative
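    A conceptual sketch of the four golden signals mentioned above for a single endpoint; in the stack described here these would be exported via OpenTelemetry, but plain counters keep the idea framework-agnostic and purely illustrative:

    ```python
    # Conceptual golden-signals sketch: traffic, errors, latency, and a saturation
    # proxy for one handler. Plain in-memory counters stand in for a real exporter.
    import random
    import time

    golden = {"traffic": 0, "errors": 0, "latency_ms": [], "in_flight": 0}

    def handle(request_fn):
        golden["traffic"] += 1                 # traffic: requests served
        golden["in_flight"] += 1               # saturation proxy: concurrent work
        start = time.monotonic()
        try:
            return request_fn()
        except Exception:
            golden["errors"] += 1              # errors: failed requests
            raise
        finally:
            golden["latency_ms"].append((time.monotonic() - start) * 1000)  # latency
            golden["in_flight"] -= 1

    # Simulated traffic; the lambda is a stand-in for real business logic.
    for _ in range(5):
        try:
            handle(lambda: 1 / random.choice([0, 1]))  # sometimes fails
        except ZeroDivisionError:
            pass
    print(golden["traffic"], golden["errors"], round(sum(golden["latency_ms"]), 2))
    ```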

  • View profile for Ricardo Castro

    Director of Engineering | Tech Speaker & Writer. Opinions are my own.

    11,780 followers

    Another SRE anti-pattern stems from not having adequate observability, which is the practice of understanding how systems behave by collecting and analyzing data from various sources. Without adequate observability, SREs and engineering teams are essentially flying blind, making it difficult to identify, diagnose, and resolve issues effectively.

    Some of the problems and consequences associated with inadequate observability:

    - Increased Mean Time to Detection (MTTD): With inadequate observability, it takes longer to detect issues in your system. This can lead to increased downtime and negatively impact user experience.
    - Increased Mean Time to Resolution (MTTR): Once you detect a problem, troubleshooting becomes more challenging without proper observability tools and data. This results in longer downtime and more significant disruptions.
    - Difficulty in Root Cause Analysis: Without comprehensive data on system performance, it's hard to pinpoint the root causes of incidents. This can lead to "fixing symptoms" rather than addressing underlying issues, leading to recurring problems.
    - Inefficient Capacity Planning: Inadequate observability can hinder your ability to monitor resource utilization and plan for scaling. This may result in overprovisioning or underprovisioning resources, both of which can be costly.
    - Limited Understanding of User Behavior: Observability isn't just about monitoring system internals; it also includes understanding user interactions. Without this knowledge, it's challenging to optimize your system for user needs and preferences.

    What are some of the practices and tools that SREs can use?

    - Logging: Implement structured logging and ensure that logs are collected, centralized, and easily searchable. Use logging tools like Elasticsearch, Fluentd, or Loki.
    - Metrics: Define relevant metrics for your system and collect them using tools like Prometheus or InfluxDB.
    - Distributed Tracing: Implement distributed tracing to track requests as they traverse various services. Tools like Jaeger and OpenTelemetry can help you gain insights into service dependencies and latency issues.
    - Event Tracking: Capture important events and errors in your system using messaging systems like Kafka or RabbitMQ.
    - Monitoring and Alerting: Set up monitoring and alerting systems that can notify you of critical issues in real time. Tools like Grafana or Prometheus help in this regard.
    - Anomaly Detection: Consider implementing anomaly detection techniques to automatically identify unusual behavior in your system.
    - User Analytics: Collect data on user behavior and interactions to better understand user needs and improve the user experience.

    By investing in observability, teams can proactively identify and address issues, improve system reliability, and provide a better overall user experience. It's a fundamental aspect of SRE principles and practices.
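    As a small illustration of the metrics point, a minimal sketch using the prometheus_client Python library; the metric names and port are arbitrary illustrative choices:

    ```python
    # Minimal metrics sketch with prometheus_client: expose a request counter and a
    # latency histogram that a Prometheus server can scrape. Metric names and the
    # port below are arbitrary illustrative choices.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
    LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

    @LATENCY.time()                      # records how long each call takes
    def handle_request() -> None:
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for real work
        REQUESTS.labels(status="ok").inc()

    if __name__ == "__main__":
        start_http_server(8000)          # metrics served at http://localhost:8000/metrics
        while True:
            handle_request()
    ```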

  • View profile for Santiago Valdarrama

    Computer scientist and writer. I teach hard-core Machine Learning at ml.school.

    121,954 followers

    If you can't see what an agent does, you can't improve it, you can't debug it, and you can't trust it.

    It's crazy how many teams are building agents with no way to understand what they're doing. Literally ZERO observability.

    This is probably one of the first questions I ask every new team I meet: can you show me the traces of a few executions of your agents? Nada. Zero. Zilch.

    Large language models make bad decisions all the time. Agents fail, and you won't realize it until somebody complains.

    At a minimum, every agent you build should produce traces showing the full request flow, latency analysis, and system-level performance metrics. This alone will surface 80% of operational issues. But ideally, you can do something much better and capture all of the following:
    • Model interactions
    • Token usage
    • Timing and performance metadata
    • Event execution

    If you want reliable agents, observability is not optional.
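    A minimal sketch of capturing those fields for each model call; the call_model function and its return shape are hypothetical stand-ins, not a specific LLM SDK:

    ```python
    # Minimal agent-tracing sketch: wrap every model call and record the interaction,
    # token usage, and timing. call_model and its return shape are hypothetical.
    import json
    import time
    import uuid

    TRACES: list[dict] = []   # in practice this would go to an observability backend

    def traced_model_call(call_model, prompt: str) -> str:
        start = time.monotonic()
        response = call_model(prompt)           # assumed to return text + token counts
        TRACES.append({
            "trace_id": str(uuid.uuid4()),
            "prompt": prompt,                   # model interaction (input)
            "output": response["text"],         # model interaction (output)
            "tokens": response["usage"],        # token usage
            "latency_ms": (time.monotonic() - start) * 1000,  # timing metadata
        })
        return response["text"]

    def fake_model(p: str) -> dict:
        """Fake model used only to make the sketch runnable end to end."""
        return {"text": f"echo: {p}", "usage": {"prompt_tokens": len(p.split()), "completion_tokens": 2}}

    print(traced_model_call(fake_model, "Summarize today's incidents"))
    print(json.dumps(TRACES[-1], indent=2))
    ```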

  • View profile for Barr Moses

    Co-Founder & CEO at Monte Carlo

    63,087 followers

    Three weeks ago, one of our agents routed hundreds of decisions based on data that had quietly gone wrong. No error. No flag. Nothing in the loop to surface it. We caught it because we built for observability before we built for scale.

    "Agent-first" is the new fetch. Most companies saying it haven't shipped a real agent into production. They've shipped a demo.

    At Monte Carlo, we're running three agents in our own operations. And customers like Axios are already monitoring all dimensions of agent reliability in theirs: context, behavior, outputs.

    Here's what we've learned actually breaks:
    — The data feeding the agent goes stale or drifts silently
    — The agent's behavior shifts without a model change
    — The output looks right but isn't

    The agent loop isn't the hard part. Knowing when it breaks is.

    The Princeton research co-authored by Sayash Kapoor and Arvind Narayanan on AI agent reliability found that across 14 agentic models, capability gains yielded almost no improvement in reliability. We've seen it firsthand. Production changes everything. A demo can tolerate a silent failure. Your operations can't.

    Build for observability first. When your agents fail, what are you actually monitoring: the model, the data, or both?

    #agents #AIobservability
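    A tiny sketch of catching the first failure mode above (data quietly going stale) before an agent consumes it; the table name, threshold, and lookup function are made up for illustration:

    ```python
    # Tiny sketch of a staleness check run before an agent consumes a table: if the
    # newest row is older than an agreed threshold, block the run and alert.
    # The table name, SLA, and get_latest_row_timestamp are hypothetical.
    from datetime import datetime, timedelta, timezone

    FRESHNESS_SLA = timedelta(hours=6)   # assumed agreement with the data owner

    def get_latest_row_timestamp(table: str) -> datetime:
        # Placeholder for a warehouse query like: SELECT MAX(updated_at) FROM <table>
        return datetime.now(timezone.utc) - timedelta(hours=9)

    def assert_fresh(table: str) -> None:
        age = datetime.now(timezone.utc) - get_latest_row_timestamp(table)
        if age > FRESHNESS_SLA:
            raise RuntimeError(f"{table} is stale ({age} old); agent run blocked")

    try:
        assert_fresh("analytics.customer_features")   # 9h old > 6h SLA, so it raises
    except RuntimeError as e:
        print("alert:", e)
    ```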
