Observability Defined

Corneile Britz

Published Jul 26, 2020

Traditionally, "monitoring" was a term reserved for Operations engineers, and often a grim reminder of unsophisticated tools and approaches, perhaps nothing more than up/down switches.

That may have been true a decade ago, but things have changed at a dizzying pace since then, in my opinion primarily due to DevOps. Nowadays we have a new term in our midst, namely Observability.

It is not revolutionary, but perhaps the introduction of **statsd** by **Etsy** started the change. Allowing metrics to be implemented with a straightforward and simple approach, without tons of extra code, ensured that the excuse of waste could no longer be used.
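To illustrate how little code a statsd-style metric needs, here is a minimal sketch using only the Python standard library. It assumes a statsd-compatible daemon listening on UDP port 8125 of the local machine; the metric names `invoices.processed` and `invoices.duration` are hypothetical and chosen only for illustration.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> str:
    """Build a statsd line in the form <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line over UDP (fire-and-forget, no reply expected).

    The host and port are assumptions; point them at your own daemon.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metrics: a counter ("c") and a timer in milliseconds ("ms").
counter_line = format_metric("invoices.processed", 1, "c")
timer_line = format_metric("invoices.duration", 320, "ms")
```

Calling `send_metric(counter_line)` from the code path that processes an invoice is all the instrumentation required; because UDP is fire-and-forget, the application does not wait for the metrics backend.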

Observability consists of the following pillars:

Monitoring
Tracing
Logging

In this part, we will discuss the value of monitoring.

Monitoring is my solution of choice when starting any work to improve service and value delivery. The simple truth is that what is measured is improved, although the Google SRE book states:

*Your monitoring system should address two questions: what's broken, and why? The "what's broken" indicates a symptom; the "why" indicates a (possibly intermediate) cause. "What" versus "Why" is one of the most important distinctions...*

Teams starting on the journey are often unsure of where to begin, especially when they are used to relying on APM tools. The following order is a sensible starting point:

Inbound calls (APIs)
Outbound calls (Databases and services)
Service logic (Calculations)
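The order above can be sketched as a single decorator applied layer by layer. This is a minimal illustration under stated assumptions, not a definitive implementation: the in-memory dictionaries stand in for a real metrics backend, and the function names (`get_invoice_total`, `load_invoice`, `invoice_total`) are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-ins for a metrics backend, keyed by (layer, name).
CALLS = defaultdict(int)
ERRORS = defaultdict(int)
SECONDS = defaultdict(float)


def instrumented(layer, name):
    """Count calls, errors and elapsed time for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            key = (layer, name)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                ERRORS[key] += 1
                raise
            finally:
                # Runs on success and on failure alike.
                CALLS[key] += 1
                SECONDS[key] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("outbound", "load_invoice")
def load_invoice(invoice_id):
    # Hypothetical database call, replaced by a fixed record here.
    return {"id": invoice_id, "lines": [10.0, 5.5]}


@instrumented("logic", "invoice_total")
def invoice_total(invoice):
    return sum(invoice["lines"])


@instrumented("inbound", "get_invoice_total")
def get_invoice_total(invoice_id):
    # Hypothetical API handler: one inbound call fans out to the other layers.
    return invoice_total(load_invoice(invoice_id))
```

Starting with the inbound decorator alone already shows request volume, error rate and latency per endpoint; the outbound and logic layers can be added later to explain where that time goes.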

The notion of making a system as visible as possible, uncovering any potential issues that may present themselves at 2 am over the weekend, is 100% sensible until we experience the overwhelming noise: receiving continuous notifications, zooming out to fit, and ultimately ignoring the metrics. The solution, however, is not to split the data and metrics.

Having "all" the metrics of the system, infrastructure, deployments and other components in the same aggregation platform is invaluable for the sole reason of transparency. Seeing all the trends of the platform, overlaid with events that introduced change, promotes improved diagnosis and response times for all teams. With that said, the tendency still exists not to share information within the same organisation, reason unknown.

Having all the information available from all the systems does mean that we all share the same information. We do not, of course, have to share the same concerns, dashboards and alerts. In this case, we ensure that we continuously maintain our landscape by removing items not used, refining calculations or updating notification channels.

Not every alert needs to go to everyone on the mailing list, but everyone should have access to the dashboards.

Context is key to effective monitoring, and tools such as Prometheus make it achievable. Details such as the number of items to process in an order are lost when typical automated or out-of-the-box solutions are employed. Invoices with 5000 lines cannot be compared to invoices with 2 lines; focused instrumentation surfaces this value.
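One way to keep that context is to record the size of the work alongside its duration. The sketch below uses only the Python standard library and groups invoice processing times by line-count bucket; the bucket boundaries and function names are assumptions made for illustration. In a Prometheus setup the same idea would typically be expressed as a label or a histogram rather than an in-memory dictionary.

```python
from collections import defaultdict

# Assumed bucket boundaries (upper bounds on invoice line count).
BUCKETS = (10, 100, 1000)

# Durations in seconds, grouped by size bucket.
durations = defaultdict(list)


def size_bucket(line_count):
    """Return the name of the first bucket that fits the line count."""
    for upper in BUCKETS:
        if line_count <= upper:
            return f"le_{upper}"
    return f"gt_{BUCKETS[-1]}"


def record_invoice(line_count, seconds):
    """Store a processing time together with its size context."""
    durations[size_bucket(line_count)].append(seconds)


def mean_seconds_by_bucket():
    """Average processing time per bucket, so like is compared with like."""
    return {bucket: sum(values) / len(values)
            for bucket, values in durations.items()}
```

With this in place, a 2-line invoice and a 5000-line invoice land in different buckets, so a slow large invoice no longer distorts the picture for small ones.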

My recommendation is:

Keep it simple, predictable and reliable.
Start with recent failures or incidents that impact customers.
Add metrics to the Definition of Done.
Understand that Observability is not a once-off task.

Monitoring is likely the single highest return on investment!
