Observability Defined
Traditionally, "monitoring" was a term reserved for Operations engineers, and often a grim reminder of unsophisticated tools and approaches, perhaps nothing more than up/down switches.
That may have been true a decade ago, but things have changed at a dizzying pace since then, in my opinion primarily due to DevOps. Nowadays we have a new term in our midst, namely Observability.
It is not revolutionary, but perhaps the introduction of **statsd** by **Etsy** started the change. It allowed metrics to be implemented with a straightforward and simple approach that did not result in tons of extra code, which ensured the excuse of waste could not be used.
Observability consists of the following pillars:
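To illustrate how little code a statsd-style metric needs, here is a minimal sketch in Python. It builds the plain-text `name:value|type` datagram that statsd accepts over UDP. The helper names, the metric names, and the host and port defaults (8125 is the conventional statsd port) are assumptions for illustration, not part of any particular client library.

```python
import socket


def statsd_line(name: str, value, kind: str = "c") -> str:
    """Format one metric in the statsd text protocol: name:value|type.

    kind is "c" for a counter, "ms" for a timer, "g" for a gauge.
    """
    return f"{name}:{value}|{kind}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a metric over UDP (assumed local statsd agent)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, one counter and one timer.
order_counter = statsd_line("checkout.orders", 1)
latency_timer = statsd_line("checkout.latency", 320, "ms")
```

Calling `send_metric(order_counter)` would emit the counter. Because UDP is connectionless, the application does not block or fail when no agent is listening, which is part of why this style of instrumentation is so cheap to add.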
- Monitoring
- Tracing
- Logging
In this part, we will discuss the value of monitoring.
Monitoring is my solution of choice when starting any work to improve service and value delivery. The simple truth is that what is measured is improved. The Google SRE book frames the goal more precisely:
*Your monitoring system should address two questions: what's broken, and why? The "what's broken" indicates a symptom; the "why" indicates a (possibly intermediate) cause. "What" versus "Why" is one of the most important distinctions...*
Teams starting on the journey are quite unsure of where to begin when they are used to relying on APM tools. The following order is perfectly sensible:
- Inbound calls (APIs)
- Outbound calls (Databases and services)
- Service logic (Calculations)
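The order above can be sketched with a single decorator applied at each of the three layers. This is a minimal, in-memory illustration only: the `instrumented` decorator, the registry dictionaries, and the invoice functions are hypothetical names, and a real service would forward these values to its metrics backend instead of keeping them in process.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory registries keyed by (layer, name); stand-ins for a metrics backend.
CALLS = defaultdict(int)
ERRORS = defaultdict(int)
SECONDS = defaultdict(float)


def instrumented(layer: str, name: str):
    """Count calls, errors and total duration for the wrapped function."""
    key = (layer, name)

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                ERRORS[key] += 1
                raise
            finally:
                CALLS[key] += 1
                SECONDS[key] += time.perf_counter() - start

        return wrapper

    return decorator


@instrumented("outbound", "load_invoice")
def load_invoice(invoice_id: int) -> dict:
    # Stand-in for a database or downstream service call.
    return {"id": invoice_id, "lines": [10.0, 5.5]}


@instrumented("logic", "invoice_total")
def invoice_total(invoice: dict) -> float:
    # Stand-in for a service calculation.
    return sum(invoice["lines"])


@instrumented("inbound", "GET /invoices/{id}")
def get_invoice_total(invoice_id: int) -> float:
    # Stand-in for an API handler.
    return invoice_total(load_invoice(invoice_id))
```

Starting at the inbound layer gives the customer's view of the service first. The outbound and logic layers then explain where the time or the errors came from.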
The notion of making a system as visible as possible, uncovering any potential issues that may present themselves at 2 am over the weekend, is 100% sensible until we experience the overwhelming noise. We receive continuous notifications, zoom out to fit everything on screen, and ultimately ignore the metrics. The solution, however, is not to split the data and metrics.
Having "all" the metrics of the system, infrastructure, deployments and other components in the same aggregation platform is invaluable for the sole reason of transparency. Seeing all the trends of the platform, overlaid with events that introduced change, promotes improved diagnosis and response times for all teams. With that said, the tendency not to share information within the same organisation still exists, for reasons unknown.
Having all the information available from all the systems does mean that we all share the same information. We do not, of course, have to share the same concerns, dashboards and alerts. Instead, each team continuously maintains its own part of the landscape by removing items that are not used, refining calculations or updating notification channels.
Not every alert needs to go to everyone on the mailing list, but everyone should have access to the dashboards.
Context is key to effective monitoring, and it is achieved through custom instrumentation with tools such as Prometheus. Detail such as the number of items to process in an order is lost when typical automated or out-of-the-box solutions are employed. An invoice with 5000 lines cannot be compared to an invoice with 2 lines; focused instrumentation surfaces this difference.
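The invoice comparison can be made concrete with a small sketch. The `InvoiceSample` type, the `seconds_per_line` helper, and the sample figures below are all hypothetical and chosen only to illustrate how recording the line count alongside the duration changes the picture.

```python
from dataclasses import dataclass


@dataclass
class InvoiceSample:
    invoice_id: str
    line_count: int   # the business context an out-of-the-box tool does not see
    seconds: float    # total processing time for the invoice


def seconds_per_line(sample: InvoiceSample) -> float:
    """Normalise processing time by invoice size; guard against empty invoices."""
    return sample.seconds / max(sample.line_count, 1)


# Illustrative figures only.
small = InvoiceSample("INV-1", line_count=2, seconds=0.04)
large = InvoiceSample("INV-2", line_count=5000, seconds=50.0)

# Raw duration suggests the large invoice is the problem ...
slowest_raw = max([small, large], key=lambda s: s.seconds)
# ... while the per-line cost shows the small invoice is processed less efficiently.
slowest_per_line = max([small, large], key=seconds_per_line)
```

With only the raw duration, the 5000-line invoice looks like an outlier. With the line count attached, it is the 2-line invoice that costs more per line, which is the kind of signal focused instrumentation is meant to surface.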
My recommendation is:
- Keep it simple, predictable and reliable.
- Start with recent failures or incidents that impact customers.
- Add metrics to the Definition of Done.
- Understand that Observability is not a once-off task.
Monitoring is likely the single highest return on investment!