Monitoring is about Quality NOT Quantity of Data

I once wrote a blog many years ago titled “Why 99% of log data is crap” and was duly told not to publish it because it might upset another software vendor. Well, it turns out I was wrong. Last year I ran the numbers, analyzing log data from over 1,000 applications; the result was that 97% of log data turned out to be crap. Man, I felt so stupid.

I have always hated log files, ever since my Java development days. They were always full of crappy ClassNotFoundExceptions because developers couldn’t be arsed to clean up their library references and code. Running ./startWebLogic.sh back in the day was like reading pages of Harry Potter written in Japanese until you finally saw the words “<Server started in RUNNING mode>” 15 minutes later. How slow was Java in 2006?

“Monitor everything” - probably the stupidest words of all time. Monitor what is relevant. Monitor what matters. Monitor what is important. If everything is important, nothing is important. 97% of monitoring metrics are a commodity: they are readily accessible via APIs and cheap to collect. I’m talking largely about OS and infrastructure KPIs like CPU, memory and disk I/O, along with the standard perfmon and JMX/CLR counters - boring metrics that your dog/panda/cat/goldfish could collect.
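To make “monitor what matters” concrete, here’s a minimal sketch - my own illustration, not any vendor’s API - of filtering a noisy log stream down to actionable lines. The severity list and noise patterns are hypothetical; the point is that a severity threshold plus a small blocklist (say, for the ClassNotFoundException spam above) already separates the 3% gold from the 97% crap:

```python
import re

# Hypothetical noise patterns -- the stuff that makes up the "97%".
NOISE = [
    re.compile(r"ClassNotFoundException"),
    re.compile(r"^DEBUG"),
]

# Severities worth waking a human up for (illustrative choice).
ACTIONABLE = ("ERROR", "FATAL")

def actionable_lines(log_lines):
    """Yield only log lines worth a human's attention."""
    for line in log_lines:
        # Drop known noise regardless of severity.
        if any(p.search(line) for p in NOISE):
            continue
        # Keep only lines at or above the alerting threshold.
        if line.startswith(ACTIONABLE):
            yield line

sample = [
    "DEBUG loading bean definitions",
    "ERROR java.lang.ClassNotFoundException: com.foo.Bar",
    "INFO Server started in RUNNING mode",
    "ERROR payment service timed out after 30s",
]
print(list(actionable_lines(sample)))
# -> ['ERROR payment service timed out after 30s']
```

Four lines in, one actionable line out - which is roughly the ratio the numbers above suggest you should expect.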

Collecting everything is almost an epidemic right now in the world of monitoring. Cameron Haight from Gartner said something interesting a few weeks back at Gartner DC: “Just because you are charting and measuring it doesn’t mean it provides value.” Amen, sister. Cameron’s point was that metrics and data need to be actionable. A spike on a chart is not actionable unless you know what really caused the spike or, more importantly, what the spike actually means in the bigger picture - think symptom vs. root cause.

Getting quality monitoring data that delivers actionable insight is hard. I see log files and OS/infra metrics as mining copper: no matter how hard you search, their value will always be limited. By comparison, I see application run-time data as mining gold, because it’s generally rare and, when mined properly, can be extremely valuable. Over the past 5 years, I’ve always found it odd that the likes of Splunk/Sumo/Elastic never built or acquired a BCI application run-time agent.

A classic recent example of this was Datadog announcing that it’s going to enter the APM market and take on the likes of AppDynamics and New Relic. Datadog started with infrastructure monitoring and collected every mofo metric on the planet. APM is a pretty ballsy move for Datadog if you ask me, but it’s an obvious move, and one that other vendors like SignalFx, Wavefront and Sumo Logic must make if they are to grow and survive. Selling commoditized data which you can stream onto a chart in near real-time isn’t anything new. Wily Introscope did that back in 2004, albeit with a pants UI which took days to render in a large environment. My point is that collecting millions of data points looks pretty, but doesn’t deliver the context, insight or business value required these days to solve problems.

There is a sound reason why AppDynamics, New Relic and Dynatrace are kicking ass, and it’s not because they pride themselves on collecting OS or infrastructure metrics in real-time. At the end of the day it’s about the quality of data: quality data creates actionable information once you layer analytics on top of it. Metrics are either useful or not useful, and just because you can collect them doesn’t mean they unlock value.

Don’t get me wrong, I love a sexy pink and orange pie chart as much as the next DevOps dude. But in the real world, and based on my experience, these charts are only as good as the data and insight behind them. Sex sells, but it doesn’t solve problems; it just creates more of them when your app is on fire.

Steve.

@BurtonSays

Nice article Stephen, I have seen this first hand many a time and have had to wade through the dross to find the occasional nugget. Actually, if 97% (or any sizeable percentage) of the content of logs is crap/line-noise, then this in itself is valuable - as a stick with which to (proverbially) beat those responsible for generating it in the first place, and to 'help' them ameliorate the situation.


97% of log data is crap...until you need it. I see three uses for log data. First, as you talk about, is monitoring the state of your systems. If you can configure your applications to only spit out information that's useful in your monitoring efforts, then great. But most of us have to have our external monitoring software filter through all the crap to alert us to important events. What you or I think is important may vary, so it's critical that the monitoring application be flexible enough to let us search for our own alert criteria. The second use is forensics. Many times I've found vital clues to mysterious crashes or other problems by sifting through the mountains of crap. Sometimes it's quite an expedition, but I'd rather have too much data than be missing the one part I need. Third is optimization. That's where developers take on the mission of reducing the amount of log data by removing the errors that are causing it. For an in-house staff, that can be encouraged by sending them the logs of their errors daily and having management review the logs and ask "why do we get these errors 2,500 times every day?" The problem with collecting all the extra data is mostly one of network bandwidth and disk space, and secondarily processor load - all of which are relatively easy to expand today. I would rather collect too much and have it available than skip some as-yet-unknown vital bit because I was discarding everything I didn't deem "important."

Great article - and yes, monitoring is not about seeing as many graphs as possible but about selecting the right ones: the ones that show you why you have a performance problem and how to improve response times.

Sorry, but I have to disagree. The correct mantra is: monitor everything, collect data on everything, alert only on what matters. I have been able to use the sheer volume of data to recreate accurate timelines of how and when small failures cascade into large failures. I have used such data to identify the trigger events of small and obscure bugs that created significant impact. And I have done those things dozens of times in the past few years - all because I had that data. The load difference of such monitoring is tiny, so the only real cost is in the sheer volume of data, and hard disks are cheap.

I 100% agree with this. I recently had a situation where we were parsing about 75% of our data, of which only 3-5% produced alerts the NOC actually acted upon. We then decided to discard all that unwanted data, and we now have efficient and effective monitoring in place.
