Monitoring is about Quality NOT Quantity of Data

I once wrote a blog many years ago titled “Why 99% of log data is crap” and was duly told not to publish it because it might upset another software vendor. Well, it turns out I was wrong. Last year I ran the numbers, analyzing log data from over 1,000 applications; the result was that 97% of log data turned out to be crap. Man, I felt so stupid.

I have always hated log files, ever since my Java development days. They were always full of crappy ClassNotFoundExceptions because developers couldn’t be arsed to clean up their library references and code. Running ./startWebLogic.sh back in the day was like reading pages of Harry Potter written in Japanese until you finally saw the words “<Server started in RUNNING mode>” 15 minutes later. How slow was Java in 2006?

“Monitor everything” - probably the stupidest words of all time. Monitor what is relevant. Monitor what matters. Monitor what is important. If everything is important, nothing is important. 97% of monitoring metrics are a commodity: they are readily accessible via APIs and cheap to collect. I’m talking largely about OS and infrastructure KPIs like CPU, memory and disk I/O, along with the standard perfmon and JMX/CLR counters - boring metrics that your dog/panda/cat/goldfish could collect.
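To make “monitor what matters” concrete, here’s a minimal sketch - my own illustration, not any vendor’s API - of filtering a noisy log stream down to actionable lines. The severity list and noise patterns are hypothetical; the point is that a severity threshold plus a small blocklist (say, for the ClassNotFoundException spam above) already separates the 3% gold from the 97% crap:

```python
import re

# Hypothetical noise patterns -- the stuff that makes up the "97%".
NOISE = [
    re.compile(r"ClassNotFoundException"),
    re.compile(r"^DEBUG"),
]

# Severities worth waking a human up for (illustrative choice).
ACTIONABLE = ("ERROR", "FATAL")

def actionable_lines(log_lines):
    """Yield only log lines worth a human's attention."""
    for line in log_lines:
        # Drop known noise regardless of severity.
        if any(p.search(line) for p in NOISE):
            continue
        # Keep only lines at or above the alerting threshold.
        if line.startswith(ACTIONABLE):
            yield line

sample = [
    "DEBUG loading bean definitions",
    "ERROR java.lang.ClassNotFoundException: com.foo.Bar",
    "INFO Server started in RUNNING mode",
    "ERROR payment service timed out after 30s",
]
print(list(actionable_lines(sample)))
# -> ['ERROR payment service timed out after 30s']
```

Four lines in, one actionable line out - which is roughly the ratio the numbers above suggest you should expect.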

Collecting everything is almost an epidemic right now in the world of monitoring. Cameron Haight from Gartner said something interesting a few weeks back at Gartner DC: “Just because you are charting and measuring it doesn’t mean it provides value.” Amen, sister. Cameron’s point was that metrics and data need to be actionable. A spike on a chart is not actionable unless you know what really caused the spike or, more importantly, what the spike actually means in the bigger picture - think symptom vs. root cause.

Getting quality monitoring data that delivers actionable insight is hard. I see log files and OS/infra metrics as mining copper: no matter how hard you search, their value will always be limited. By comparison, I see application run-time data as mining gold, because it’s generally rare and, when mined properly, can be extremely valuable. Over the past 5 years, I’ve always found it odd that the likes of Splunk/Sumo/Elastic never built or acquired a BCI application run-time agent.

A classic recent example of this was Datadog announcing that it’s going to enter the APM market and take on the likes of AppDynamics and New Relic. Datadog started with infrastructure monitoring and collected every mofo metric on the planet. APM is a pretty ballsy move for Datadog if you ask me, but it’s an obvious move, and one that other vendors like SignalFx, Wavefront and Sumo Logic must make if they are to grow and survive. Selling commoditized data which you can stream onto a chart in near real-time isn’t anything new. Wily Introscope did that back in 2004, albeit with a pants UI which took days to render in a large environment. My point is that collecting millions of data points looks pretty, but doesn’t deliver the context, insight or business value required these days to solve problems.

There is a sound reason why AppDynamics, New Relic and Dynatrace are kicking ass, and it’s not because they pride themselves on collecting OS or infrastructure metrics in real-time. At the end of the day it’s about the quality of data: quality data creates actionable information once you layer analytics on top of it. Metrics are either useful or not useful, and just because you can collect them doesn’t mean they unlock value.

Don’t get me wrong, I love a sexy pink and orange pie chart as much as the next DevOps dude. But in the real world, and based on my experience, these charts are only as good as the data and insight behind them. Sex sells, but it doesn’t solve problems; it just creates more of them when your app is on fire.

Steve.

@BurtonSays

Nice article Stephen, I have seen this first hand many a time and have had to wade through the dross to find the occasional nugget. Actually, if 97% (or any sizeable percentage) of the content of logs is crap/line-noise, then this in itself is valuable - as a stick with which to (proverbially) beat those responsible for generating it in the first place, and to 'help' them ameliorate the situation.


97% of log data is crap...until you need it. I see three uses for log data. First, as you talk about, is monitoring the state of your systems. If you can configure your applications to only spit out information that's useful in your monitoring efforts, then great. But most of us have to have our external monitoring software filter through all the crap to alert us to important events. What you or I think is important may vary, so it's critical that the monitoring application be flexible enough to let us search for our own alert criteria. The second use is forensics. Many times I've found vital clues to mysterious crashes or other problems by sifting through the mountains of crap. Sometimes it's quite an expedition, but I'd rather have too much data than be missing the one part I need. Third is optimization. That's where developers take on the mission of reducing the amount of log data by removing the errors that are causing it. For an in-house staff, that can be encouraged by sending them the logs of their errors daily and having management review the logs and ask "why do we get these errors 2,500 times every day?" The problem with collecting all the extra data is mostly one of network bandwidth and disk space, and secondarily processor load - all of which are relatively easy to expand today. I would rather collect too much and have it available than skip some as-yet-unknown vital bit because I was discarding everything I didn't deem "important."

Great article - and yes, monitoring is not about seeing as many graphs as possible but about selecting the right ones: the ones that show you why you have a performance problem and how to improve response times.

Sorry, but I have to disagree. The correct mantra is: monitor everything, collect data on everything, alert only on what matters. I have been able to use the sheer volume of data to recreate accurate timelines of how and when small failures cascade into large failures. I have used such data to identify the trigger events of small and obscure bugs that created significant impact. And I have done those things dozens of times in the past few years - all because I had that data. The load difference of such monitoring is tiny, so the only real cost is in the sheer volume of data, and hard disks are cheap.

I 100% agree with this. I recently had a situation where we were parsing about 75% of our data, of which only 3-5% produced alerts the NOC actually acted upon. We then decided to discard all that unwanted data, and we now have efficient and effective monitoring in place.
