Identify, predict and isolate problems faster

This article is Part 2 of the series A practical guide to AIOps. Here is a quick recap of Part 1: AIOps, simply put, applies analytics and machine learning to automate IT Operations and make it more efficient. The key AIOps use cases for increasing operational efficiency are:

  1. Identify, predict and isolate problems faster (Part 2)
  2. Analyze large amounts of operational data for root cause analysis (Part 3)
  3. Prioritize problems and recommend solutions (Part 4)
  4. Forecast capacity and optimize resource utilization (Part 5)
  5. Autonomous IT Operations (Part 6)

This article focuses on AIOps use case #1: identifying, predicting and isolating problems faster. This use case helps IT organizations that face challenges such as:

  • Many outages go unnoticed, until a customer complains
  • Outages take too long to isolate, and operators have to troubleshoot every IT system
  • Too many events (noise), so important events get ignored
  • Operators staring at dashboards to ensure that all IT Systems are operational

AIOps solutions rely on analytics and machine learning, so they need a lot of operational data. The key success factor for identifying and predicting problems quickly and accurately with AIOps is to monitor everything. Three things ensure that you are in fact monitoring everything:

  1. Monitor your full-stack from Infrastructure to Application
  2. Collect logs, traces and metrics, covering everything from availability to performance
  3. Repeat 1 & 2 for all of your application dependencies
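
One way to check the three points above is to treat "monitor everything" as a coverage matrix and look for the empty cells. The following is a minimal sketch, not a reference to any particular AIOps product: the layer names, the service names and the `coverage_gaps` helper are all hypothetical, and a real stack will usually have more layers than the two assumed here.

```python
# Assumed layers and signal types; extend these to match your own stack.
LAYERS = ("infrastructure", "application")
SIGNALS = ("logs", "traces", "metrics")

def coverage_gaps(inventory, dependencies, root):
    """List (service, layer, signal) combinations that are not monitored
    for `root` and for everything it depends on, directly or indirectly."""
    gaps = []
    seen = set()
    stack = [root]
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        monitored = inventory.get(service, set())
        for layer in LAYERS:
            for signal in SIGNALS:
                if (layer, signal) not in monitored:
                    gaps.append((service, layer, signal))
        # Point 3: repeat the same check for each dependency.
        stack.extend(dependencies.get(service, []))
    return sorted(gaps)

# Hypothetical example: "checkout" is fully monitored, its database is not.
full = {(layer, signal) for layer in LAYERS for signal in SIGNALS}
inventory = {
    "checkout": full,
    "payments-db": {("infrastructure", "metrics")},
}
dependencies = {"checkout": ["payments-db"]}

gaps = coverage_gaps(inventory, dependencies, "checkout")
for service, layer, signal in gaps:
    print(f"missing: {service} / {layer} / {signal}")
```

In this made-up inventory, every gap reported belongs to the dependency, which is exactly the kind of blind spot the third point is meant to catch.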

With all this operational data, AIOps can identify and predict problems quickly and accurately. It can reduce event noise by using dynamic, statistically derived thresholds instead of fixed ones, and it can predict operational anomalies with custom or out-of-the-box problem signatures.
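
To make the idea of a dynamic threshold concrete, here is a minimal sketch under simple assumptions: the threshold for each data point is the mean of the preceding window plus `k` standard deviations. The function name, the window size, the value of `k` and the sample latencies are illustrative choices, not something a specific AIOps tool prescribes; real products typically use more robust models that account for seasonality.

```python
from statistics import mean, stdev

def dynamic_threshold_anomalies(values, window=5, k=3.0, min_band=1.0):
    """Return the indices of values that exceed a threshold computed from
    the preceding `window` values: mean + max(k * stdev, min_band)."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # min_band keeps a perfectly flat history from flagging tiny changes.
        upper = mu + max(k * sigma, min_band)
        if values[i] > upper:
            anomalies.append(i)
    return anomalies

# Hypothetical response times in milliseconds, with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 99, 100]
spikes = dynamic_threshold_anomalies(latency_ms)
print(spikes)
```

Because the threshold follows recent behavior, the same code can watch a metric that normally sits at 100 ms and one that normally sits at 2 seconds without hand-tuning a fixed limit for each, which is where the noise reduction comes from.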

A number of metrics can be used to measure the operational efficiency gained with use case #1. I find the following KPIs most relevant for this use case.

  • Count of customer facing incidents
  • Mean time to detect / know (MTTD / MTTK)
  • Business agility metrics, e.g. count of change requests or tickets (excluding incidents)
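
As an illustration of the second KPI, mean time to detect can be computed as the average gap between when a problem started and when it was detected. This is a minimal sketch: the `mean_time_to_detect` helper and the timestamps are made up for the example, and in practice the start and detection times would come from your incident management system.

```python
from datetime import datetime

def mean_time_to_detect(incidents):
    """Average number of minutes between the start of a problem and its
    detection. `incidents` is a list of (started, detected) datetimes."""
    delays = [
        (detected - started).total_seconds() / 60
        for started, detected in incidents
    ]
    return sum(delays) / len(delays)

# Hypothetical incidents: detected after 30 minutes and after 10 minutes.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 10)),
]
mttd_minutes = mean_time_to_detect(incidents)
print(f"MTTD: {mttd_minutes:.1f} minutes")
```

Tracking this number before and after rolling out use case #1 gives a simple way to see whether problems are actually being detected faster.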

The ideas, opinions and research presented in this article are my personal views on this subject. Please stay tuned to read more about the other AIOps use cases. Please leave comments, share your experiences and thoughts, and join the conversation.

More articles by Maneesh G.