IT Operations Analytics (ITOA) - The data-driven approach to IT Operations
According to Gartner's estimates, by 2017 around 15% of enterprises will be using IT Operations Analytics (ITOA) on the huge volumes of IT operational data to discover complex patterns and thus gain better insights, either to reduce 'run-the-business' costs or to identify opportunities for 'grow-the-business' transformations. This, in effect, ushers in a paradigm shift in IT Operations Management - from the current "tools-driven" approach to a "data-driven" one. The tools-driven approach relies on silos of non-integrated tools, which fail to deliver the agility required to manage the ever-growing complexity of hybrid cloud platforms and dynamic data centers.
Simply put, ITOA applies big data principles: it extracts structured as well as unstructured data (Availability Data, Performance Data, Machine Data, Service Management Data) from diverse data sources (Application Logs, Monitoring Tools, Syslogs, APM Tools, Event Managers, ITSM Tools etc.), stores and indexes it, and finally correlates and analyzes it to derive deeper business insights and proactive indicators of any potential or impending business-impacting events.
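To make the correlation idea concrete, here is a minimal sketch of the core step - grouping events from different sources that fall within the same time window. The event records, sources and timestamps are hypothetical illustrations, not taken from any specific ITOA product:

```python
from datetime import datetime, timedelta

# Hypothetical events gathered from different sources (timestamps are illustrative).
events = [
    {"source": "syslog",  "time": datetime(2015, 6, 1, 10, 0, 5),  "host": "db01",  "msg": "tablespace 92% full"},
    {"source": "apm",     "time": datetime(2015, 6, 1, 10, 0, 40), "host": "app01", "msg": "response time > 5s"},
    {"source": "monitor", "time": datetime(2015, 6, 1, 10, 1, 10), "host": "db01",  "msg": "high IO wait"},
    {"source": "itsm",    "time": datetime(2015, 6, 1, 14, 0, 0),  "host": "app02", "msg": "password reset request"},
]

def correlate(events, window=timedelta(minutes=5)):
    """Group events whose timestamps fall within the same time window.

    A group containing events from multiple sources is a candidate
    correlated incident, worth root-cause analysis.
    """
    groups = []
    for ev in sorted(events, key=lambda e: e["time"]):
        if groups and ev["time"] - groups[-1][0]["time"] <= window:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups

groups = correlate(events)
# The first three events (syslog + APM + monitoring within ~1 minute) form one
# multi-source group; the unrelated ITSM ticket hours later stands alone.
```

A real pipeline would of course stream, index and persist this data at scale; the sketch only shows why correlating over a time window surfaces related symptoms that siloed tools report separately.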
Without getting too much into the technicalities, let me summarize some of the best use cases that can potentially be addressed by ITOA:
1. Quickly find the root cause of a business outage: It is possible to crawl data from multiple sources and correlate it over a time window in near-real time, gaining better visibility into business-impacting issues and quickly identifying the layer attributable to the root cause. Whether the root cause is a recent change to a configuration item, a faulty load balancer, a highly utilized tablespace or high resource consumption by a rogue long-running batch process becomes easier to determine. Traditional event managers lack such powerful correlation features.
2. Topology-based event visualization: Overlay IT or business events on a topology map that provides application-to-infrastructure mapping, thereby giving a powerful visualization of business impacts and helping to quickly identify a possible root cause. This improves operational visibility.
3. Identify repetitive patterns of issues: Identify and learn the patterns that lead to different kinds of business outages due to availability, performance or capacity issues. These patterns can subsequently be used to predict an impending failure before it actually happens.
4. Dynamic baselining: Using machine learning techniques, it is possible to baseline the behaviour or consumption patterns of different IT resources such as business workload, CPU, memory, IO etc. This information can be used to reset alerting thresholds or rebalance compute capacity.
5. Identify unauthorized changes: Unauthorized changes to configuration items made without a valid RFC can be tracked and identified.
6. Ensure environmental consistency & release validation: Configuration consistency can be tracked in real time between (i) IT assets under a cluster/application group and (ii) IT assets in the DC and DR for the same application group. Also, with faster release cycles it is necessary to validate the correctness of a release in an automated manner, to prevent a business outage immediately after the release. This becomes extremely important as we move into a hybrid, dynamic landscape with faster deployment cycles and closer to the DevOps paradigm.
7. Identify business events: Business events can be defined along the nodes of a business service spanning multiple applications. Rule-based thresholds can then be applied to detect and determine outages or breaches for those business events.
8. Identify automation opportunities: From Service Management (Incident & Service Request) data, it is possible to identify patterns of high-volume, low-severity tickets that need only procedural steps for resolution. These are ideal candidates for either shift-left or automation, e.g. password resets, user on-boarding, various provisioning requests etc.
9. Track pockets of End of Life (EoL)/End of Support (EoS) devices and their performance & impact: In every IT landscape there are pockets of EoL/EoS assets whose availability and performance need close monitoring.
10. Track operational effectiveness: The various operational metrics related to the standard ITSM and Infrastructure Management processes, and their SLAs/OLAs, can be tracked. This helps identify weak spots in the processes and opportunities for improvement.
11. Identify opportunities for rationalization/optimization/consolidation and transformation: Valuable insights can be derived from piles of complex data, helping to identify opportunities for rationalization, optimization and consolidation of IT assets as well as transformation initiatives.
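The dynamic baselining idea above can be sketched very simply: derive an alerting threshold from a metric's own recent behaviour rather than a fixed value. The window size, the multiplier k and the sample data below are illustrative assumptions, not a prescribed method:

```python
import statistics

def dynamic_threshold(samples, window=24, k=3.0):
    """Derive an alerting threshold from recent metric behaviour.

    Uses the last `window` samples: threshold = mean + k * stdev.
    Both `window` and `k` are illustrative tuning choices.
    """
    recent = samples[-window:]
    return statistics.mean(recent) + k * statistics.pstdev(recent)

# Hypothetical hourly CPU% samples for one server over a day.
cpu = [35, 38, 40, 36, 37, 41, 39, 38, 36, 40, 37, 39,
       42, 38, 36, 40, 41, 37, 39, 38, 40, 36, 41, 39]

threshold = dynamic_threshold(cpu)
# Normal fluctuation stays under the learned threshold; a sudden 85% spike
# would breach it and raise an alert.
alerts = [x for x in cpu + [85] if x > threshold]
```

A static threshold of, say, 80% would be far too loose for this server, while the learned baseline (around 44% here) flags a genuine deviation from its normal pattern - the same principle applies to memory, IO or business workload metrics.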
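Likewise, mining ticket data for automation candidates reduces to a simple frequency analysis. The ticket categories and severities below are hypothetical examples, assuming records exported from an ITSM tool:

```python
from collections import Counter

# Hypothetical (category, severity) pairs from an ITSM incident/service-request export.
tickets = [
    ("password reset", "low"), ("password reset", "low"), ("password reset", "low"),
    ("disk full", "high"), ("user onboarding", "low"), ("user onboarding", "low"),
    ("db outage", "critical"), ("password reset", "low"),
]

def automation_candidates(tickets, min_volume=2):
    """Flag high-volume, low-severity ticket categories as shift-left/automation candidates."""
    low_sev = Counter(cat for cat, sev in tickets if sev == "low")
    return [cat for cat, n in low_sev.most_common() if n >= min_volume]

candidates = automation_candidates(tickets)
# Frequent, low-severity, procedural categories such as password resets and
# user onboarding surface at the top; rare critical outages do not.
```

In practice the categories would come from ticket classification (often itself text analytics over ticket descriptions), but the selection logic - high volume, low severity, procedural resolution - is exactly this filter.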
***
Disclosure: I am an employee of Tata Consultancy Services Ltd. The opinions expressed herein are my own and do not reflect those of the company.