Move Over ITIL, DevOps is Here

But Don’t Throw the Proverbial Baby Out with the Bathwater Just Yet.

DevOps, like Agile development in the early 2000s, has no single prescription. It is not a program, a team, or a project. I always get a kick out of the well-known DevOps Venn diagram that shows Development, QA, and Operations with the intersection labeled “DevOps”, because it represents whatever you want it to represent. But that ambiguity is exactly what DevOps is. DevOps is a paradigm shift that lets an organization deliver products faster, more predictably, and with greater reliability by granting “developers” greater control of the environment and by building applications that support the infrastructure. How to get there, however, is every organization’s unique journey.

Since the first DevOpsDays in 2009, numerous best practices and hundreds of cloud and open source services have emerged to champion DevOps. Today most would agree that automating routine processes, continuous integration, continuous delivery, and test automation are standard fare when implementing DevOps. But what about IT Operations? Will “developers” support the applications going forward, fix all defects, detect anomalies, respond to outages 24/7, and interact with customers? The short answer is “yes”. There is still a role for Service Operation, but it means Service Operation must transition from “operating” to “developing”.

In today’s complex business landscape, IT Operations provides the foundation for critical business systems. This is especially vital in a Fortune 50 technology company, where most customer-facing business services are online and applications change by the minute. Between continuous delivery, content management systems that let business users modify HTML and style sheets, construct new workflows, and run Test and Target campaigns, and “service offers” narrowing almost to the individual customer, applications are changing continuously. At this pace, knowledge bases, “warranty periods”, handoffs, and change control boards (CCB or CAB) are no longer manageable. For Service Operation to keep delivering a stable infrastructure and protecting the business from continuous change, we must adapt: we must become the “developers” who automate our routine processes and drive “operation” into the development lifecycle.

Imagine the following scenario: James, a developer, finishes a piece of functionality and gets the OK to push the change to production. The code is deployed on demand via automated deployment processes (TeamCity, Octopus). The new feature is now live; James validates it in production and checks the success metrics. Everything looks good, so James goes back to work. As traffic to the website increases, the system detects an intermittent anomaly: existing customers with product type A are triggering an exception, and although the exception appears to be handled, it is causing a slight decline in the conversion rate for that step in the workflow. Based on business rules, the system determines that this is an incident impacting orders, so it takes action. Using segmentation, it changes the configuration so the feature is unavailable to existing customers with product type A (Sitecore). Simultaneously, the system creates an incident assigned to James in the incident tracking tool (ServiceNow, Jira), pings James (Slack), and sends a text to his mobile phone via an automatic engagement system (PagerDuty). The system also compiles a list of customers who may have been affected by the defect and sends it to customer care. Meanwhile, James receives the text, which reads: “Existing customers with product type A were receiving a null reference exception at line 55 in mainpage.aspx.” James fixes the error and redeploys the changes. He updates the incident system, and the system removes the blocking configuration. If the incident does not recur within 48 hours, the system closes it as “resolved”.
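The decision logic in the James scenario can be sketched in a few small functions. This is a minimal, hypothetical Python sketch, not the workflow of any tool named above: the 2% conversion-drop threshold, the segment label, and all function names are illustrative assumptions, and the ticketing, chat, and paging steps are only recorded as strings rather than calls to ServiceNow, Jira, Slack, or PagerDuty. Only the 48-hour quiet period comes from the scenario itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Set

# Hypothetical tuning values; in practice these would come from the business rules.
CONVERSION_DROP_THRESHOLD = 0.02    # absolute drop in conversion rate that counts as impact
QUIET_PERIOD = timedelta(hours=48)  # fix must hold this long before the incident is resolved


@dataclass
class Incident:
    segment: str                    # e.g. "existing customers / product type A"
    assignee: str                   # the developer who shipped the change
    opened_at: datetime
    status: str = "open"            # open -> fixed -> resolved (or back to open)
    fixed_at: Optional[datetime] = None
    actions: List[str] = field(default_factory=list)  # audit trail of automated steps


def is_order_impacting(baseline_rate: float, current_rate: float, exceptions: int) -> bool:
    """Business rule: the segment is throwing exceptions AND its conversion rate has dropped."""
    return exceptions > 0 and (baseline_rate - current_rate) >= CONVERSION_DROP_THRESHOLD


def open_incident(segment: str, assignee: str, now: datetime, disabled: Set[str]) -> Incident:
    """Contain the impact first, then engage the owner."""
    incident = Incident(segment=segment, assignee=assignee, opened_at=now)
    disabled.add(segment)  # stand-in for a segment-targeted feature toggle in the CMS
    # Placeholders for ticketing, chat, and paging integrations; no real API is called here.
    incident.actions += [
        f"feature disabled for {segment}",
        f"ticket assigned to {assignee}",
        f"chat message sent to {assignee}",
        f"page sent to {assignee}",
    ]
    return incident


def mark_fixed(incident: Incident, now: datetime, disabled: Set[str]) -> None:
    """Developer redeployed a fix: lift the block and start the quiet-period clock."""
    incident.status = "fixed"
    incident.fixed_at = now
    disabled.discard(incident.segment)
    incident.actions.append(f"feature re-enabled for {incident.segment}")


def record_recurrence(incident: Incident, disabled: Set[str]) -> None:
    """The anomaly came back: reopen the incident and re-apply the block."""
    incident.status = "open"
    incident.fixed_at = None
    disabled.add(incident.segment)
    incident.actions.append("recurrence detected; incident reopened")


def maybe_close(incident: Incident, now: datetime) -> str:
    """Close as resolved once a fix has held for the full quiet period."""
    if incident.status == "fixed" and incident.fixed_at is not None:
        if now - incident.fixed_at >= QUIET_PERIOD:
            incident.status = "resolved"
    return incident.status


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    disabled: Set[str] = set()
    segment = "existing customers / product type A"
    if is_order_impacting(baseline_rate=0.10, current_rate=0.07, exceptions=5):
        incident = open_incident(segment, "James", now, disabled)
        mark_fixed(incident, now + timedelta(hours=1), disabled)
        print(maybe_close(incident, now + timedelta(hours=30)))  # fixed (only 29 h since the fix)
        print(maybe_close(incident, now + timedelta(hours=49)))  # resolved (48 h since the fix)
```

Containment (disabling the feature for the affected segment) happens before anyone is notified, mirroring the order of events in the scenario, so order impact is limited before a human is in the loop.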

This is a utopian scenario, and we all know it is not reality; however, achieving something close to it is not impossible. For IT/Service Operation to participate in DevOps, it should begin with the ability to programmatically detect environmental changes and quickly correlate negative events so the cause can be addressed, ideally by programmatic correction or by rolling back changes. This requires that all the manually run queries used to investigate incidents, emailed status reports, business metrics and analytics, etc., be “developed” into extremely granular alerts with sensitive thresholds across all operational tools. Depending on an organization’s operational maturity, those tools can include performance and availability synthetics, network and resource monitoring, application and system logs, systems and application monitoring tools, business analytics, customer feedback, defect reports, change control systems, and build and deploy alerts or logs. With centralized event management and granular alerting, a single incident will trigger many alerts, and the resulting “noise” would overwhelm manual triage. Correlation engines such as BigPanda, Moogsoft, and Ignio are a new class of applications that use data science and machine learning to group and correlate all the events related to an incident.
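To make “granular alerts with sensitive thresholds” and “correlation” concrete, here is a deliberately naive Python sketch. Commercial engines such as BigPanda, Moogsoft, and Ignio use data science and machine learning; this example does not reproduce their methods. It simply assumes that every alert carries a service tag (a hypothetical convention) and groups alerts on the same service that arrive within a fixed time window of one another. The metric names and values in `example_events` are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    source: str   # tool that raised it: synthetics, logs, analytics, deploy pipeline, ...
    service: str  # hypothetical tag naming the affected component
    detail: str
    at: datetime


def check_threshold(source: str, service: str, metric: str,
                    value: float, threshold: float, at: datetime) -> Optional[Event]:
    """Granular alert: emit an event as soon as a single metric crosses its threshold."""
    if value > threshold:
        return Event(source, service, f"{metric}={value} exceeds {threshold}", at)
    return None


def correlate(events: List[Event], window: timedelta) -> List[List[Event]]:
    """Group events on the same service that arrive within `window` of the previous one."""
    by_service: Dict[str, List[Event]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e.at):
        by_service[event.service].append(event)

    groups: List[List[Event]] = []
    for service_events in by_service.values():
        current = [service_events[0]]
        for event in service_events[1:]:
            if event.at - current[-1].at <= window:
                current.append(event)   # close enough in time: same candidate incident
            else:
                groups.append(current)  # gap too large: start a new group
                current = [event]
        groups.append(current)
    return groups


def example_events() -> List[Event]:
    """Hypothetical data loosely modeled on the James scenario (all values invented)."""
    t0 = datetime(2024, 1, 1, 9, 0)
    raw = [
        check_threshold("deploy", "checkout", "deployments", 1, 0, t0),
        check_threshold("logs", "checkout", "null_ref_per_min", 12, 0,
                        t0 + timedelta(minutes=2)),
        check_threshold("analytics", "checkout", "conversion_drop_pct", 2.5, 1.0,
                        t0 + timedelta(minutes=4)),
        check_threshold("synthetics", "search", "p95_latency_ms", 300, 800,
                        t0 + timedelta(minutes=3)),  # under threshold: no event
        check_threshold("network", "search", "packet_loss_pct", 3, 1,
                        t0 + timedelta(minutes=60)),
    ]
    return [e for e in raw if e is not None]


if __name__ == "__main__":
    for group in correlate(example_events(), window=timedelta(minutes=5)):
        print(group[0].service, [e.source for e in group])
    # checkout ['deploy', 'logs', 'analytics']
    # search ['network']
```

With a five-minute window, the deploy event, the exception spike in the logs, and the conversion drop on checkout collapse into one candidate incident, while the unrelated packet-loss alert an hour later on a different service stays separate. A fixed window and an exact tag match are easy to reason about, but they miss related events that span services or fall outside the window.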

Big data, cognitive systems, and automation are making our “James scenario” a real possibility. To take advantage of these advancements, however, Operations needs new skill sets and mindsets: we need to spend less time chasing defects and more time “developing” self-aware and self-healing applications. Once that culture is in place, we can move on to the next chapter, “Predictive Analytics”.

More articles by William L. Ferguson
