Let's Talk About Our Problems
At some point it is obligatory that I start talking about IT Service Management and ITIL and other fun stuff. This is professional social networking, after all. Most would start with Incident Management. It is a logical place to start the IT Service Management discussion and, if you were building a Service Management framework from scratch, a lot of people would suggest starting at incident.
But you aren't starting from scratch. Most organizations have at least some rudimentary approach to incident management and, despite the number of "is this an incident or a service request or a change" religious debates you will find in the various ITSM/ITIL forums... incident is pretty straightforward. See the incident, resolve the incident. Yes, I oversimplify a bit, but in my defense many over-complicate. So, let's talk about problems instead.
Problems... everyone has them. The question is, how do we manage them? From an ITIL perspective, a problem is the cause or causes of one or more incidents and the purpose of Problem Management is to manage the lifecycle of problems. Well, that's what the book says. Like most things in ITIL, however, this is a general statement about the framework. To effectively perform Problem Management an organization has to make some decisions and implement some actual processes. Here are a few suggestions on how to get started:
To Do
Define the primary objective of the Problem Management process
In reality, you and your organization need to figure out what you want out of a Problem Management process. Check a box on your ISO certification audit? Communicate the causes of failures to stakeholders? Track known errors and workarounds? Reduce the number and/or impact of incidents? All of these are valid objectives, but they also will result in slightly or severely different processes. My favorite objective for Problem Management is "eliminate high severity incidents". Short and straightforward, and focuses the process on eliminating critical impacts to the business and customers. If your primary objective, however, is to generate documents of perfect prose thicker than train smoke that can be spoon-fed to senior leadership, that is also a perfectly valid objective... but it requires a completely different process than the one that focuses on actual remediation. Sure, you can have multiple objectives, but start with the primary objective. Design that process, then look at that process to see what changes would be required to achieve other objectives and evaluate those costs. So, pick your primary objective. Go ahead, I'll wait. Here are some suggestions to consider:
- Eliminate high impact incidents
- Communicate the causes of failures to stakeholders
- Comply with certification, legal, regulatory, or audit requirements
- Improve service levels
- Reduce outage times
Define the basic scope of the process
Unless you are dealing with unlimited resources, you can't conduct problem management for every incident. Additionally, whoever is executing the process needs defined inputs. You and your organization will need to decide where you are going to focus your limited resources to achieve the maximum results towards your primary objective. Of course, the primary objective will go a long way towards defining your scope. If your primary objective is to "eliminate high severity incidents", then "all severity 1 and severity 2 incidents" is a good start for a scope. If your primary objective is to communicate the causes of failures to stakeholders, then maybe "severity 1 incidents impacting the following services..." is the scope for you. My point is, the scope needs to be defined in a way that is concrete. I'm not saying you can't ever perform ad-hoc problem management on something that doesn't fit the primary scope. That is not only acceptable, it is encouraged (it means the problem process is effective and people want to use it)! I am saying that if you don't know the basic scope, it becomes very hard to estimate volume. By looking at the goal, the scope, the volume, and the resources available you will have a very good idea of what the actual process needs to look like. Your basic scope should be concrete enough that it can be turned into a query of your service management system. Some suggestions (pick one or more, they stack):
- All severity 1 and severity 2 incidents
- Recurring incidents. A note here: be prepared to define "recurring". Does that mean "3 incidents of any severity with the same resolution code against the same Configuration Item (CI) within 30 days" or "10 incidents of any severity with the same taxonomy against any CIs within 30 days"? There are lots of options, but again it needs to be defined well enough to generate a query to the service management tool.
- Incidents of interest to stakeholders (defined by CIs, services, locations, etc.)
- Incidents impacting CIs in scope of certain regulations or audits (Payment Card Industry, regulated manufacturing, etc.)
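To make "concrete enough to be turned into a query" a little more tangible, here is a minimal sketch in Python of two of the scope rules above: all severity 1 and 2 incidents, plus "3 incidents with the same resolution code against the same CI within 30 days". Everything in it is an assumption for illustration: the `Incident` fields, the sample data, and the reading of "within 30 days" as "the first and third incident are no more than 30 days apart". Your service management tool will have its own field names and query language.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; real field names depend on your tool.
    number: str
    severity: int  # 1 is the most severe
    ci: str
    resolution_code: str
    opened: date


def high_severity(incidents, max_severity=2):
    """Return incidents at severity 1 or 2."""
    return [i for i in incidents if i.severity <= max_severity]


def recurring(incidents, threshold=3, window_days=30):
    """Return groups (same CI, same resolution code) in which at least
    `threshold` incidents were opened within `window_days` of each other.
    The whole group is returned once it qualifies."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[(inc.ci, inc.resolution_code)].append(inc)

    window = timedelta(days=window_days)
    hits = {}
    for key, group in groups.items():
        group.sort(key=lambda i: i.opened)
        # Check every run of `threshold` consecutive incidents.
        for start in range(len(group) - threshold + 1):
            first, last = group[start], group[start + threshold - 1]
            if last.opened - first.opened <= window:
                hits[key] = group
                break
    return hits


def in_scope(incidents):
    """Union of both rules, because the scope criteria stack."""
    selected = {i.number for i in high_severity(incidents)}
    for group in recurring(incidents).values():
        selected.update(i.number for i in group)
    return sorted(selected)


# Made-up sample data, purely for illustration.
SAMPLE_INCIDENTS = [
    Incident("INC001", 1, "billing-db", "HW_FAIL", date(2024, 1, 2)),
    Incident("INC002", 3, "vpn-gw", "CERT_EXPIRED", date(2024, 1, 5)),
    Incident("INC003", 3, "vpn-gw", "CERT_EXPIRED", date(2024, 1, 12)),
    Incident("INC004", 3, "vpn-gw", "CERT_EXPIRED", date(2024, 1, 25)),
    Incident("INC005", 4, "mail-relay", "USER_ERROR", date(2024, 1, 9)),
    Incident("INC006", 3, "print-srv", "DRIVER", date(2024, 1, 3)),
    Incident("INC007", 3, "print-srv", "DRIVER", date(2024, 3, 20)),
]

if __name__ == "__main__":
    print(in_scope(SAMPLE_INCIDENTS))
```

Counting what a query like this returns over the last few months is also a quick way to estimate the volume your process will have to handle.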
Define the borders of the process
While technically part of the scope, the borders of your process need to be considered separate from the basic scope. What teams, organizations, vendors, suppliers, and customers will participate in your Problem Management process? Does your process stop at the edge of the CIO's organization? Or does it stop at the edge of the company? Or will your vendors and customers be directly or indirectly involved? This is important to identify because, like the primary objective and the basic scope, it will greatly influence the volume and therefore the process itself. It will also define who needs to be involved in supporting the process, and those people will need to understand the objectives. Some suggestions:
- Only the IT teams will directly support the Problem Process. Root or contributing causes identified beyond the IT organization may be noted (possibly as Known Errors), but the problem handlers and problem managers will not drive them to remediation.
- The entire company will support the Problem Process, including non-IT organizations (manufacturing, customer service, regulatory compliance, etc.). The root causes of many incidents lie beyond technology and technology processes. Maybe they can be found in the way users (internal or external) are using the technology... or in the processes that support those teams.
- Define how external organizations will participate in the process. Will technology teams manage root cause analysis from vendors, or will vendor management? Will customer service organizations manage the identification and remediation of causes that reside with customers?
There are terrible people who, instead of solving a problem, bungle it and make it more difficult for all who come after. Whoever can't hit the nail on the head should, please, not hit at all.
Do not
Allow the problem process to become punitive
I can't say why this happens, but it does, and frequently. Many organizations allow Problem Management to become a cudgel wielded by one organization against another. The customer is disappointed by an outage, or the technical team is disappointed by a technology failure. In reality the impact was minimal, but the frustration and disappointment trigger a manhunt worthy of The Fugitive. I'm not suggesting complaints that don't fit the basic scope should be dismissed. I'm just saying that those complaints should be critically evaluated to determine if the Problem Process is going to add value to addressing them. If the impact to the organization was minimal, and there is no evidence that it is a recurring situation, you have to question if spending 40 effort hours on Problem Management and remediation is worth it, or if the requestor is just seeking a publicly documented mea culpa.
Skimp on governance... who can authorize ad-hoc problem management and who can authorize problems for closure
Along with managing the ad-hoc problem workload, a person or team needs to be designated to review problems that are proposed for closure. Has the goal been accomplished? Does the identified root cause logically match the circumstances of the incident? Do the proposed remediations address the root cause and contributing causes? Individual problem handlers may not always have the visibility to make these determinations. Sure, the network team can identify why a switch failed, but they can't really address how the raccoon got in there. My point is governance. The process needs that.
Accept "human error" as a cause
Many organizations tend to give up when they get to a cause (root or contributory) that is essentially "human error". This, as Admiral Ackbar would say, is a trap. Humans will always make errors. Identifying this particular human's error will not reduce the likelihood of it happening again, even if that human is taken out of the process. Instead I encourage you to look past the error... what were the conditions that allowed the error to occur? For example, if the cause was "technician failed to execute byzantine process correctly", maybe you should focus on the byzantine process and not the human. Can the process be automated? Made simpler? Or maybe the process is required because the organization has refused to follow vendor configuration recommendations. There lies the "root" cause.
Additional suggestions
Train your team
The team(s) that will be supporting, managing, and executing the Problem Management process should be trained. Not only on the process itself and the service management tool(s) that will support the process, but also on how to do good root cause analysis. This is not a skill many people have, even technical people. A great place to start with basic root cause analysis is 5 Whys. 5 Whys is a simple technique that almost anyone can learn and that generally produces very good root cause analysis results. It forces problem handlers to look past the most evident, proximate cause and see the conditions that allowed that cause to exist.
Start with a good problem statement
If you start with a problem statement of "server xyz567 crashed", that is the problem your team will analyze and try to solve. If, instead, you start with a problem statement of "the billing system was unavailable for 6 hours during month-end processing", all of a sudden your team is tackling the root causes and all the contributory causes. Why was it unavailable? Why was it unavailable for 6 hours? Why did this happen at a particularly impactful time? All of those questions should be analyzed and answered... if your goal is the elimination of incidents. If your goal is stakeholder communication, maybe just telling the stakeholder why the server crashed is sufficient.
Determine measurements for success
All good processes need measurements for success, or service level objectives, or key performance indicators. Whatever you call it, you need a way to tell if the process you have stitched together is actually accomplishing the primary goal we talked about way up at the top. Some suggestions, correlating back to the primary goals above:
- % reduction in the number of severity 1 and severity 2 incidents
- % of incidents meeting stakeholder criteria where root cause is identified
- % reduction in audit or regulatory findings
- % reduction in downtime
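None of these measurements needs anything fancy. As a rough sketch, assuming you can pull a baseline count and a current count for comparable periods out of your service management tool, the arithmetic looks like this in Python. The function names and the numbers in the usage lines are hypothetical.

```python
def percent_reduction(baseline, current):
    """Percentage drop from baseline to current.
    A negative result means the number went up."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) * 100 / baseline


def percent_with_root_cause(total_in_scope, root_cause_identified):
    """Share of in-scope incidents where a root cause was identified."""
    if total_in_scope <= 0:
        raise ValueError("total_in_scope must be positive")
    return root_cause_identified * 100 / total_in_scope


if __name__ == "__main__":
    # Made-up numbers: 40 sev 1/2 incidents last quarter, 28 this quarter.
    print(percent_reduction(baseline=40, current=28))
    # Made-up numbers: 24 incidents met stakeholder criteria, 18 have a root cause.
    print(percent_with_root_cause(total_in_scope=24, root_cause_identified=18))
```

The hard part is not the math. It is agreeing up front on the baseline period and on a counting rule that matches the scope you defined.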
Like most IT Service Management processes, Problem Management is what you make of it. ITIL provides a framework, but you have to fill in the blanks... and there are a bunch of them. But start with the basics and consider some of the suggestions above. When implemented well, Problem Management is one of the most effective and beneficial IT Service Management processes, ever.
Outstanding!