Incident Management & DEV/OPS

A lot of people say that if you do DEV/OPS you no longer need incident management. This is not correct. At best, with DEV/OPS you no longer need a central incident management team. In this article I will explain why you still need incident management, and why in some cases there is still a need for a central incident management organization.

The DEV/OPS way of looking at it.

Every team in DEV/OPS is responsible for their own application, hardware, databases, etc. So they are also responsible for failures on all the components that make up their application. If a failure happens, they need to resolve it. The question becomes: how are they going to go about this? How are they going to manage this incident? (Incident Management.) So we do need incident management.

DEV/OPS incident management in practice.

For DEV/OPS, only 3 categories / priorities of incidents are needed (a small code sketch of these follows the list).

  1. The incident is so big that it needs to be addressed immediately. This means drop whatever you are working on and work until it is fixed. Breaks the sprint open with extra work.
  2. The incident is bad and needs fixing, but can wait until somebody has finished the piece of work they are currently on. Working hours only. Needs to be fixed within the sprint. Breaks the sprint open with extra work.
  3. Can be picked up in the next sprint.
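
To make this concrete, here is a minimal Python sketch of how a team could encode these three priorities in its own tooling. It is purely illustrative; the names and messages are my assumptions, not something prescribed by DEV/OPS.

    from enum import Enum

    class IncidentPriority(Enum):
        P1_INTERRUPT_NOW = 1   # drop everything, work until fixed
        P2_WITHIN_SPRINT = 2   # finish the current task first, working hours only
        P3_NEXT_SPRINT = 3     # goes on the backlog for the next sprint

    def handle_incident(priority: IncidentPriority) -> str:
        # Hypothetical routing of an incident onto the three DEV/OPS priorities.
        if priority is IncidentPriority.P1_INTERRUPT_NOW:
            return "Interrupt current work now; the sprint is broken open with extra work."
        if priority is IncidentPriority.P2_WITHIN_SPRINT:
            return "Pick up after the current task; fix within this sprint."
        return "Put it on the backlog for the next sprint."

    print(handle_incident(IncidentPriority.P2_WITHIN_SPRINT))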

Looking at this from an organizational point of view.

Scenario 1:

You have an organization / business with 10 DEV/OPS teams, and all these teams have nicely set up their own incident management. They all have different definitions of their priorities, so what one team considers a prio 1, another team considers a prio 2. That is not a problem as long as they work in isolation. But there are almost always connections between teams, such as shared infrastructure, databases or even direct API calls. When those connections are there and there is a mismatch in incident management, you can get to a point where you are unable to meet the agreements you have with your customers.

So you could do without a central incident management team in this case, as long as there is an effort to harmonize the way the DEV/OPS teams respond to incidents. Somebody needs to write an Incident Management Guide for the DEV/OPS teams, and you need to be able to enforce it.
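
One way to make such a guide enforceable, sketched here under the assumption that teams are willing to express their local priorities in code, is to publish the shared definitions as a small module that every team's incident tooling validates against. All names below are hypothetical.

    # Hypothetical shared Incident Management Guide as code: one canonical set of
    # priorities that every DEV/OPS team's local incident tooling must map onto.
    SHARED_PRIORITIES = {
        "P1": "Interrupt immediately, work until fixed",
        "P2": "Fix within the current sprint, working hours only",
        "P3": "Plan into the next sprint",
    }

    def validate_team_mapping(team_name, local_to_shared):
        # Fail fast if a team's local priority labels do not map onto the shared ones.
        unknown = set(local_to_shared.values()) - set(SHARED_PRIORITIES)
        if unknown:
            raise ValueError(f"{team_name} maps to unknown shared priorities: {unknown}")

    # Example: a team that calls its priorities "critical" / "high" / "normal".
    validate_team_mapping("payments-team", {"critical": "P1", "high": "P2", "normal": "P3"})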

Scenario 2:

A team finds a problem with their application but is unable to figure out where the problem is. Let's say they do not get an answer from a database. This could be a lot of things; here are a few (a quick first-pass check you could script is sketched after the list):

  • The firewall is closed towards the database
  • The firewall is closed from the database towards the application
  • A network routing issue
  • The database is busy with a long-running query
  • The database is down
  • The database is on a different site
  • The CPU of the database server is down
  • A disk of the database is down
  • The NAS is down
  • A disk is broken
  • The VM that the database runs on is down
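
Before calling anyone in, a team can often narrow this list down with a quick automated first-pass check. The sketch below is illustrative only: the host name and port are assumptions, and it only distinguishes between name resolution, TCP-level reachability, and "connection works but something else is slow".

    import socket

    DB_HOST = "db.example.internal"  # assumption: the database host the team connects to
    DB_PORT = 5432                   # assumption: the database port

    def check_dns(host):
        # Can we resolve the host at all? If not, suspect naming/network, not the database.
        try:
            socket.gethostbyname(host)
            return True
        except socket.gaierror:
            return False

    def check_tcp(host, port, timeout=3.0):
        # Can we open a TCP connection? If not, suspect firewall, routing, or a down database/VM.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if not check_dns(DB_HOST):
        print("DNS resolution failed: likely a network or naming issue.")
    elif not check_tcp(DB_HOST, DB_PORT):
        print("TCP connection failed: suspect firewall, routing, or the database/VM being down.")
    else:
        print("Connection works: suspect a busy database (long query) or storage-level slowness.")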

And you can think of more. The thing is, if you do not know where the failure is and you call in help from the first team you think you need, it can take a long time to go down the list before the problem is solved. And even if you call the right people, it still takes time. Let's say the database is down. Who do you call?

  • Okay, let's call in a database expert to see what is happening. He says the database cannot reach a disk on the Linux machine.
  • Okay, let's call the Linux expert. The Linux expert says he cannot see the disk either; it must be a virtualization issue.
  • Let's call the virtualization expert. Nope, the disk really is not there; let's call the storage people.
  • The storage people say: nothing is down, the disk you are on is just too busy. We will switch your storage to a different disk.
  • Now you call the Linux and database people again to get everything started again.
  • Problem solved.

Even if you did this during the daytime, it would take a couple of hours with everybody on the same page. So we need to be able to cut that time down. How can we do that?

The only way is to have a central team that can coordinate things. When an incident comes in that a team does not understand, or that is taking too long, they can escalate to a central team that then gets the right people together to solve the incident. In the case of the previous scenario, that central team would have called all the teams at once, and kept them on until somebody said a team was no longer needed or until the problem was solved. What you call this central team is up to you.
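
As a hypothetical sketch of that escalation rule: if the owning team cannot localize the fault within an agreed time box, the central team pages all candidate resolver groups in parallel instead of working through them one by one. The time box, group names, and paging function are assumptions for illustration.

    from datetime import timedelta

    ESCALATION_TIME_BOX = timedelta(minutes=30)  # assumption: agreed limit before escalating
    CANDIDATE_GROUPS = ["database", "linux", "virtualization", "storage", "network"]

    def page(group):
        # Placeholder for whatever paging or chat tool the organization uses.
        print(f"Paging the {group} team...")

    def escalate_if_stuck(time_spent, fault_localized):
        # Central-coordination rule: page everyone at once instead of one team at a time,
        # and release each group again as soon as it is clearly not needed.
        if fault_localized or time_spent < ESCALATION_TIME_BOX:
            return
        for group in CANDIDATE_GROUPS:
            page(group)

    escalate_if_stuck(timedelta(minutes=45), fault_localized=False)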

Root Cause Analysis

One more thing to think about: if there is an incident, you always want to do an RCA (Root Cause Analysis). But who will facilitate the RCA? Studies have shown that you get better, more blameless RCAs when they are facilitated by a group that is not in the line of the incident process. Such a group is more objective and impartial.




