Incident Management & DEV/OPS

A lot of people say that if you do DEV/OPS you no longer need incident management. This is not correct. At best, with DEV/OPS you no longer need a central incident management team. In this article I will explain why you still need incident management, and why in some cases there is still a need for a central incident management organization.

The DEV/OPS way of looking at it.

Every team in DEV/OPS is responsible for their own application, hardware, databases, etc. So they are also responsible for failures on all the components that make up their application. If a failure happens, they need to resolve it. The question becomes: how are they going to go about this? How are they going to manage this incident? (Incident Management.) So we do need incident management.

DEV/OPS incident management in practice.

For DEV/OPS, only 3 categories / priorities of incidents are needed (a small code sketch of these follows the list).

  1. The incident is so big that it needs to be addressed immediately. This means drop whatever you are working on and work until it is fixed. Breaks the sprint open with extra work.
  2. The incident is bad and needs fixing, but can wait until somebody has finished the piece of work they are currently on. Working hours only. Needs to be fixed within the sprint. Breaks the sprint open with extra work.
  3. Can be picked up in the next sprint.
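
To make this concrete, here is a minimal Python sketch of how a team could encode these three priorities in its own tooling. It is purely illustrative; the names and messages are my assumptions, not something prescribed by DEV/OPS.

    from enum import Enum

    class IncidentPriority(Enum):
        P1_INTERRUPT_NOW = 1   # drop everything, work until fixed
        P2_WITHIN_SPRINT = 2   # finish the current task first, working hours only
        P3_NEXT_SPRINT = 3     # goes on the backlog for the next sprint

    def handle_incident(priority: IncidentPriority) -> str:
        # Hypothetical routing of an incident onto the three DEV/OPS priorities.
        if priority is IncidentPriority.P1_INTERRUPT_NOW:
            return "Interrupt current work now; the sprint is broken open with extra work."
        if priority is IncidentPriority.P2_WITHIN_SPRINT:
            return "Pick up after the current task; fix within this sprint."
        return "Put it on the backlog for the next sprint."

    print(handle_incident(IncidentPriority.P2_WITHIN_SPRINT))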

Looking at this from an organizational point of view.

Scenario 1:

You have an organization / business with 10 DEV/OPS teams, and all these teams have nicely set up their own incident management. They all have different definitions of their priorities, so what one team considers a prio 1, another team considers a prio 2. That is not a problem as long as they work in isolation. But there are almost always connections between teams, such as shared infrastructure, databases or even direct API calls. When those connections are there and there is a mismatch in incident management, you can get to a point where you are unable to meet the agreements you have with your customers.

So you could do without a central incident management team in this case, as long as there is an effort to harmonize the way the DEV/OPS teams respond to incidents. Somebody needs to write an Incident Management Guide for the DEV/OPS teams, and you need to be able to enforce it.
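
One way to make such a guide enforceable, sketched here under the assumption that teams are willing to express their local priorities in code, is to publish the shared definitions as a small module that every team's incident tooling validates against. All names below are hypothetical.

    # Hypothetical shared Incident Management Guide as code: one canonical set of
    # priorities that every DEV/OPS team's local incident tooling must map onto.
    SHARED_PRIORITIES = {
        "P1": "Interrupt immediately, work until fixed",
        "P2": "Fix within the current sprint, working hours only",
        "P3": "Plan into the next sprint",
    }

    def validate_team_mapping(team_name, local_to_shared):
        # Fail fast if a team's local priority labels do not map onto the shared ones.
        unknown = set(local_to_shared.values()) - set(SHARED_PRIORITIES)
        if unknown:
            raise ValueError(f"{team_name} maps to unknown shared priorities: {unknown}")

    # Example: a team that calls its priorities "critical" / "high" / "normal".
    validate_team_mapping("payments-team", {"critical": "P1", "high": "P2", "normal": "P3"})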

Scenario 2:

A team finds a problem with their application but is unable to figure out where the problem is. Let's say they do not get an answer from a database. This could be a lot of things; here are a few (a quick first-pass check you could script is sketched after the list):

  • The firewall is closed towards the database
  • The firewall is closed from the database towards the application
  • A network routing issue
  • The database is busy with a long-running query
  • The database is down
  • The database is on a different site
  • The CPU of the database server is down
  • A disk of the database is down
  • The NAS is down
  • A disk is broken
  • The VM that the database runs on is down
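
Before calling anyone in, a team can often narrow this list down with a quick automated first-pass check. The sketch below is illustrative only: the host name and port are assumptions, and it only distinguishes between name resolution, TCP-level reachability, and "connection works but something else is slow".

    import socket

    DB_HOST = "db.example.internal"  # assumption: the database host the team connects to
    DB_PORT = 5432                   # assumption: the database port

    def check_dns(host):
        # Can we resolve the host at all? If not, suspect naming/network, not the database.
        try:
            socket.gethostbyname(host)
            return True
        except socket.gaierror:
            return False

    def check_tcp(host, port, timeout=3.0):
        # Can we open a TCP connection? If not, suspect firewall, routing, or a down database/VM.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if not check_dns(DB_HOST):
        print("DNS resolution failed: likely a network or naming issue.")
    elif not check_tcp(DB_HOST, DB_PORT):
        print("TCP connection failed: suspect firewall, routing, or the database/VM being down.")
    else:
        print("Connection works: suspect a busy database (long query) or storage-level slowness.")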

And you can think of more. The thing is, if you do not know where the failure is and you call in help from the first team you think you need, it can take a long time to go down the list before the problem is solved. And even if you call the right people, it still takes time. Let's say the database is down. Who do you call?

  • Okay, let's call in a database expert to see what is happening. He says the database cannot reach a disk on the Linux machine.
  • Okay, let's call the Linux expert. The Linux expert says he cannot see the disk either; it must be a virtualization issue.
  • Let's call the virtualization expert. Nope, the disk really is not there; let's call the storage people.
  • The storage people say: nothing is down, the disk you are on is just too busy. We will switch your storage to a different disk.
  • Now you call the Linux and database people again to get everything started again.
  • Problem solved.

Even if you did this during the daytime, it would take a couple of hours with everybody on the same page. So we need to be able to cut that time down. How can we do that?

The only way is to have a central team that can coordinate things. When an incident comes in that a team does not understand, or that is taking too long, they can escalate to a central team that then gets the right people together to solve the incident. In the case of the previous scenario, that central team would have called all the teams at once, and kept them on until somebody said a team was no longer needed or until the problem was solved. What you call this central team is up to you.
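
As a hypothetical sketch of that escalation rule: if the owning team cannot localize the fault within an agreed time box, the central team pages all candidate resolver groups in parallel instead of working through them one by one. The time box, group names, and paging function are assumptions for illustration.

    from datetime import timedelta

    ESCALATION_TIME_BOX = timedelta(minutes=30)  # assumption: agreed limit before escalating
    CANDIDATE_GROUPS = ["database", "linux", "virtualization", "storage", "network"]

    def page(group):
        # Placeholder for whatever paging or chat tool the organization uses.
        print(f"Paging the {group} team...")

    def escalate_if_stuck(time_spent, fault_localized):
        # Central-coordination rule: page everyone at once instead of one team at a time,
        # and release each group again as soon as it is clearly not needed.
        if fault_localized or time_spent < ESCALATION_TIME_BOX:
            return
        for group in CANDIDATE_GROUPS:
            page(group)

    escalate_if_stuck(timedelta(minutes=45), fault_localized=False)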

Root Cause Analysis

One more thing to think about: if there is an incident, you always want to do an RCA (Root Cause Analysis). But who will facilitate the RCA? Studies have shown that you get better, more blameless RCAs when they are facilitated by a group that is not in the line of the incident process. Such a group is more objective and impartial.




