From the course: DevOps Foundations: Site Reliability Engineering

Introducing postmortems

- Once services are restored after an incident, your job is only half over.
- That's right. Being a good SRE requires learning from failures and outages, and there's no better way to do that than postmortems, also known, less gruesomely, as incident retrospectives.
- They function as the feedback loop from an incident back to product development and operations. But if they're so great, why do we so often avoid doing them?
- Well, traditionally, when things go down, blame gets passed around. An investigation into an outage can end up being mostly about who's going to be blamed for it.
- Ah, yeah, that sounds horrible. Of course people would want to avoid doing that. But then you lose the chance to learn from mistakes and improve on what went wrong.
- Sure does, and that's why one of the best things you can do is run blameless postmortems. In a blameless postmortem, engineers…
