From the course: DevOps Foundations: Site Reliability Engineering
Introducing postmortems
- Once services get restored after an incident, your job is only half over.
- That's right. Being a good SRE requires learning from failures and outages, and there's no better way to do that than postmortems, also known, less gruesomely, as incident retrospectives.
- They function as the feedback loop from an incident back to product development and operations. But if they're so great, why do we often avoid doing them?
- Well, traditionally, when things go down, blame gets passed around. An investigation into an outage can end up being mostly about who's going to be accused of being responsible for it.
- Ah, yeah, that sounds horrible. Of course people would want to avoid doing that. But then you lose the chance to learn from mistakes and improve on what went wrong.
- Sure does, and that's why one of the best things you can do is run blameless postmortems. In a blameless postmortem, engineers…
Contents
- Release engineering (5m 12s)
- Change management (2m 55s)
- Self-service automation (4m 46s)
- SLAs and SLOs (5m 21s)
- Incident management (5m 43s)
- Introducing postmortems (3m 29s)
- The postmortem process (4m 3s)
- Troubleshooting (5m 58s)
- Performance engineering (5m 36s)
- Capacity and scalability (5m 21s)
- Distributed design (5m 2s)
- Deliberate adversity (3m 57s)