Oops of DevOps

Kartikey Mishra

Published Oct 31, 2018

No matter how hard we try, any application we develop will have some oops moments. When we work in DevOps or DevSecOps, handling those moments becomes more challenging, because there is no dedicated operations team to deal with end users' (customers') issues. So it becomes important to come up with an application support model that is sustainable and economical, and that also addresses the pain points of the engineers who are asked to support the application.

Pain Points

When a developer writes code, are they responsible for supporting L4-L3-L2-L1 issues in the middle of the night?
Should all developers be available, or should there be a rota? If there is a rota, how can I support APIs written by someone else?
What if a developer is on leave or has an emergency?
Should a group or an individual be identified for support?
What kind of tools should be used to make operations easy?
What practices should be followed to reduce the number of issues and the operations overhead?
How can the SRE concept be utilized?

Proposed Solution

Introduce Chaos Engineering practices: provide a platform to experiment with and test failures. Inject infrastructure failures with Chaos Monkey-style tools and application failures with Simian Army-style tools during normal working hours, so that the application is hardened to tolerate failures.
Follow Design for Failure (assume the application will fail and prepare a recovery plan) and Secure by Design (assume security vulnerabilities exist and plan to minimize their impact).
Keep all team members aware of all the APIs through showcases, sprint planning, rotation of developers from one API to another (every sprint), and so on. More communication helps them understand all the work the team is delivering.
Each API should generate enough logs to raise alerts at different levels and to debug issues.
Log analysis, application performance monitoring and infrastructure monitoring tools must be in place to generate alerts.
When there is an alert, on-call engineers must be notified through channels such as SMS, voice call, push notification or email.
Use an incident management platform such as PagerDuty or OpenDuty (open source).
Have a rota with a weekly SRE who is responsible for 24x7 support. In a team of 8, each engineer plays the SRE role for one week, followed by 7 weeks off.
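The weekly rota above can be sketched in a few lines of Python. This is only an illustration of the round-robin idea, not a replacement for the scheduling features of an incident management tool. The function names (build_rota, on_call_for), the engineer names and the start date are all hypothetical.

```python
from datetime import date, timedelta


def build_rota(engineers, first_week_start, weeks):
    """Return a list of (week_start, engineer) pairs, assigned round-robin."""
    rota = []
    for week in range(weeks):
        # With 8 engineers, each name comes up again after 7 weeks off.
        on_call = engineers[week % len(engineers)]
        rota.append((first_week_start + timedelta(weeks=week), on_call))
    return rota


def on_call_for(rota, day):
    """Return the engineer on call on the given day, or None if not covered."""
    for week_start, engineer in rota:
        if week_start <= day < week_start + timedelta(weeks=1):
            return engineer
    return None


# Hypothetical team of 8 and an arbitrary Monday as the first week.
team = [f"engineer-{i}" for i in range(1, 9)]
rota = build_rota(team, date(2018, 11, 5), weeks=16)
```

Calling on_call_for(rota, some_date) then answers "who is the SRE this week?", and the same list can be shared with the team so that everyone knows their week well in advance.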

SRE (Site Reliability Engineer) or whatever you want to call them

Is available to take calls.
Should have read access to all the monitoring and alerting portals (including logs).
Should be notified by SMS, followed by an automated call.
Must maintain a runbook (documentation of incidents: issues and their resolutions).
The SRE should pick up only half of their usual capacity in the sprint.
There will be a lieutenant SRE to cover when the SRE has to travel, is sick or has an emergency.
If the SRE handles an issue after working hours, they should be given time off the next day (a half day or a full day, depending on the time spent on the issue).
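As a rough sketch, the notification order (SMS first, then an automated call) and the time-off rule could be modelled as below. Several things here are assumptions rather than part of the proposal: the delays of 5 and 15 minutes, escalating to the lieutenant when an alert stays unacknowledged, and the 4-hour threshold between a half day and a full day. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    delay_minutes: int  # minutes after the alert fired
    target: str         # who gets notified
    channel: str        # "sms" or "voice_call"


def escalation_plan(sre, lieutenant, call_after=5, escalate_after=15):
    """SMS first, then an automated call; then repeat for the lieutenant.

    The delays are assumed defaults, not values from the article.
    """
    return [
        Step(0, sre, "sms"),
        Step(call_after, sre, "voice_call"),
        Step(escalate_after, lieutenant, "sms"),
        Step(escalate_after + call_after, lieutenant, "voice_call"),
    ]


def steps_due(plan, minutes_unacknowledged):
    """Steps that should have fired for an alert nobody has acknowledged."""
    return [s for s in plan if s.delay_minutes <= minutes_unacknowledged]


def time_off_days(hours_spent, full_day_threshold=4.0):
    """Half day or full day off, depending on time spent after hours.

    The 4-hour threshold is an assumption; the organisation sets the real one.
    """
    if hours_spent <= 0:
        return 0.0
    return 1.0 if hours_spent >= full_day_threshold else 0.5


plan = escalation_plan("alice", "bob")
```

In practice an incident management platform would own this policy; the sketch is only meant to make the order of notifications and the time-off rule explicit enough to discuss with the team.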

Next Steps (up to the organisation)

What should the compensation be for an SRE who spends the night resolving an issue? There should be an option to choose between a monetary benefit and a flexible day off.
How much compensation? This should be decided in consultation with the HR and finance teams.
