Oops of DevOps
Keep Calm - DevOps is here to the rescue

No matter how hard we try, any application we develop will have its oops moments. In a DevOps or DevSecOps setup these are harder to handle, because there is no dedicated operations team to deal with end-user (customer) issues. It therefore becomes important to come up with an application support model that is sustainable and economical, and that also solves the pain points of the engineers who are asked to support the application.

Pain Points

  • When a developer writes code, are they responsible for supporting L4-L3-L2-L1 issues in the middle of the night?
  • Should all developers be available, or should there be a rota? If there is a rota, how can I support APIs written by someone else?
  • What if a developer is on leave or has an emergency?
  • Should a group or an individual be identified for support?
  • What kind of tools should be used to make operations easy?
  • What practices should be followed to have fewer issues and reduce the operations overhead?
  • How can the SRE concept be utilized?

Proposed Solution

  • Introduce Chaos Engineering practices: provide a platform to experiment with and test failures. Inject infrastructure failures with Chaos Monkey style tools, and application failures with the Simian Army, during normal working hours, so that the team learns to tolerate failures.
  • Follow Design for Failure (assume the application will fail and prepare a recovery plan) and Secure by Design (keep security vulnerabilities in mind and plan to minimize their impact).
  • Keep all team members aware of all the APIs through showcases, Sprint planning, rotation of developers from one API to another (every sprint) and so on. More communication helps them understand all the work the team is delivering.
  • Each API should generate enough logs to raise different levels of alerts and to debug issues.
  • Log analysis, application performance monitoring and infrastructure monitoring tools must be in place to generate alerts.
  • When there is an alert, on-call engineers must be notified through channels such as SMS, voice call, push notification or email.
  • An incident management platform such as PagerDuty or OpenDuty (open source) must be in place.
  • Have a rota with a weekly SRE who is responsible for 24x7 support. In a team of 8, each engineer plays the SRE role for one week, followed by 7 weeks off.
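
As a rough illustration of the weekly rota, here is a minimal Python sketch that works out who holds the SRE role on a given day. The engineer names, the rota start date and the function name `on_call_for` are all made up for the example; an incident management platform would normally keep this schedule for you.

```python
from datetime import date

# Hypothetical team of 8; the names are placeholders.
TEAM = ["eng1", "eng2", "eng3", "eng4", "eng5", "eng6", "eng7", "eng8"]

# Assumed first day of the rota (a Monday).
ROTA_START = date(2024, 1, 1)


def on_call_for(day, engineers=TEAM, rota_start=ROTA_START):
    """Return the engineer who holds the SRE role in the week containing `day`."""
    if day < rota_start:
        raise ValueError("day is before the rota starts")
    # Whole weeks elapsed since the rota started.
    week_index = (day - rota_start).days // 7
    # Round-robin: with 8 engineers, each one is on call 1 week in every 8.
    return engineers[week_index % len(engineers)]


if __name__ == "__main__":
    for day in (date(2024, 1, 1), date(2024, 1, 8), date(2024, 2, 26)):
        print(day.isoformat(), on_call_for(day))
```

With 8 engineers the cycle repeats every 56 days, so whoever is on call in the first week is on call again in the ninth.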

SRE (Site Reliability Engineer) or whatever you want to call them

  • Should be available to take calls
  • Should have read access to all the monitoring and alerting portals (including logs)
  • Should be notified by SMS, followed by an automated call
  • Must maintain a run book (documentation of incidents: issues and resolutions)
  • Should pick up only half of their usual capacity in the Sprint
  • There will be a lieutenant SRE to cover when the SRE has to travel, is sick or has an emergency.
  • If they have to handle an issue outside working hours, they should be given time off the next day (half day or full day, depending on the time spent on the issue).
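
The notification order described above (SMS first, then an automated call, with the lieutenant covering when the SRE is away) could be sketched as follows. This is only an illustration: the `Step` type, the `notification_plan` function, the names and the 5-minute gap before the call are assumptions, and a tool such as PagerDuty would express the same idea as an escalation policy in its own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    after_minutes: int  # minutes after the alert fires
    channel: str        # "sms" or "voice_call"
    recipient: str


def notification_plan(sre, lieutenant, sre_available=True, call_delay=5):
    """Build the ordered notification steps for one alert.

    The lieutenant receives the notifications only when the SRE is
    unavailable (travel, sickness, emergency). `call_delay` is an
    assumed gap between the SMS and the automated call.
    """
    recipient = sre if sre_available else lieutenant
    return [
        Step(0, "sms", recipient),                  # SMS goes out first
        Step(call_delay, "voice_call", recipient),  # automated call follows
    ]


if __name__ == "__main__":
    for step in notification_plan("asha", "ben", sre_available=False):
        print(step)
```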

Next Steps (up to the organisation)

  • What should the compensation be if the SRE spends the night resolving an issue? There should be an option to choose between a monetary benefit and a flexible day off.
  • How much compensation? This should be decided in consultation with the HR and finance teams.
