Chaos Engineering in the world of SaaS & Cloud Computing

The 3-step approach below is based on experience conducting chaos engineering across a wide range of applications and on building Disaster Recovery (DR) solutions in the past. This is part 1 of a 2-part series and covers SaaS offerings. I hope it adds value to your engineering journey.

What’s chaos engineering? 

It’s a technique in which you deliberately inject failures into a running system, usually through automation, to study their impact and how well the system recovers. The term “chaos engineering” originated from Netflix's internal practice.
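
As a minimal Python sketch of the idea (not a production tool): a decorator fails a fraction of calls on purpose, and the caller's fallback is the recovery behavior under study. The names `inject_failures`, `fetch_profile`, and the 30% failure rate are hypothetical, chosen only for illustration.

```python
import random
from functools import wraps


class InjectedFault(Exception):
    """Raised by the chaos wrapper instead of calling the real function."""


def inject_failures(failure_rate, rng=None):
    """Decorator that makes a fraction of calls fail on purpose."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFault(f"chaos: {func.__name__} failed on purpose")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failures(failure_rate=0.3, rng=random.Random(42))
def fetch_profile(user_id):
    # Hypothetical downstream call; stands in for a real service request.
    return {"id": user_id, "name": "example"}


def fetch_profile_with_fallback(user_id):
    """The recovery path being studied: degrade gracefully on failure."""
    try:
        return fetch_profile(user_id)
    except InjectedFault:
        return {"id": user_id, "name": None, "degraded": True}


results = [fetch_profile_with_fallback(i) for i in range(1000)]
degraded = sum(1 for r in results if r.get("degraded"))
print(f"{degraded} of {len(results)} calls hit the injected failure")
```

Every call still returns a usable response, which is exactly the kind of claim a chaos exercise is meant to confirm or refute.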

Why chaos engineering & what type of system needs it?

  • Disasters are inevitable in the SaaS world. To run reliable systems, we must anticipate failures before they happen. Chaos exercises reduce the blast radius of a disaster and, in many cases, let you resolve the underlying issue before users ever experience it.
  • Systems built from distributed services or components with dependencies need chaos engineering to validate their SLAs against the disruption of one or more services.

How do we approach the chaos engineering exercise? 

The 3-step approach laid out below helps you optimize your effort and get the best results. The word “chaos” sounds crazy, but this still has to be a well-planned and controlled exercise.

“We still don’t know what we don’t know yet”

1) Preparation: This is a coordinated effort from the engineering team. It covers the following:

  • Define steady-state metrics that describe the overall health of the system (for example, synthetic checks that mirror what customers experience).
  • Derive baseline hypotheses against the steady states defined above.
  • List all well-known disaster scenarios along with their fixes (triage past post-mortems).
  • Identify all existing reusable code and tools required to support automation.
  • List all the tests, covering a wide range of issues and real-world problems. Examples: (a) data-center/region failures, (b) virtual/hardware failures, (c) race conditions, (d) load on the overall system or on individual services, (e) dependency breakdowns, (f) functional bugs, (g) 3rd-party service failures.
  • Establish (a) the chaos-exercise flow plan, (b) the schedule, and (c) ownership (SMEs). Have a template to record each triggered test plan.

“Your output is only as good as your in-depth planning and focus”
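
One way to picture the template for a triggered test plan is a small record that ties a scenario, its hypothesis, and its owner to the steady-state metrics. The sketch below is only an illustration in Python; the field names, metric names, and thresholds are all assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class SteadyStateMetric:
    """One health signal and the range considered normal for it."""
    name: str
    lower: float
    upper: float

    def holds(self, observed):
        return self.lower <= observed <= self.upper


@dataclass
class ExperimentRecord:
    """Template for one triggered test: what was tried, by whom, and the result."""
    scenario: str
    hypothesis: str
    owner: str
    metrics: list
    observations: dict = field(default_factory=dict)

    def deviations(self):
        """Names of observed metrics that fell outside their steady-state range."""
        return [
            m.name
            for m in self.metrics
            if m.name in self.observations
            and not m.holds(self.observations[m.name])
        ]

    def hypothesis_held(self):
        return not self.deviations()


# Hypothetical scenario and numbers, for illustration only.
record = ExperimentRecord(
    scenario="dependency breakdown: payment provider unreachable",
    hypothesis="checkout falls back to queued payments and stays in range",
    owner="payments SME",
    metrics=[
        SteadyStateMetric("checkout_success_rate", 0.99, 1.0),
        SteadyStateMetric("p95_latency_ms", 0.0, 800.0),
    ],
)
record.observations = {"checkout_success_rate": 0.995, "p95_latency_ms": 950.0}
print(record.deviations(), record.hypothesis_held())
```

Keeping the hypothesis and the metric ranges in the same record makes the later report factual: a deviation is whatever falls outside the range agreed on during preparation.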

2) War-room exercise: Build a replica of the production environment in which to conduct all types of disaster exercises. This environment should emulate production not only in infrastructure setup but also in load and traffic characteristics. Adopt or build automated tools to conduct the chaos exercises; the right tools will vary with your tech stack. The chaos exercise on a replica should achieve the following objectives:

  • Chaos engineering operators gain in-depth knowledge of the systems. If they do not have it yet, this is the time to train them.
  • It consolidates tools (adopted off the shelf or built in-house), playbooks, and the automated recovery process.
  • It reduces the blast radius in the production environment, because some issues are found and fixed here first.
  • It validates or invalidates some of the hypotheses (do not completely disregard any of them yet).

“Real-world events will showcase real-world problems”
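
The consolidation of playbooks and automated recovery can be sketched as a dry run on the replica: trigger each scenario, apply its playbook, and note which scenarios have no playbook yet. This is a toy Python sketch; the in-memory `replica` dictionary and the scenario names are hypothetical stand-ins for a real environment and real tooling.

```python
# In-memory stand-in for a replica environment: service name -> healthy?
replica = {"api": True, "queue": True, "db": True}


def kill(service):
    replica[service] = False


def restart(service):
    replica[service] = True


# Fault to inject for each scenario (hypothetical names).
FAULTS = {
    "queue-down": lambda: kill("queue"),
    "db-down": lambda: kill("db"),
    "api-down": lambda: kill("api"),
}

# Automated recovery playbooks; "api-down" is deliberately missing.
PLAYBOOKS = {
    "queue-down": lambda: restart("queue"),
    "db-down": lambda: restart("db"),
}


def run_scenario(name):
    """Inject the fault, apply the playbook if one exists, report the outcome."""
    FAULTS[name]()
    playbook = PLAYBOOKS.get(name)
    if playbook is None:
        outcome = "no playbook"
    else:
        playbook()
        outcome = "recovered" if all(replica.values()) else "recovery failed"
    # Reset the replica so scenarios do not influence each other.
    for service in replica:
        replica[service] = True
    return outcome


report = {name: run_scenario(name) for name in FAULTS}
for name, outcome in report.items():
    print(f"{name}: {outcome}")
```

A scenario that reports "no playbook" is a gap to close on the replica, before the same fault is ever tried in production.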

3) Live exercise: By this time you should have automation in place to conduct the chaos exercise in your production environment. Remember that you are trying to break the system while staying within the guardrails of your SLOs and complying with your SLAs. A few precautions to take:

  • SMEs for the various services are available and on call.
  • Follow the flow plan and prepare a factual report based on observed deviations from steady state, the actions they triggered, and any tactical fixes.
  • Validate the metrics against the SLOs. Terminate the chaos exercise, again through automation, if it comes close to breaching an SLO. Revisit the test after you have a fix.
  • Automation should clean up when you terminate or interrupt a chaos exercise and ensure that no zombie processes are left behind.

“Mantra: The harder it gets to break, the more stable your system is”
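
The last two precautions, automated termination near an SLO breach and guaranteed cleanup, can be sketched in Python with a context manager. The 1% error-rate SLO, the 80% abort margin, and the `state` dictionary standing in for a real fault injector are all assumptions made for this illustration.

```python
from contextlib import contextmanager

SLO_MAX_ERROR_RATE = 0.01  # assumed SLO: at most 1% of requests may fail
ABORT_FRACTION = 0.8       # assumed margin: abort at 80% of the SLO limit


@contextmanager
def injected_fault(state):
    """Activate a fault and always undo it, even on abort or interruption."""
    state["fault_active"] = True
    try:
        yield
    finally:
        state["fault_active"] = False  # cleanup: nothing is left running


def run_live_exercise(error_rate_samples, state):
    """Watch error-rate samples during the fault; abort near the SLO limit."""
    abort_threshold = SLO_MAX_ERROR_RATE * ABORT_FRACTION
    with injected_fault(state):
        for step, error_rate in enumerate(error_rate_samples):
            if error_rate >= abort_threshold:
                return {"status": "aborted", "step": step, "error_rate": error_rate}
    return {"status": "completed", "step": None, "error_rate": None}


state = {"fault_active": False}
calm = run_live_exercise([0.001, 0.002, 0.003], state)
risky = run_live_exercise([0.001, 0.004, 0.009, 0.002], state)
print(calm["status"], risky["status"], risky["step"], state["fault_active"])
```

Because the cleanup lives in `finally`, it runs whether the exercise completes, is aborted by the threshold check, or is interrupted by an exception.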

What’s the end game of chaos engineering?

Its objective is to build resilient systems, which in turn consistently improve SLIs and ROI.

#chaosengineering #SRE #devopsworld #saasplatform #saasops #aws #cloudcomputing #cloud #devops

Coming soon: Part 2 of 2, “Chaos Engineering in the world of IoT, Robotics and Edge computing”...

Author: Shankar Muniyappa
Co-Author: Sridhar Solur, GM|CPO|CTO - Robotics, IoT, SaaS