Chaos Engineering in the world of SaaS & Cloud Computing

Shankar Muniyappa

Published Jan 28, 2021

The 3-step approach below is based on experience conducting chaos engineering across a wide array of applications and building Disaster Recovery (DR) solutions. This is Part 1 of a 2-part series and covers SaaS offerings. I hope it adds value to your engineering journey.

What’s chaos engineering?

It’s a technique in which you deliberately inject failures, usually through automation, into a live system to study the impact and how well the system recovers. The term “chaos engineering” originated from Netflix's internal practice.

Why chaos engineering & what type of system needs it?

Disasters are inevitable in the SaaS world. In an era of relentless engineering, we must anticipate issues in order to run reliable systems. A chaos exercise reduces the blast radius of a disaster and, in many cases, lets you resolve it before users ever experience it.
Systems made up of distributed services or components with dependencies need chaos engineering to validate their SLAs against the disruption of one or more services.

How do we approach the chaos engineering exercise?

The 3-step approach laid out below helps you optimize your effort and get the best results. The word “chaos” sounds crazy, but the exercise still has to be well planned and controlled.

“We still don’t know what we don’t know yet”

1) Preparation: This is a coordinated effort from the engineering team, which does the following:

Define steady-state metrics that describe the overall health of the system (synthetic checks that reflect what customers experience).
Derive baseline hypotheses against the steady states defined above.
List all well-known disaster scenarios along with their fixes (triage post-mortems).
Identify all existing reusable code and tools required to support automation.
List all the tests, covering a wide range of issues and real-world problems. Examples: (a) data-center/region failures (b) virtual/hardware failures (c) race conditions (d) overall or individual service load (e) dependency breakdowns (f) functional bugs (g) 3rd-party service failures
Establish (a) a chaos-exercise flow plan, (b) a schedule, and (c) ownership (SMEs). Have a template to record each triggered test plan.
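
As an illustration of the preparation items above, here is a minimal Python sketch of a record template that carries a steady-state hypothesis and notes deviations from it. The class names, field names, and metric names are all hypothetical, chosen for this example only; a real template would follow your own monitoring and ownership model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SteadyStateMetric:
    """One customer-facing health metric with its acceptable range (illustrative)."""
    name: str
    lower: float  # lowest acceptable value
    upper: float  # highest acceptable value

    def holds(self, observed: float) -> bool:
        return self.lower <= observed <= self.upper


@dataclass
class ExperimentRecord:
    """Hypothetical template for recording one triggered test plan."""
    scenario: str
    owner: str  # the SME who owns this test
    hypothesis: List[SteadyStateMetric]
    deviations: List[str] = field(default_factory=list)

    def evaluate(self, observed: Dict[str, float]) -> bool:
        """Return True if every metric stayed in range; record those that did not."""
        self.deviations = []
        for metric in self.hypothesis:
            value = observed.get(metric.name)
            if value is None:
                self.deviations.append(f"{metric.name}: no data")
            elif not metric.holds(value):
                self.deviations.append(
                    f"{metric.name}: observed {value}, "
                    f"expected {metric.lower}..{metric.upper}"
                )
        return not self.deviations


if __name__ == "__main__":
    record = ExperimentRecord(
        scenario="dependency breakdown: cache unavailable",
        owner="cache-sme",
        hypothesis=[
            SteadyStateMetric("checkout_success_rate", 0.99, 1.0),
            SteadyStateMetric("p95_latency_ms", 0.0, 800.0),
        ],
    )
    # Observed values would normally come from synthetic checks.
    held = record.evaluate({"checkout_success_rate": 0.995, "p95_latency_ms": 950.0})
    print(held, record.deviations)
```

Keeping the hypothesis and the observed deviations in one record makes the factual report for each test easy to assemble later.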

“Your output is as good as your in-depth planning and focus”

2) War-Room exercise: Build a replica of the production environment in which to conduct all types of disaster exercises. This environment should emulate not only production's infrastructure setup but also its load and traffic characteristics. Adopt or build automated tools to conduct chaos exercises; these may vary depending on your tech stack. The chaos exercise on a replica should achieve the following objectives:

Chaos engineering operators should have in-depth knowledge of the systems. If they do not, this is the time to train them.
It consolidates tools (possibly off the shelf), playbooks, and the recovery process (automated).
It reduces the blast radius in the production environment by fixing some of the issues beforehand.
It validates or invalidates a few hypotheses (do not discard any of them completely yet).
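
To make the idea of an automated chaos tool concrete, here is a minimal Python sketch of a fault injector that wraps calls to a dependency on the replica and, while active, adds latency or raises an error. `FaultInjector`, `InjectedFault`, and their parameters are hypothetical names for this example, not part of any existing tool; real tooling will depend on your tech stack.

```python
import random
import time
from contextlib import contextmanager
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


class FaultInjector:
    """Wraps dependency calls; injects latency and failures only while active."""

    def __init__(self, failure_rate: float = 0.0, added_latency_s: float = 0.0,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.added_latency_s = added_latency_s
        self.enabled = False
        self._rng = rng or random.Random()

    @contextmanager
    def active(self) -> Iterator[None]:
        """Enable injection for the duration of a with-block, then always disable."""
        self.enabled = True
        try:
            yield
        finally:
            self.enabled = False

    def call(self, fn: Callable[[], T]) -> T:
        """Call fn, possibly delaying it or failing first when injection is enabled."""
        if self.enabled:
            if self.added_latency_s > 0:
                time.sleep(self.added_latency_s)
            if self._rng.random() < self.failure_rate:
                raise InjectedFault("injected dependency failure")
        return fn()


if __name__ == "__main__":
    injector = FaultInjector(failure_rate=1.0)
    with injector.active():
        try:
            injector.call(lambda: "payload")
        except InjectedFault as exc:
            print("service saw:", exc)
    print(injector.call(lambda: "payload"))  # injection is off again here
```

Because the injector is disabled by default and switches itself off when the with-block exits, a forgotten cleanup step cannot leave faults running after the exercise.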

“Real world events will showcase real world problems”

3) Live exercise: By this time you should have automation in place to conduct the chaos exercise in your production environment. Remember that you are trying to break the system while still staying within the guardrails of your SLOs and complying with your SLAs. A few precautions to take:

SMEs for the various services are available and on call.
Follow the flow and prepare a factual report based on observed deviations from steady state, the associated triggered actions, and any tactical fixes.
Validate the metrics against SLOs. Terminate the chaos exercise if it comes close to breaching an SLO (again, through automation). Revisit it once you have a fix.
Automation should clean up when you terminate or interrupt a chaos exercise and ensure that no zombie processes are left behind.
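
The last two precautions can be sketched together: an automated guard that stops the exercise before an SLO is breached, and cleanup that runs however the exercise ends. The Python below is a minimal sketch under stated assumptions. The function name, the `safety_margin` parameter, and the use of an error-rate list in place of a live monitoring feed are all hypothetical.

```python
from typing import Callable, Iterable, List


def run_with_slo_guard(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    error_rates: Iterable[float],
    slo_error_rate: float,
    safety_margin: float = 0.8,
) -> List[str]:
    """Run one experiment and abort before the SLO is breached.

    error_rates stands in for a stream of readings from monitoring.
    The abort threshold is slo_error_rate * safety_margin, so the
    exercise stops while there is still headroom.
    """
    abort_at = slo_error_rate * safety_margin
    log: List[str] = []
    try:
        inject()
        log.append("fault injected")
        for rate in error_rates:
            if rate >= abort_at:
                log.append(f"aborted: error rate {rate} reached the abort threshold")
                break
            log.append(f"ok: error rate {rate}")
        else:
            # The loop ended without a break, so no reading crossed the threshold.
            log.append("completed within SLO")
    finally:
        # Cleanup runs on normal completion, abort, or an unexpected exception.
        rollback()
        log.append("rollback done")
    return log


if __name__ == "__main__":
    state = {"fault_active": False}
    events = run_with_slo_guard(
        inject=lambda: state.update(fault_active=True),
        rollback=lambda: state.update(fault_active=False),
        error_rates=[0.001, 0.009, 0.002],
        slo_error_rate=0.01,
    )
    for event in events:
        print(event)
```

Putting the rollback in a `finally` clause means an interrupted exercise is cleaned up the same way as a completed one, which is the behavior the precaution above asks for.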

“Mantra: The harder it gets to break, the more stable your system is”

What’s the end game of chaos engineering?

Its objective is to build resilient systems, which will consistently improve SLIs and ROI.

#chaosengineering #SRE #devopsworld #saasplatform #saasops #aws #cloudcomputing #cloud #devops

Coming soon, Part 2 of 2 → “Chaos Engineering in the world of IoT, Robotics and Edge computing”...

Co-Author: Sridhar Solur, GM|CPO|CTO - Robotics, IoT, SaaS
