Building Resilience for Cloud Native Apps

Brajesh De

Published Feb 23, 2021

What is Resilience?

Resilience is the ability to survive failures gracefully and bounce back to normal. A resilient system must keep operating even when the unexpected happens. It needs to detect changing conditions and self-heal by taking corrective actions to ensure high availability. Advanced systems also learn from past failures so that they can anticipate problems and act before a failure becomes catastrophic.

Today every enterprise is looking to build resilient systems to meet growing customer needs and expectations. Unavailability of systems impacts business revenue negatively. However, architecting resilient and highly available systems is non-trivial. It requires careful thought to identify and avoid failures in any component that could have a cascading impact on the rest of the system. This article looks at some architectural and design approaches that can be adopted to build resilience into a cloud-hosted distributed application.

How to Build Resilience in a System?

Build a Culture to Design and Build Resilience

Architecting systems for resilience needs a change in culture and mindset. First, accept that failure is normal, and start with the assumption that components in a system will fail. Think about all the components that can fail. Understand how the system behaves under adverse conditions and identify the root causes of failure. The architecture of a resilient system must detect failure and react to it. There must always be a backup plan to address failures.

System failure can happen at any of the following layers:

Infrastructure
Network
Data
Application

Each of these layers is typically built and maintained by a different team in an organization. It is therefore important to instill a culture across teams to architect, design and build resilience at each layer for every component that can fail.

Resilience also requires a very high level of automation within the DevOps processes. Automation must span the whole application lifecycle: infrastructure provisioning, application build and deployment, testing, monitoring, alert generation and response, and corrective action. There has to be a culture of automating every repetitive activity and removing manual intervention as much as possible.

Quality assurance teams should include automated resiliency testing as part of their test strategy. Testing resiliency is much more than testing for high availability. It needs a completely different mindset and approach from regular functional or non-functional testing. A chaos engineering tool such as Netflix's Chaos Monkey can be used to inject failures into various components and observe how the system behaves. If a failure in any component produces a larger cascading effect, address it as early as possible with the best available solution.
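
The idea of injecting failures can be tried out on a small scale inside an ordinary test suite. The sketch below is not Chaos Monkey itself, which works at the infrastructure level; it is a minimal in-process analogue. The names inject_faults, InjectedFault, get_price and get_price_with_fallback are hypothetical and exist only for this illustration.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Marks a failure raised on purpose by the fault injector."""


def inject_faults(failure_rate=0.2, max_delay=0.0, rng=None, sleep=time.sleep):
    """Decorator that makes a call fail, and optionally slow down, at random."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if max_delay > 0:
                sleep(rng.uniform(0, max_delay))  # simulate a slow dependency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.3, rng=random.Random(42))
def get_price(sku):
    # Hypothetical downstream call, used only for this illustration.
    return {"sku": sku, "price": 9.99}


def get_price_with_fallback(sku):
    """Behavior under test: degrade gracefully instead of propagating the error."""
    try:
        return get_price(sku)
    except InjectedFault:
        return {"sku": sku, "price": None, "degraded": True}
```

A resiliency test can then call get_price_with_fallback many times and assert that every call still returns a usable answer, even though roughly a third of the underlying calls fail.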

Most importantly, senior executives must understand the business value and impact and be ready to make the necessary investments required to build resilience.

Build Resilient Infrastructure

Infrastructure is at the base of all applications. Hence the design of any resilient application must start with architecting and designing resilient infrastructure. Cloud technology has made infrastructure provisioning easier and more cost effective. It provides many capabilities that can be leveraged to build resilient infrastructure. One such capability is provisioning infrastructure programmatically through API calls, also known as Infrastructure as Code.

The Infrastructure as Code approach brings repeatability and removes most manual errors, both of which help to build resilient infrastructure. AWS CloudFormation templates or Terraform can be used to configure a complete datacenter, from setting up VMs and configuring network and security to deploying the applications, in a completely automated manner with no human intervention. This approach makes it possible to spin up a complete production environment automatically and quickly after a compromise, outage, human error or disaster. Adopting the principle of Immutable Infrastructure alongside Infrastructure as Code means that deployed servers are never modified after deployment; a change is rolled out by replacing them with new instances built from an updated definition. It also helps to speed up deployments. The principles of Immutable Infrastructure provide the following benefits for building resilient infrastructure:

Infrastructure is consistent and reliable.
Deployment is simpler and more predictable.
Each deployment is versioned and automated, so environment rollback is a breeze.
Errors, configuration drifts, and snowflake servers are mitigated or eliminated entirely.
Deployment remains consistent across all environments (dev, test, and prod).

Design Resilient Applications

While enterprises move their workloads to the cloud for high availability, they must also design their applications to be cloud native, using microservices and container technologies. When designing applications with microservices, there are several design strategies that can be adopted to withstand failures and further improve the overall resiliency of the applications. Depending on the business scenario, one or more of the design patterns below can be applied to minimize or eliminate cascading impact.

Stateless Services - Design applications with stateless services. Stateless services are easier to replicate in the event of a failure because they do not hold any state information. Since they are typically lightweight, new instances can also be started dynamically when the need arises, whether due to heavy load or to a failure. Elastic load balancers can distribute requests across the deployed instances of the application to improve availability and resiliency. Load balancers can be configured to stop routing traffic to instances that are down or non-responsive. Once instances return to service, the load balancer can start routing traffic to them again. This way the consumer applications are not impacted by the unavailability of any single application instance.
Event-Driven Processing - An event-driven approach helps to prevent cascading failures and isolate faults. A publish and subscribe communication paradigm brings loose coupling between components and eliminates the direct dependencies created by synchronous communication. This helps to ensure that overall system availability is not compromised even when one of the backend processing components is down.
Timeouts - The timeout pattern is one of the simplest and most common patterns for designing resilient applications. When a client connects to a backend service, the backend may be slow and delay its response. If too many requests arrive during this period of slowness, all the client's resources may be used up while it waits for responses from the backend system. This can cause a cascading impact that brings the entire system down. Configuring connection and read timeouts at the client end releases resources back to the pool when the backend system or database takes longer than usual. It thus improves stability and resiliency.
Retry with exponential back-off - Sometimes failures are transient and last only a few seconds. In such cases retrying a few times may help in getting a response from the backend system. However, frequent retries may overload a backend system that is already responding slowly. To avoid this, it is recommended to increase the wait time between retries using an exponential back-off pattern.
Retry with Circuit Breaker - If the backend failure is non-transient and long lasting, the circuit breaker pattern is a better option because it prevents futile retries. This helps to conserve resources and gives callers more immediate feedback.
Bulkhead - A cloud-based application may include multiple services, with each service having multiple consumers. A failure in one service may impact all of its consumer applications. The same consumer application may also be sending requests to several other services. If the consumer sends a request to a slow or overloaded service and waits for the response, the resources used by that request may not be freed in a timely manner, and eventually all of the consumer's resources can be exhausted. To prevent this, a bulkhead pattern can be applied to isolate consumers and services from cascading failures by restricting the resources available to each specific scenario. This allows you to preserve some functionality in the event of a service failure.
Throttling - Throttling, or rate limiting, limits the number of incoming requests that are processed within a given time window. This approach helps to keep throughput within the agreed SLAs and conserves resources by accepting only as many requests as the service can handle. It thus keeps the service running without exhausting its capacity, thereby improving the availability and resiliency of the system.
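
Three of the patterns above (timeouts, retry with exponential back-off and circuit breaker) are often combined in one client. The following is a minimal sketch rather than a production implementation. The class and function names, the thresholds and the inventory URL are all assumptions made for illustration; only urllib.request.urlopen and its timeout parameter come from the Python standard library.

```python
import time
import urllib.request


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then allows a
    trial call once `reset_timeout` seconds have passed (half-open state)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: fall through and allow a trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call func, doubling the wait after each failure, up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker is open, so retrying would be wasted work
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))


def fetch_inventory(url="https://inventory.example.internal/items", timeout=2.0):
    # Hypothetical backend call; the timeout bounds how long the socket may block.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)


def get_inventory():
    # Retries absorb brief blips; the breaker cuts retries short in long outages.
    return retry_with_backoff(lambda: breaker.call(fetch_inventory))
```

With the default settings, a call that keeps failing waits 0.5, 1 and 2 seconds between its four attempts. Once three consecutive failures have been recorded, further calls fail immediately for 30 seconds instead of tying up client resources.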

Deploy Applications with High Availability

One of the important aspects of building resilient systems is ensuring high availability. The most common approach to high availability is to build in redundancy. Redundancy is the duplication of components of a system in order to increase overall availability. It provides a fallback in case of failure. Today cloud technology provides multiple options for this. Below are some of the most commonly adopted approaches:

Availability Sets can help to protect against localized hardware failures and provide high availability. VMs in an availability set are distributed across multiple fault domains. If there is a power outage or a hardware or network failure in a particular fault domain, traffic is routed to VMs in another fault domain that is up and running. An availability set helps to increase the availability of applications within the same data center.

Availability Zone refers to one or more discrete data centers with redundant power, networking and connectivity in a single geographic region. Multiple availability zones are connected through low-latency, high-throughput networking channels to maximize service availability and redundancy. Deploying applications across multiple availability zones within a region therefore helps to achieve higher fault tolerance and improves the overall availability and resiliency of the system.

Availability Region is a physical location in the world that has a cluster of data centers. Each region consists of multiple, isolated, and physically separate availability zones within a geographic area. Deploying applications across multiple regions helps to maintain availability in case of a natural calamity or disaster in one region.

Auto Scaling is another proven way to improve resiliency by enabling an application to scale up or down to meet demands.
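
The gain from redundancy can be estimated with simple arithmetic. The sketch below assumes that replicas fail independently of each other, which real deployments only approximate; the function name and the 99% figure are illustrative assumptions, not measurements.

```python
def redundant_availability(single_availability, replicas):
    """Availability of a group that works while at least one replica is up.

    Assumes replica failures are independent; shared software bugs or
    region-wide events break that assumption in practice.
    """
    if not 0.0 <= single_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single_availability) ** replicas


# Hypothetical example: an instance that is up 99% of the time,
# deployed in one, two and three availability zones.
for zones in (1, 2, 3):
    print(zones, round(redundant_availability(0.99, zones), 6))
```

Under these assumptions, the combined availability rises from 0.99 with one zone to 0.9999 with two and 0.999999 with three, which is why spreading instances across fault domains, zones or regions is the usual first step towards high availability.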

Conclusion

Every enterprise wants all of its systems and applications to be highly available and resilient. But do all systems need to be resilient and have the highest levels of availability? Long hours of outage and downtime are definitely not acceptable. However, intermittent errors and partial failures are the norm rather than the exception for large-scale distributed systems in the cloud. Running in a partial failure mode is a viable option for most large-scale systems, except for life-critical ones.

Significant investments are needed to build highly available and resilient systems, and the cost rises steeply with the degree of resiliency and availability required. The answer to the question therefore lies in striking a balance between the cost of being highly available and the loss of revenue due to outages. If the revenue lost to outages is greater than the cost, there is a good business case to invest in resilience using a combination of the approaches outlined in this article.
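
The trade-off between the cost of availability and the revenue lost to outages can be put into numbers. The sketch below is a deliberately simplified model: it considers only lost revenue, and every figure in it is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def expected_outage_loss(availability, revenue_per_hour):
    """Expected yearly revenue lost to downtime at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR * revenue_per_hour


def resilience_pays_off(current, target, revenue_per_hour, annual_cost):
    """True when the avoided outage loss exceeds the yearly cost of the upgrade."""
    avoided = (expected_outage_loss(current, revenue_per_hour)
               - expected_outage_loss(target, revenue_per_hour))
    return avoided > annual_cost


# Hypothetical figures: moving from 99% to 99.9% availability for a service
# that earns 10,000 per hour, at an assumed extra cost of 500,000 per year.
print(resilience_pays_off(0.99, 0.999, 10_000, 500_000))  # prints True
```

In this made-up case, 99% availability means about 87.6 hours of downtime a year, or roughly 876,000 in lost revenue, against about 87,600 at 99.9%. The avoided loss of about 788,400 is larger than the assumed cost of 500,000, so the investment would be justified.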

Agnivesh Verma 5y

Good read and aptly pointed that resiliency should be built-in at every layer not only at Infrastructure where it usually gets the most traction.
