How to ensure the Highest Availability with EC2 Infrastructure in AWS
I have written a few little tidbits (braindumps) on Security and Cost Optimisation (which are 2 of the 5 pillars of the Well-Architected Framework). I am now going to talk a little about my thoughts on Reliability (another of the 5 pillars). AWS provide excellent uptime Service Level Agreements (SLAs) for a large number of their services, but qualifying for those SLAs requires a few decisions from the customer or partner when designing or architecting applications and services within AWS.
The way to get the best outcome in the case of a failure within one component is to ensure that it is not the only component
So stick with me (I hope it won't be long and it may even be fun). I will focus mostly on infrastructure in this write-up (I could go on for days if I expanded to cover refactoring and architecting resilience into migrations, so if anyone wants me to spend time on refactoring, shoot me a message).
Design For Failure
No one wants anything to fail, but any cloud architect worth their reputation assumes that things will. AWS offer a 99.99% SLA on some services such as EC2, but in order to qualify for that benefit, architects need to ensure that applications exist on compute resources spread across 2 or more Availability Zones within a Region.
AWS define an Availability Zone as one or more data centres that do not share resources, power or underlying hardware components. Each is isolated from every other Availability Zone within a Region, and all Regions have at least 2. As of writing, Australia's Sydney Region has 3.
When I started in IT around 20 years ago, the data centres I worked in were using pizza-box blades hosting single Windows or Linux servers, each running more than one component bundled together (web server + database, front end plus mid-tier app, etc.). It made sense at the time as powerful hardware didn't come cheap. It makes no sense now, and the principle really should be: "Break an application down into as many components as possible and build reliability into each." It's a principle referred to as Decoupling, which simply means removing dependencies so that a failure in one component won't impact others.
Failures are not always incidents. A failure can be the result of planned maintenance work in the backend; if a single point of failure exists in an application, that maintenance can become an outage, yet simply spreading the application across multiple Availability Zones can prevent it.
Exhibit A:
The above is pretty simple (not much complexity), but it shows how you can achieve a reasonably good level of high availability and resilience by using a Load Balancer, an Auto Scaled AWS environment and high availability built into a database cluster. As with most things, "they have a tool for that" is the appropriate response to the sorts of questions that may come up.
Cloud offers the tools to deliver well-architected solutions, and the right partner or solution provider can help ensure that failures in a component or tier can be tolerated while the application hums along.
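To give a flavour of "they have a tool for that", here is a minimal boto3 (Python) sketch of the database piece of Exhibit A: asking RDS for a Multi-AZ deployment so AWS maintains a standby in a second Availability Zone and fails over to it automatically. Every identifier and credential below is a placeholder for illustration only, not something from a real environment.

import boto3

rds = boto3.client("rds", region_name="ap-southeast-2")

# All names and credentials here are illustrative placeholders.
rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    Engine="mysql",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="replace-with-a-real-secret",
    MultiAZ=True,  # RDS provisions a synchronous standby in a second AZ
)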
Scale for Resilience as well as performance
I do love Auto Scaling Groups, and I believe that in a lot of multi-tiered applications there is a place for them. AWS have provided the foundations with Auto Scaling Groups to build a well-architected, load-balanced application that can scale for performance and recover from failure. If you are building a web application with EC2 as the primary compute source, my recommendation would be to make your front-end web servers stateless (i.e. no part of the user's session is stored on the web server itself). If you do that, you can pop the web servers inside an Auto Scaling Group so that it just keeps on performing. If a web server becomes unresponsive, the ASG will terminate and replace it.
The use case for Auto Scaling Groups is when you can tolerate the failure of an EC2 instance and the loss of any data on that instance is not disruptive.
Note: DO NOT use an Auto Scaling Group if the failure of an EC2 instance would cause disruption to the application (i.e. if it stores user state or dynamic application configuration).
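To make the stateless web tier concrete, here is a minimal boto3 sketch of the sort of Auto Scaling Group described above: a launch template (assumed to already exist), subnets in two Availability Zones, and the load balancer's health check so an unresponsive web server is replaced rather than left limping. All names, IDs and ARNs are hypothetical.

import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-2")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",              # hypothetical group name
    LaunchTemplate={
        "LaunchTemplateName": "web-template",    # assumed to already exist
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    # Two subnets in two different Availability Zones (hypothetical IDs).
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    TargetGroupARNs=[
        # Hypothetical target group behind the web tier's load balancer.
        "arn:aws:elasticloadbalancing:ap-southeast-2:123456789012:targetgroup/web/0123456789abcdef",
    ],
    # Use the load balancer health check so instances that stop responding
    # (not just instances that stop running) are terminated and replaced.
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)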
More recent enhancements to Auto Scaling Groups also permit the use of Spot Instances to provide part of the group's capacity. Running ASGs with a mix of Spot, On-Demand and Reserved compute consumption is often considered a good way to balance cost and performance needs.
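As a hedged sketch of what that can look like, the same hypothetical group from the previous example could be updated with a mixed instances policy; the split between On-Demand and Spot below is purely illustrative rather than a recommendation.

import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-2")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-asg",                    # the hypothetical group above
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-template",  # assumed to already exist
                "Version": "$Latest",
            },
        },
        "InstancesDistribution": {
            # Keep one On-Demand instance as a guaranteed base, then split
            # any capacity above that 50/50 between On-Demand and Spot.
            "OnDemandBaseCapacity": 1,
            "OnDemandPercentageAboveBaseCapacity": 50,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)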
Build in Recovery Automation
My job (a lot of it at the moment) is to work out how to build something that fixes itself or manages itself. The word automation is used very, very often in the IT sphere, and with good reason. If you want to be able to manage workloads at scale (as Datacom, a premier MSP Partner in multiple geographies, does) then it is important to automate wherever possible.
I thought maybe I would reference a few little nuggets that may help in thinking along those lines.
I have already spoken about Auto Scaling Groups, but what happens if you simply use a pair of EC2 instances across 2 AZs that operate in parallel and one fails? Reboot it, right? What if you had dozens of accounts, environments, etc. where that was a risk? It isn't practical to go in and manually respond to every failure.
I refer to Exhibit B (from the AWS documentation), which is a solution I now think is important to consider when architecting for failure: EC2 instance recovery.
When an instance fails an AWS status check (the above is targeted more at system status checks, but instance status checks work the same way) you can trigger a recovery action through a CloudWatch Alarm, which will automatically stop and start an instance that fails a check. If that happens once there may not be a recurring issue to investigate, but if it happens frequently there may be. When configured, notifications of these actions are sent so that the appropriate technical resource can investigate.
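A minimal boto3 sketch of that alarm might look like the following; the instance ID and region are placeholders, and the recover action is the built-in CloudWatch alarm action AWS provide for exactly this purpose.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")

instance_id = "i-0123456789abcdef0"  # placeholder instance ID

cloudwatch.put_metric_alarm(
    AlarmName=f"recover-{instance_id}",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",  # the system status check
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,                    # three consecutive failed minutes
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # The built-in recover action stops and starts the instance,
    # moving it onto healthy underlying hardware.
    AlarmActions=["arn:aws:automate:ap-southeast-2:ec2:recover"],
)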
CloudWatch has a great deal of power in this sort of scenario, and another example of how we have architected such a solution is to enable automatic rebooting of an EC2 instance if it fails a Load Balancer health check. In that case we have 2 nodes attached to an ELB, and when one fails a health check an SNS notification is generated and a Lambda function is triggered to reboot it.
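As a sketch of that second pattern, the Lambda handler could look something like the code below. It assumes an Application Load Balancer target group with instance targets, whose ARN is passed in via an environment variable, and it simply reboots whichever registered instances the target group currently reports as unhealthy; our real function differs in the details, so treat this as an illustration of the idea rather than the exact implementation.

import os
import boto3

elbv2 = boto3.client("elbv2")
ec2 = boto3.client("ec2")

# Hypothetical: the target group ARN is supplied as an environment variable.
TARGET_GROUP_ARN = os.environ["TARGET_GROUP_ARN"]

def handler(event, context):
    """Invoked via SNS when the load balancer reports an unhealthy host.

    Looks up which targets are currently unhealthy and reboots them.
    Assumes instance targets (not IP targets) in the target group.
    """
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    unhealthy = [
        t["Target"]["Id"]
        for t in health["TargetHealthDescriptions"]
        if t["TargetHealth"]["State"] == "unhealthy"
    ]
    if unhealthy:
        ec2.reboot_instances(InstanceIds=unhealthy)
    return {"rebooted": unhealthy}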
Again, as I keep saying, these should be non-disruptive actions for an application that is built to withstand the failure of a single virtual machine.
Conclusion
Hopefully this was interesting and informative (and not too long). It is one of my favourite topics to discuss when I get my geek brain going and start breaking down an application a customer might use. One of my favourite analogies for compute in the cloud is to not treat an EC2 instance like a pet that needs much love and nurturing. If you treat it as something you can simply destroy and recover with no loss of functionality, you will have a much more scalable and performant environment.
AWS provide hundreds of new features and capabilities every year, and what is best of breed constantly changes, but the mindset doesn't. Build your architecture on the assumption that parts of it will fail (through either unplanned disruptions or planned outages), make sure that none of those failures disrupt the availability of your application, and your experience in the cloud with AWS will be a great one.