How to ensure the Highest Availability with EC2 Infrastructure in AWS
I have written a few little tidbits (braindumps) on Security and Cost Optimisation (which are 2 of the 5 pillars of the Well-Architected Framework). I am now going to talk a little about my thoughts on Reliability (another of the 5 pillars). AWS provide excellent uptime Service Level Agreements (SLAs) for a large number of their services, but qualifying for those SLAs requires a few decisions from the customer or partner when designing or architecting applications and services within AWS.
The way to get the best outcome in the case of a failure within one component is to ensure that it is not the only component
So stick with me (I hope it won't be long and it may even be fun). I will focus mostly on infrastructure in this write-up (I could go on for days if I expanded to cover refactoring and architecting resilience into migrations, so if anyone wants me to spend time on refactoring, shoot me a message).
Design For Failure
No one wants anything to fail, but any cloud architect worth their reputation assumes that things will. AWS offer a 99.99% SLA on some services such as EC2, but in order to qualify for that benefit, architects need to ensure that applications exist on compute resources spread across 2 or more Availability Zones within a Region.
AWS define an Availability Zone as one or more data centres that do not share resources, power or underlying hardware components. Each is isolated from every other Availability Zone within a Region, and all Regions have at least 2. As of writing, Australia's Sydney Region has 3.
When I started in IT around 20 years ago, the data centres I worked in were using pizza-box blades hosting single Windows or Linux servers, each running more than one component bundled together (web server + database, front end plus mid-tier app, etc.). It made sense at the time as powerful hardware didn't come cheap. It makes no sense now, and the principle really should be: "Break an application down into as many components as possible and build reliability into each." It's a principle referred to as Decoupling, which simply means removing dependencies so that a failure in one component won't impact others.
Failures are not always incidents. A failure can be the result of planned maintenance work in the backend; if a single point of failure exists in an application, that maintenance can become an outage, yet simply spreading the application across multiple Availability Zones can prevent it.
Exhibit A:
The above is pretty simple (not much complexity), but it shows how you can achieve a reasonably good level of high availability and resilience by using a Load Balancer, an Auto Scaled AWS environment and high availability built into a database cluster. As with most things, "they have a tool for that" is the appropriate response to the sorts of questions that may come up.
Cloud offers the tools to deliver well-architected solutions, and the right partner or solution provider can help ensure that failures in a component or tier can be tolerated while the application hums along.
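To give a flavour of "they have a tool for that", here is a minimal boto3 (Python) sketch of the database piece of Exhibit A: asking RDS for a Multi-AZ deployment so AWS maintains a standby in a second Availability Zone and fails over to it automatically. Every identifier and credential below is a placeholder for illustration only, not something from a real environment.

import boto3

rds = boto3.client("rds", region_name="ap-southeast-2")

# All names and credentials here are illustrative placeholders.
rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    Engine="mysql",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="replace-with-a-real-secret",
    MultiAZ=True,  # RDS provisions a synchronous standby in a second AZ
)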
Scale for Resilience as well as performance
I do love Auto Scaling Groups, and I believe that in a lot of multi-tiered applications there is a place for them. AWS have provided the foundations with Auto Scaling Groups to build a well-architected, load-balanced application that can scale for performance and recover from failure. If you are building a web application with EC2 as the primary compute source, my recommendation would be to make your front-end web servers stateless (i.e. no part of the user's session is stored on the web server itself). If you do that, you can pop the web servers inside an Auto Scaling Group so that it just keeps on performing. If a web server becomes unresponsive, the ASG will terminate and replace it.
The use case for Auto Scaling Groups is when you can tolerate the failure of an EC2 instance and the loss of any data on that instance is not disruptive.
Note: DO NOT use an Auto Scaling Group if the failure of an EC2 instance would cause disruption to the application (i.e. if it stores user state or dynamic application configuration).
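To make the stateless web tier concrete, here is a minimal boto3 sketch of the sort of Auto Scaling Group described above: a launch template (assumed to already exist), subnets in two Availability Zones, and the load balancer's health check so an unresponsive web server is replaced rather than left limping. All names, IDs and ARNs are hypothetical.

import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-2")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",              # hypothetical group name
    LaunchTemplate={
        "LaunchTemplateName": "web-template",    # assumed to already exist
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    # Two subnets in two different Availability Zones (hypothetical IDs).
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    TargetGroupARNs=[
        # Hypothetical target group behind the web tier's load balancer.
        "arn:aws:elasticloadbalancing:ap-southeast-2:123456789012:targetgroup/web/0123456789abcdef",
    ],
    # Use the load balancer health check so instances that stop responding
    # (not just instances that stop running) are terminated and replaced.
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)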
More recent enhancements to Auto Scaling Groups also permit the use of Spot Instances to provide part of the group's capacity. Running ASGs with a mix of Spot, On-Demand and Reserved compute consumption is often considered a good way to balance cost and performance needs.
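As a hedged sketch of what that can look like, the same hypothetical group from the previous example could be updated with a mixed instances policy; the split between On-Demand and Spot below is purely illustrative rather than a recommendation.

import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-2")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-asg",                    # the hypothetical group above
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-template",  # assumed to already exist
                "Version": "$Latest",
            },
        },
        "InstancesDistribution": {
            # Keep one On-Demand instance as a guaranteed base, then split
            # any capacity above that 50/50 between On-Demand and Spot.
            "OnDemandBaseCapacity": 1,
            "OnDemandPercentageAboveBaseCapacity": 50,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)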
Build in Recovery Automation
My job (a lot of it at the moment) is to work out how to build something that fixes itself or manages itself. The word automation is used very, very often in the IT sphere, and with good reason. If you want to be able to manage workloads at scale (as Datacom, a premier MSP Partner in multiple geographies, does) then it is important to automate wherever possible.
I thought maybe I would reference a few little nuggets that may help in thinking along those lines.
I have already spoken about Auto Scaling Groups, but what happens if you simply use a pair of EC2 instances across 2 AZs that operate in parallel and one fails? Reboot it, right? What if you had dozens of accounts, environments, etc. where that was a risk? It isn't practical to go in and manually respond to every failure.
I refer to Exhibit B (from the AWS documentation), which is a solution I now think is important to consider when architecting for failure: EC2 instance recovery.
When an instance fails an AWS status check (the above is targeted more at system status checks, but instance status checks work the same way) you can trigger a recovery action through a CloudWatch Alarm, which will automatically stop and start an instance that fails a check. If that happens once there may not be a recurring issue to investigate, but if it happens frequently there may be. When configured, notifications of these actions are sent so that the appropriate technical resource can investigate.
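A minimal boto3 sketch of that alarm might look like the following; the instance ID and region are placeholders, and the recover action is the built-in CloudWatch alarm action AWS provide for exactly this purpose.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")

instance_id = "i-0123456789abcdef0"  # placeholder instance ID

cloudwatch.put_metric_alarm(
    AlarmName=f"recover-{instance_id}",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",  # the system status check
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,                    # three consecutive failed minutes
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # The built-in recover action stops and starts the instance,
    # moving it onto healthy underlying hardware.
    AlarmActions=["arn:aws:automate:ap-southeast-2:ec2:recover"],
)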
CloudWatch has a great deal of power in this sort of scenario, and another example of how we have architected such a solution is to enable automatic rebooting of an EC2 instance if it fails a Load Balancer health check. In that case we have 2 nodes attached to an ELB, and when one fails a health check an SNS notification is generated and a Lambda function is triggered to reboot it.
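As a sketch of that second pattern, the Lambda handler could look something like the code below. It assumes an Application Load Balancer target group with instance targets, whose ARN is passed in via an environment variable, and it simply reboots whichever registered instances the target group currently reports as unhealthy; our real function differs in the details, so treat this as an illustration of the idea rather than the exact implementation.

import os
import boto3

elbv2 = boto3.client("elbv2")
ec2 = boto3.client("ec2")

# Hypothetical: the target group ARN is supplied as an environment variable.
TARGET_GROUP_ARN = os.environ["TARGET_GROUP_ARN"]

def handler(event, context):
    """Invoked via SNS when the load balancer reports an unhealthy host.

    Looks up which targets are currently unhealthy and reboots them.
    Assumes instance targets (not IP targets) in the target group.
    """
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    unhealthy = [
        t["Target"]["Id"]
        for t in health["TargetHealthDescriptions"]
        if t["TargetHealth"]["State"] == "unhealthy"
    ]
    if unhealthy:
        ec2.reboot_instances(InstanceIds=unhealthy)
    return {"rebooted": unhealthy}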
Again, as I keep saying, these should be non-disruptive actions for an application that is built to withstand the failure of a single virtual machine.
Conclusion
Hopefully this was interesting and informative (and not too long). It is one of my favourite topics to discuss when I get my geek brain going and start breaking down an application a customer might use. One of my favourite analogies for compute in the cloud is to not treat an EC2 instance like a pet that needs much love and nurturing. If you treat it as something you can simply destroy and recover with no loss of functionality, you will have a much more scalable and performant environment.
AWS provide hundreds of new features and capabilities every year, and what is best of breed constantly changes, but the mindset doesn't. Build your architecture on the assumption that parts of it will fail (through either unplanned disruptions or planned outages), make sure that none of those failures disrupt the availability of your application, and your experience in the cloud with AWS will be a great one.