Achieving Scale with a Compute-Intensive Application on AWS

Here's a short article on how a previous team of mine achieved very significant scale on AWS for a system involving a lot of heavy, batch-type computing. Though this is AWS-centric, other tools could be used as well; I'm running with AWS because that's what we used and the project worked out very nicely. I am an AWS fan, but I don't work for AWS, nor do I have anything to gain by promoting their services.

Introduction

It seems that startups and smaller companies often MVP a feature and then, if that feature does well in the market, are left scrambling to figure out how to make it scale. Although this approach does not appeal much to me as a software engineer, I certainly get why startups often do it this way. Bigger companies, on the other hand, who know they need scale right from the start, can plan for it. Here's how we did it on a previous team at a big tech company where scale was of paramount importance. It's not complicated, and there's no reason a smaller shop couldn't do the same, especially given the generous free-tier services and startup credits that AWS and other cloud vendors offer.

None of this is groundbreaking, and all of it can be found in much more complete detail in something like AWS' Well-Architected Framework. We also leveraged A Cloud Guru to learn a lot about AWS as we got started, and they cover this kind of thing extensively as well. I highly recommend checking them out for AWS or any other cloud computing technology you're interested in.

Our Solution

Our solution was pretty simple:

  • Use a message queue.
  • Use compute-optimized services for the heavy lifting.
  • Use a notification system for additional orchestration.
  • Use a service optimized for quick tasks and high scale to handle things like the API.

Use a message queue

We used Amazon Simple Queue Service (SQS), a fully managed message queue. Rather than having our API handlers doing any kind of heavy lifting, we enqueued longer-running tasks and responded quickly with an acknowledgement. Whenever part of the system required significant processing from another part of the system, it went through the message queue.
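To make the idea concrete, here's a minimal sketch of the enqueue-and-acknowledge pattern. The queue URL and job schema are hypothetical, and the send function is injected so the sketch runs without AWS credentials; in a real deployment it would be `boto3.client("sqs").send_message`.

```python
import json
import uuid

# Hypothetical queue URL; in practice this would come from config or an
# environment variable.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/heavy-jobs"

def enqueue_job(send_message, payload: dict) -> str:
    """Enqueue a long-running task and return a job ID for follow-up.

    `send_message` is injected (e.g. a boto3 SQS client's send_message)
    so the API handler stays quick and the logic stays testable.
    """
    job_id = str(uuid.uuid4())
    body = json.dumps({"job_id": job_id, "payload": payload})
    send_message(QueueUrl=QUEUE_URL, MessageBody=body)
    return job_id  # returned to the caller as an acknowledgement
```

The key point is that the handler does no heavy work itself: it records the task and immediately hands back an ID the caller can use later.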

Use compute-optimized services for the heavy lifting

We used Amazon Elastic Container Service (ECS) running on EC2. You can also run ECS on Fargate (serverless compute), though pricing may be a consideration. We used a combination of reserved and on-demand EC2 instances: the reserved instances provided a base of lower-cost compute, while on-demand gave us elasticity. ECS auto scaling allowed us to easily scale up and down. This was a few years ago, so these days Amazon Elastic Kubernetes Service (EKS) might be preferable, though ECS is certainly still a viable option, is probably easier to get started with, and still lets you containerize your application components. Interesting side note: I hear that the majority of people running Kubernetes on AWS are not using EKS.

Tasks from the message queue were picked up by the services running in ECS and processed at a time convenient to the overall system. This way a flood of incoming requests would not make the system unresponsive or cause significant downtime, and we could scale up and down according to need.
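A worker running in an ECS task can be sketched as a simple poll-process-delete loop. This is a sketch under assumptions: `receive` and `delete` stand in for a boto3 SQS client's `receive_message`/`delete_message`, and `handle` is whatever compute-intensive function your system needs.

```python
import json

def poll_and_process(receive, delete, handle, wait_seconds=20):
    """One iteration of a worker loop as it might run in an ECS task.

    `receive`/`delete` mimic boto3 SQS receive_message/delete_message;
    `handle` does the heavy compute. Returns the number of tasks
    processed this iteration.
    """
    # Long polling (WaitTimeSeconds) avoids hammering the queue when idle.
    resp = receive(MaxNumberOfMessages=10, WaitTimeSeconds=wait_seconds)
    processed = 0
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])
        handle(task)  # the compute-intensive work happens here
        # Delete only after successful processing, so a crashed worker's
        # message reappears after the visibility timeout and is retried.
        delete(ReceiptHandle=msg["ReceiptHandle"])
        processed += 1
    return processed
```

Because the worker pulls at its own pace, a burst of incoming requests just lengthens the queue rather than overwhelming the compute tier.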

Use a notification system for additional orchestration

We used Amazon Simple Notification Service (SNS) to orchestrate the system. SNS messages let workers running in ECS know that they should look for new tasks in the message queue, and also let the rest of the system know when tasks were completed. We used SNS for other, smaller aspects of orchestration as well, but I'm focusing on the compute-intensive part of the solution here.
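Publishing a completion event can be sketched like this. The topic ARN and event schema are hypothetical, and `publish` is injected in place of `boto3.client("sns").publish` so the sketch runs locally.

```python
import json

# Hypothetical topic ARN; in practice this would come from config.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:job-events"

def notify_completed(publish, job_id: str, status: str = "COMPLETED") -> str:
    """Publish a job-status event so the rest of the system can react.

    `publish` mimics a boto3 SNS client's publish call. Subscribers
    (other services, another SQS queue, email, etc.) receive the event.
    """
    message = json.dumps({"job_id": job_id, "status": status})
    publish(TopicArn=TOPIC_ARN,
            Message=message,
            Subject=f"job-{status.lower()}")
    return message
```

Fanning these events out through SNS keeps the workers from needing to know who cares about job completion.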

Use a service optimized for quick tasks and high scale to handle things like the API

We used AWS Lambda and API Gateway to serve an API. Since all the heavy lifting was done in ECS/EC2, the handlers were all quite simple: mostly encoding the inbound requests as messages in SQS and providing a quick response, usually including some kind of ID the user (in our case, an adjacent system implemented in an entirely separate network) could use to follow up on the status of the job. Lambda is great for this because you don't need to set up another service or worry about scaling, and pricing scales basically right down to zero.
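A handler of this shape might look like the following sketch. It assumes the API Gateway proxy-integration event format (the request body arrives as a JSON string), and `enqueue` is a hypothetical injected function standing in for the SQS send, so the handler can be exercised locally.

```python
import json
import uuid

def handler(event, context, enqueue=None):
    """API Gateway -> Lambda handler: validate, enqueue, ack with a job ID.

    `enqueue` stands in for pushing the task to SQS; in a real deployment
    it would wrap boto3's send_message.
    """
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400,
                "body": json.dumps({"error": "invalid JSON"})}
    job_id = str(uuid.uuid4())
    if enqueue:
        enqueue({"job_id": job_id, "payload": payload})
    # 202 Accepted: the actual work happens later, in ECS.
    return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}
```

Returning 202 rather than 200 signals to the caller that the request was accepted for asynchronous processing, and the job ID is what they poll with later.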

Counter Argument

One reason you might not want to use the above is to steer clear of cloud vendor lock-in. Fair enough. Maybe that means you want to build your application entirely on something like Kubernetes, leveraging only open-source components and none of the bespoke services that AWS or other cloud vendors offer. It's a valid, though complicated, argument involving a lot of tradeoffs. An interesting (and perhaps incorrect) observation of mine, though, is that big, successful, scalable applications often end up on AWS regardless of where they start (assuming they end up in some public cloud, that is). The same is probably true of Azure for Windows-oriented, .NET-type applications, in which case the concepts above can likely be re-mapped to Azure services.

Final thoughts

The simple architecture described above scales very nicely for this kind of system. AWS uses the term "decoupling" to describe using a message queue along with the other components in your system in order to scale, and I think this terminology makes sense. If I were MVPing a new idea or launching a startup involving this kind of compute-heavy backend processing, I would consider adopting something like this right from the start, or at least very early on, even if it is a bit more up-front work.

As a developer, working on a project such as the one described here will go a long way toward getting an AWS certification, which appears to be highly valued in the industry. I ended up getting the AWS Certified Developer (Associate) certification without a huge amount of additional work: just going through the relevant A Cloud Guru courses and doing some practice exams. I've noticed a large uptick in the number of inbound messages from recruiters since getting my certification, but maybe there are other factors involved as well.

Feedback

Let me know if this is of interest to you. I may write another article about my Rust learning journey thus far. Or I may write about my free online music theory "game", which is essentially my playground for trying out new technologies mixed with my interest in music. I use quotes for "game" because, unless you're seriously into the mathematical side of music, you probably won't find it very fun. The game has been around for quite a while in various incarnations, and I recently rewrote the backend in Rust, so I can talk about how insane the performance is now, as well as how nice Rust is for modelling the kinds of math-ish musical concepts at the core of the game. I can also talk about the AWS components I'm using for it, and how I'm statically hosting the frontend on S3 and delivering it over CloudFront.

And finally, I'm curious what you think is the right place for these kinds of articles. It seems like LinkedIn has become more for corporate marketing, and I don't want my stuff to be misinterpreted as that, so I'm not convinced this is the right place for it. I've wondered about GitHub Pages, but I'd like to be able to get some feedback. Maybe Medium. What do you think? And what do you think of the depth of this article? Thanks for your input.

