Why do CPU-Time and Optimization count?
Let's talk a little bit about CPU-time, and how it can have a really big impact on your infrastructure pricing.
Of course, as a developer, I already know that features are more important than optimization... Unless they are not.
During my work as a Backend Developer/Solution Architect, I am constantly facing these specific problems:
- Which methods should I use for scaling?
- Do I need to think about optimization or not?
- Do I have time to optimize?
- What limitations am I facing?
These questions can be a big factor in the final cost of an infrastructure (especially on Cloud infrastructures, which will probably autoscale).
Undervaluing these questions can lead to a massive increase in your infrastructure cost.
The Serverless problem
One common example here is Serverless. Let's list the pros and cons of AWS Lambda, for example:
Pros:
+ No need to think about scaling! It scales for you.
+ Cloud-based and integrated with nearly every AWS service.
+ Very well suited for cron-type tasks.
+ Cost effective, because you don't pay for any servers!
+ Cost effective, because you save time on OPS!
That's what AWS will use to sell you Lambda.
Now, the cons:
- Not cost effective at all when handling a massive amount of calls.
- 15 minutes of execution time maximum.
- Slow start time (warmup): up to 6 seconds in the worst cases (measured on Azure).
Managed services like Lambda can be a massive gain in terms of OPS time, but you can lose money on the service cost. Always keep that in mind when you use managed services!
Now, let's talk about scaling.
Most modern applications are designed around REST-HTTP based services. Sometimes they are monolithic, sometimes they are not (currently, Microservices have a very good reputation thanks to the new Cloud providers and OPS tools coming onto the market). It really depends on the needs, the domains and the features of your application.
Scaling a REST service is simple: design your application to be stateless, use user tokens and a shared cache (Redis, etc.) to share state across your infrastructure, and tadaaaa, all you need are valid metrics to scale your application properly.
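To make this concrete, here is a minimal sketch of such a stateless endpoint, assuming a Flask API and a Redis cache (the endpoint and key names are purely illustrative): the session lives in the cache, not in process memory, so any server behind the load balancer can answer the call.

```python
import json

import redis
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
cache = redis.Redis(host="redis", port=6379, db=0)  # shared state lives here

@app.get("/profile")
def get_profile():
    # The request only carries the user token; the session itself is fetched
    # from the shared cache, so any instance behind the load balancer can serve it.
    token = request.headers.get("Authorization", "")
    raw = cache.get(f"session:{token}")
    if raw is None:
        abort(401)
    session = json.loads(raw)
    return jsonify(user_id=session["user_id"])
```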
But let's now try a new scenario:
You have a WebSocket service that is used as a Notification service to your users.
The problem with a WebSocket (or any other "realtime open connection") service is that, by definition, you need to keep the connection to the application open. This means that scaling up is as simple as adding servers behind a load balancer, but scaling down is a pain (because you will kill the connections opened against the servers you remove).
There are ways to work around this problem: have a load balancer that falls the connection back to another server, implement a reconnect system in your application, yada yada yada.
For a simple Notification system, that's not a big deal: your users don't even have to know that they have been disconnected, and pending notifications can be pulled from a notification API when synchronizing with the realtime notification system (that's how Twitter, Twitch, Facebook and most other big platforms handle your notifications).
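As a rough illustration of that reconnect-and-catch-up idea, here is a client-side sketch using the `websockets` and `requests` libraries; the URLs and the `/notifications/missed` API are hypothetical, the point is only that a dropped connection gets recovered transparently.

```python
import asyncio
import json

import requests
import websockets

def handle(event):
    print("notification:", event)  # placeholder for real handling

async def listen(user_token: str):
    last_seen_id = 0
    while True:
        try:
            async with websockets.connect("wss://example.com/notifications") as ws:
                # First, catch up on anything missed while we were disconnected
                # (blocking HTTP call, kept simple for the sketch).
                missed = requests.get(
                    "https://example.com/api/notifications/missed",
                    params={"after": last_seen_id},
                    headers={"Authorization": user_token},
                ).json()
                for event in missed:
                    last_seen_id = max(last_seen_id, event["id"])
                    handle(event)
                # Then consume the realtime stream.
                async for message in ws:
                    event = json.loads(message)
                    last_seen_id = event["id"]
                    handle(event)
        except websockets.ConnectionClosed:
            await asyncio.sleep(1)  # simple backoff before reconnecting
```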
But in some cases, you will need a state that is reliable and retrievable for your user's session (the best example that comes to mind: video game servers).
Designing around these problems can be tricky. For example, how do you maintain a user presence shared between servers without losing data if the server responsible for the created data dies?
I'm not here to answer this question, because it heavily depends on the project, but it's a great example of why it's important to think about scaling in advance (I might write a later article about how we tackled this problem when I was working for Unexpected Studio)!
So, to sum all of this up: the more requests/users we can squeeze into a server, the less the infrastructure will cost.
Finally, all of this leads us to the big question:
Do I have to optimize? Why is my CPU-Time so precious?
I will use request execution time as an example: gaining a few milliseconds on your requests' execution time can lead to a massive increase in the number of requests you can handle, although it highly depends on the configuration of your server.
Let's assume this configuration:
- We are using a single-core server
- All requests are synchronous (this simplifies our calculations)
- Our total CPU-Time is constant and is not affected by environmental factors (for example, if our CPU runs at 1GHz, it stays at exactly 1GHz, which is not the case in the real world, where frequency heavily depends on factors such as the temperature of the CPU)
- Finally, we are only accounting for execution time, and not for all the other things that add to request time (such as network latency, etc.)
So, let's imagine we have a simple REST API. Our requests have an execution time of 300ms.
After load tests, we have determined that the capacity of our server is around 20 requests/s before the service's response time starts degrading.
Which means that one request costs us 1/20th of our CPU-Time per second.
Let's now try the same calculation for a 150ms request:
It's a simple division by 2, so we get 40 requests/s.
Imagine this scenario:
Each user generates 2 requests per second, which means that in the 300ms scenario we can fit 10 users on one server (careful: I am talking about concurrent users within the same second), whereas I could have run 20 users on the same server if my requests took 150ms, and 30 if they took 100ms.
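Here is that arithmetic written out, taking the 20 req/s load-test result at 300ms as the baseline and assuming capacity scales inversely with execution time:

```python
# Capacity of one single-core, synchronous server as a function of
# per-request execution time, anchored on the load-test result above.
BASELINE_MS = 300
BASELINE_CAPACITY = 20      # req/s measured during the load test
REQ_PER_USER = 2            # each concurrent user fires 2 req/s

def capacity(exec_time_ms: float) -> float:
    return BASELINE_CAPACITY * BASELINE_MS / exec_time_ms

for ms in (300, 150, 100):
    cap = capacity(ms)
    print(f"{ms}ms -> {cap:.0f} req/s -> {cap / REQ_PER_USER:.0f} concurrent users per server")
# 300ms -> 20 req/s -> 10 users
# 150ms -> 40 req/s -> 20 users
# 100ms -> 60 req/s -> 30 users
```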
Let's assume a server instance costs 100$/month (about 0.14$/hour), and you want to be able to handle 1k concurrent users (for scale: one of the biggest Fortnite CCU peaks happened in 2018 at 3.4M CCU, and they got database problems, source: https://www.epicgames.com/fortnite/fr/news/postmortem-of-service-outage-at-3-4m-ccu?lang=fr). I am talking about concurrent users at a given instant, so this cannot be used to compute costs for a whole month, but let's run the calculations for one hour with 1k CCU generating 2 requests/s, with the following request execution times:
Users are generating a total of 2,000 requests/s.
At 300ms response time: We would need at least 100 servers, which would cost us ~14$ per hour (around 10k$ per month)
At 150ms response time: We would need at least 50 servers, which would cost us 7$ per hour (around 5k$ per month)
And at 100ms response time: We would need at least 34 servers, which would cost us 4.76$ per hour (around 3.5k$ per month)
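A quick way to reproduce these figures (2,000 req/s total, servers billed at 100$/month, about 0.14$/hour):

```python
import math

TOTAL_REQ_PER_S = 1000 * 2        # 1k users at 2 req/s each
SERVER_MONTHLY = 100              # $
SERVER_HOURLY = 0.14              # $

for ms, per_server in ((300, 20), (150, 40), (100, 60)):
    servers = math.ceil(TOTAL_REQ_PER_S / per_server)
    print(f"{ms}ms: {servers} servers, {servers * SERVER_HOURLY:.2f}$/hour, ~{servers * SERVER_MONTHLY}$/month")
# 300ms: 100 servers, 14.00$/hour, ~10000$/month
# 150ms: 50 servers, 7.00$/hour, ~5000$/month
# 100ms: 34 servers, 4.76$/hour, ~3400$/month
```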
These are simple calculations, but they are often forgotten by backend developers, and they impact the total cost of your infrastructure. And I am only accounting for the requests here; much more is needed for a complete cost calculation (bandwidth, services used, etc.).
For fun, let's test the price with a Lambda-based infrastructure:
In the EU, Lambda pricing is 0.60$ per million calls and 0.0000166667$ per GB-second (which is a fancy way to describe the amount of memory used by your Lambda per second). Oh, by the way, memory footprint can have a huge impact on your Lambda pricing, be aware of that!
Amount of requests generated by 1k users during one hour (@2req/s): 7,200,000 requests.
Total execution time consumed by these requests during one hour: 2,160,000,000ms (at 300ms), 1,080,000,000ms (at 150ms), 720,000,000ms (at 100ms).
In my example, I am assuming that my Lambda is provisioned with 512 MB of memory, which costs 0.0000008333$ per 100ms consumed (source: https://aws.amazon.com/fr/lambda/pricing/)
Let's now run our calculations:
300ms: Requests call cost: 4.32$ per hour, Requests runtime cost: 17.99928$ per hour, total: 22.31928$ per hour, ~16,293$ per month
150ms: Requests call cost: 4.32$ per hour, Requests runtime cost: 8.99964$ per hour, total: 13.31964$ per hour, ~9,723$ per month
100ms: Requests call cost: 4.32$ per hour, Requests runtime cost: 5.99976$ per hour, total: 10.31976$ per hour, ~7,533$ per month.
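These Lambda figures can be reproduced the same way; the sketch below uses the prices quoted above and assumes a 730-hour month to match the monthly totals:

```python
REQ_PER_HOUR = 1000 * 2 * 3600              # 7,200,000 requests per hour
CALL_PRICE = 0.60 / 1_000_000               # $ per call
RUNTIME_PRICE_PER_100MS = 0.0000008333      # $ per 100ms at 512 MB
HOURS_PER_MONTH = 730

for ms in (300, 150, 100):
    call_cost = REQ_PER_HOUR * CALL_PRICE
    runtime_cost = REQ_PER_HOUR * (ms / 100) * RUNTIME_PRICE_PER_100MS
    hourly = call_cost + runtime_cost
    print(f"{ms}ms: {hourly:.5f}$/hour, ~{hourly * HOURS_PER_MONTH:,.0f}$/month")
# 300ms: 22.31928$/hour, ~16,293$/month
# 150ms: 13.31964$/hour, ~9,723$/month
# 100ms: 10.31976$/hour, ~7,533$/month
```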
Of course, you will need to adapt your strategy to your budget: are you willing to lose money on the infrastructure rather than on development costs, because the latter would be even bigger? That's your own choice! In general, it will depend on the longevity of your service: in most cases, for long-living services, it's worth limiting infrastructure costs rather than dev costs.
The Web-Queue-Worker pattern
Okay Alex, I understand that I have to be careful with my CPU-Time consumption, but what can I do if I can't optimize my code any further and I'm still getting 300ms response times on my endpoint? :(
Well, use the Web-Queue-Worker pattern.
The idea is simple: to avoid slowing down your user-facing API, you use a queue linked to a worker that does the heavy lifting instead of your API.
This provides a very low response time for your users, and they can retrieve the result once it has been processed by the worker! (In fact, it's a common rule to state that if an API endpoint consumes more than X ms, we will use this pattern (cf. Google).) Furthermore, queued tasks are even simpler to scale, and too many tasks will not cause a service outage: they will just wait.
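To make the pattern concrete, here is a minimal sketch using Flask and Redis as the queue (the endpoints, the `jobs` list and `do_heavy_work` are invented for the example; in production you would more likely reach for SQS, RabbitMQ, Celery and friends):

```python
import json
import uuid

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
broker = redis.Redis(host="redis", port=6379, db=0)

def do_heavy_work(params):
    # Placeholder for the slow 300ms+ computation.
    return {"report": "done", "params": params}

@app.post("/reports")
def create_report():
    # The API only enqueues the job and answers immediately with a job id.
    job_id = str(uuid.uuid4())
    broker.rpush("jobs", json.dumps({"id": job_id, "params": request.get_json()}))
    return jsonify(job_id=job_id), 202

@app.get("/reports/<job_id>")
def get_report(job_id):
    # The user polls (or gets notified) to retrieve the result later.
    result = broker.get(f"result:{job_id}")
    if result is None:
        return jsonify(status="pending"), 202
    return jsonify(status="done", result=json.loads(result))

def worker():
    # Runs in a separate process: this is where the heavy load actually happens.
    while True:
        _, raw = broker.blpop("jobs")
        job = json.loads(raw)
        broker.set(f"result:{job['id']}", json.dumps(do_heavy_work(job["params"])))
```

The API stays fast because it only ever enqueues; if the workers fall behind, tasks simply pile up in the queue and wait, exactly as described above.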
A very good article about this pattern, provided by Azure, can be found here: https://github.com/MicrosoftDocs/architecture-center/blob/master/docs/guide/architecture-styles/web-queue-worker.md
Know the limitations of your technologies!
Do not forget that most of the technologies you use as a Backend Dev (especially data-oriented technologies such as MariaDB, PostgreSQL, MongoDB, Cassandra, ElasticSearch, etc.) have their own limitations!
I have a real-life example of this kind of mistake to share:
I cannot say much about this experience, because I'm under NDA, but I can provide some context.
I was assigned to a project that had been built by external developers. The project needed profile matchmaking (where you have to find similarities between profiles, like how Tinder finds the best profile for you (Tinder uses Elasticsearch for profile matching, by the way)), and the developers working on the project had decided to use MariaDB to store and search for similarities for each matchmaking request (we are talking about checking 20+ variables for similarity).
Of course, if I'm talking about it here, it's because it was a huge mistake: the maximum throughput we could obtain with the complete infrastructure was about ~4 req/s, even when adding new servers! And it was heavily related to the way MariaDB handles requests and to the complexity of these requests.
As an example, each request was taking 8 SECONDS of execution time (while not consuming that much CPU: it was mostly due to row locking, with a total CPU usage of 10%). In our load test, the lock lasted that long because each request compared 15k entries for similarity (even with a massive amount of indexing).
The solution was simple: we reworked the code to use ElasticSearch (which is a search engine) instead of MariaDB. The results were stunning: instead of 8 seconds to find a result among 15k entries, we were around 30ms.
We even tested with 100k, 500k and 1M entries and still stayed under 300ms of execution time (and it highly depended on the number of ElasticSearch nodes we had provisioned! Adding nodes had a huge impact on response time, especially as more entries were stored).
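I obviously can't show the real queries, but to give an idea of the shape of the replacement, here is a rough sketch using the Elasticsearch Python client (8.x-style API); the index and field names are invented, and the real matching logic was far more involved:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def find_similar_profiles(profile: dict, size: int = 10):
    # Each attribute becomes an optional clause; the more clauses a stored
    # profile matches, the higher its relevance score.
    should = [{"match": {field: value}} for field, value in profile.items()]
    return es.search(
        index="profiles",
        query={"bool": {"should": should, "minimum_should_match": 1}},
        size=size,
    )["hits"]["hits"]

hits = find_similar_profiles({"city": "Paris", "age": 27, "interest": "climbing"})
```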
Be aware of the technologies that exist, and don't focus on only a few technologies to answer a problem. Each technology has its uses, and it's important to stay informed about them. It's one of a developer's tasks to monitor and discover technologies that could help them do a better job!
Conclusion:
If we were in a perfect world, I would say "think your code optimized from the beginning". But that's not the case, and most of the time, even with experience and optimization in mind, optimizing is time consuming.
Some patterns provide solutions for long-running requests (such as Web-Queue-Worker) and can simplify your thinking process. Design patterns are designed to answer a specific problem; knowing them will save you a massive amount of time, and your code will be easily understood by any other developer who knows them too. The Microservices pattern is a good example for answering the problems discussed in this article (because each service can wait asynchronously for another's execution, and doesn't use any CPU time while waiting).
Of course, my calculations and examples here are really simple; in the real world, many more variables must be taken into account. That's part of a Backend Developer's job (or, even better, a Solution Architect's, but most of the time the Backend Dev is also the Solution Architect).
Just keep in mind, as a developer, that optimization (or the lack of it) and your choice of technologies can have a huge impact on the final cost of your infrastructure.
Happy Coding!