How to Solve Your Performance Problem Without Increasing the Capacity!
When a performance problem hits production, one of the most tempting and immediate actions is to increase the capacity, either vertically, horizontally, or both.
That's when we, aka Performance Engineers, need to jump in and lead the group in the right direction.
Now, what is this RIGHT direction?
Hasn't the problem already been managed and smartly handled by the production engineers?
Not only have they kept the system up and running, but they have even future-proofed it by adding capacity for contingencies.
And here are these few PE guys calling for the RIGHT direction. Why?
Yeah! It's a sin to do RCA when the production team has all the power to inflate the bills in the name of saving the day.
No one is wrong! At least, not entirely.
As an immediate fix, the production team got the production system running by adding more capacity, but that should not be where the work stops for the PE guys and the application team.
What could be done
Well, now the problem has to be approached top-down, instead of the bottom-up approach we follow during PE regression in agile app development and release.
Back to basics:
Ascertain two things first:
It is unlikely that both of these basics stayed intact during the event.
You have to assess whether the code or query latency under low load is acceptable. There is a good chance the load tests failed to simulate actual prod behaviour.
Here you have your first action item: work on code and query optimization.
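One way to make "acceptable or not" concrete is to summarize latencies captured at low load and compare them with a target. The sketch below is only an illustration: the function names, the sample values, and the 50 ms p95 target are my assumptions, not figures from any real system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def low_load_baseline(samples_ms, target_p95_ms):
    """Summarize low-load latencies and flag whether p95 meets the target.

    samples_ms: latencies in milliseconds, measured with little or no
    concurrency (hypothetical input, e.g. pulled from prod monitoring).
    target_p95_ms: an assumed latency objective for this code path or query.
    """
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    return {"p50_ms": p50, "p95_ms": p95, "acceptable": p95 <= target_p95_ms}


# Made-up samples: mostly fast, with one slow outlier.
samples = [12, 15, 11, 14, 13, 90, 12, 13, 14, 12]
baseline = low_load_baseline(samples, target_p95_ms=50)
print(baseline)
```

If the baseline is already over target with almost no load, no amount of extra capacity will fix it; the code path or query itself needs work.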
Next step: explore the problems caused by increasing concurrency.
At this step, you should look at factors other than code and query latency.
With increasing load, wait events, queuing, and coherence start to act up, but adding capacity is not the only solution. In fact, adding capacity should be the last option.
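To get a rough feel for why queuing acts up under load, and why optimization can beat extra capacity, here is a small sketch using the textbook M/M/1 queue formula, response time = service time / (1 - utilization). Treating a real system as a single M/M/1 queue is a simplifying assumption on my part, and the numbers are made up for illustration.

```python
def mm1_response_time_ms(service_time_ms, arrival_rate_per_s):
    """Mean response time of an M/M/1 queue (assumed model, not a
    measurement): R = S / (1 - utilization), utilization = rate * S."""
    utilization = arrival_rate_per_s * service_time_ms / 1000.0
    if utilization >= 1.0:
        # The queue grows without bound once the server is saturated.
        return float("inf")
    return service_time_ms / (1.0 - utilization)


# Same 10 ms service time, rising load: latency grows far faster than load.
for rate in (50, 80, 90):
    print(rate, "req/s ->", round(mm1_response_time_ms(10, rate), 1), "ms")

# Halving the service time through code or query optimization, at 90 req/s:
print("optimized ->", round(mm1_response_time_ms(5, 90), 1), "ms")
```

Under these assumptions, cutting the service time from 10 ms to 5 ms at 90 req/s drops utilization from 0.9 to 0.45, so the mean response time falls by roughly a factor of ten without adding a single server.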
Potential Performance Optimizations:
I have tried to jot down a potential path for any kind of performance problem, but I didn't go into detail on any of the steps. The purpose is only to share my confidence and gut feeling that you should optimize the code, queries, and system before adding more capacity.
Thank you and happy reading. I hope you like and appreciate the thought process.
And auto-scaling makes the scene look very nice, but it keeps adding to the bills.
Every PE guy faces similar challenges in the cloud world: on-call guys just increase the capacity and mitigate the issue. I agree with your top-to-bottom approach and with monitoring prod. Many times you can't really run a prod-like load test and reproduce the issue. Great write-up, Prateek Jain 👏 👍
I could relate to this: the actual work starts after performance test execution, and capacity is just one angle that can go wrong. Well written, Prateek Jain