How to solve your performance problem without increasing the capacity!!


When a performance problem hits production, one of the most tempting and immediate actions is to increase the capacity, either vertically or horizontally or both.

That's when we, aka Performance Engineers (PE), need to jump in and lead the group in the right direction.

Now, what is this RIGHT direction?

Hasn't the problem already been managed and smartly handled by the production engineers?

Not only have they kept the system up and running, they have even future-proofed it by adding capacity for contingencies.

And here are these few PE guys calling out for the RIGHT direction. Why?

Yeah!! It's a sin to do RCA when the production team has all the power to inflate the bills in the name of saving the day.

No one is wrong!! At least not entirely.

As an immediate action, the production team got the production system running by adding more capacity, but that should not be where the PE guys and the application team stop.

What could be done

Well, now the problem has to be approached top down, instead of the bottom-up approach we follow during PE regression in agile app development and release.

  1. Make the problem statement clear.
  2. Collect investigative evidence, i.e. gather diagnostic data from production.
  3. Dig into the logs.
  4. Check profiler or APM data to identify the bottleneck. Never make the mistake of assuming that the problem lies at the same point where capacity has been added.
  5. Alongside, get the code execution and query execution profile for an idle to low load period.
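As a small illustration of steps 2 and 3, here is a minimal sketch that pulls latencies out of log lines and computes a 95th percentile per endpoint. The log format (`<timestamp> <endpoint> <latency>ms`) and the function names are assumptions made for this example; real production logs will need their own pattern.

```python
import re
from collections import defaultdict
from statistics import quantiles

# Hypothetical log format, assumed for illustration only:
#   "<timestamp> <endpoint> <latency>ms"
LOG_LINE = re.compile(r"^(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)ms$")


def latency_by_endpoint(lines):
    """Group latency samples (in ms) by endpoint, skipping lines that do not match."""
    samples = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            samples[match.group(2)].append(float(match.group(3)))
    return samples


def p95(values):
    """95th percentile of the samples (a single sample is returned as-is)."""
    if len(values) < 2:
        return values[0]
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return quantiles(values, n=20, method="inclusive")[18]


if __name__ == "__main__":
    log = [
        "2024-01-01T10:00:00 /checkout 120ms",
        "2024-01-01T10:00:01 /checkout 80.5ms",
        "2024-01-01T10:00:02 /search 15ms",
    ]
    for endpoint, values in latency_by_endpoint(log).items():
        print(endpoint, p95(values))
```

Running the same script over a log slice from the event and a slice from a quiet period gives two numbers per endpoint that can be compared directly.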



Back to basics:

Ascertain two things first:

  1. The code profile shows similar latency during the event and otherwise, i.e. in the idle to low load phase.
  2. Query performance also remains more or less the same during the event and on any low load business day.
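The two checks above can be sketched as a simple comparison of medians. This is only an illustration: the 25% tolerance, the function names and the sample numbers are assumptions, and in practice you would choose the statistic and threshold that fit your system.

```python
from statistics import median


def is_similar(baseline_ms, event_ms, tolerance=0.25):
    """True when the event median is within `tolerance` (a fraction) of the baseline median."""
    base = median(baseline_ms)
    event = median(event_ms)
    return abs(event - base) <= tolerance * base


def back_to_basics(code_baseline, code_event, query_baseline, query_event):
    """Run both basic checks: code latency and query latency, low load vs. event."""
    return {
        "code_latency_stable": is_similar(code_baseline, code_event),
        "query_latency_stable": is_similar(query_baseline, query_event),
    }


if __name__ == "__main__":
    # Made-up samples in milliseconds, for illustration only.
    result = back_to_basics(
        code_baseline=[10, 11, 12], code_event=[10, 12, 13],
        query_baseline=[5, 5, 6], query_event=[20, 25, 30],
    )
    print(result)
```

With these made-up samples the code latency counts as stable while the query latency does not, which would point the investigation at the database side first.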

There is hardly any chance that both of these basics stay intact during the event.

You also have to assess whether the code and query latency under low load is acceptable in the first place. There is a good chance that the load tests failed to simulate actual prod behaviour.

Here, you have your first action item: work on the code and query optimization.

As the next step, explore the problems that occur due to increasing concurrency.

At this step, you should look at factors other than code and query latency.

With increasing load, wait events, queuing and coherence start to act up, but adding capacity is not the only solution. In fact, adding capacity should be the last option.
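To get a feel for why queuing acts up under load, here is a rough sketch using the textbook single-server (M/M/1) queue, where mean response time is service time divided by (1 - utilization). Treating a real system as one M/M/1 queue is a strong simplification, and the traffic numbers below are invented for illustration.

```python
def mm1_response_time_ms(requests_per_second, service_time_ms):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho), with rho = arrival rate * S."""
    utilization = requests_per_second * service_time_ms / 1000.0
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_ms / (1.0 - utilization)


if __name__ == "__main__":
    # Same assumed load of 90 requests/second, before and after a 20% code/query optimization.
    print(mm1_response_time_ms(90, 10.0))  # service time 10 ms, 90% utilization
    print(mm1_response_time_ms(90, 8.0))   # service time 8 ms, 72% utilization
```

In this model, trimming the service time from 10 ms to 8 ms at the same load drops the mean response time from about 100 ms to about 29 ms, with no capacity added.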

Potential Performance Optimizations :

  1. GC tuning: Look at the GC options provided by your current JDK version and consider newer GC algorithms such as G1GC or ZGC.
  2. Connection pool optimization: You might have added more nodes, but connection pool optimisation can help execute the same workload without the need for more nodes.
  3. DB tuning: DB tuning here is not about changing the query design. There are a lot of configuration options, such as adding or optimising indexes and sizing memory areas like the shared pool and buffer area, as well as checkpoint frequency etc.
  4. Network tuning: Data compression is one of the best examples of bandwidth tuning for better performance on the same workload.
  5. Tuning the caching mechanism: Distributed systems heavily leverage caches in various ways, so it's always fruitful to look into the cache configuration and optimize cache size, object TTL etc.
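Taking the connection pool item as an example, a first estimate of the pool size can come from Little's Law: connections in use is roughly throughput multiplied by the time each request holds a connection. The sketch below is only a starting point; the function name, the 25% headroom and the traffic numbers are assumptions, and the result should be validated against real measurements.

```python
import math


def required_connections(requests_per_second, hold_time_ms, headroom=0.25):
    """Estimate pool size via Little's Law (L = lambda * W) plus a safety headroom."""
    in_use = requests_per_second * hold_time_ms / 1000.0
    return math.ceil(in_use * (1 + headroom))


if __name__ == "__main__":
    # Assumed workload: 200 requests/second, each holding a connection for 50 ms.
    print(required_connections(200, 50))
    # Same workload after query optimization halves the hold time to 25 ms.
    print(required_connections(200, 25))
```

The estimate makes the trade-off visible: shortening the time a connection is held reduces the connections needed for the same workload, which is often cheaper than adding nodes.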

I have tried to jot down a potential path for any kind of performance problem, but did not go into the details of any of the steps. The purpose and intent is only to share my conviction and gut feeling: optimize code, queries and the system before adding more capacity.

Thank you and happy reading. I hope you like and appreciate the thought process.


And auto-scaling makes the scene look very nice, but it keeps adding to the bills.

Every PE guy faces similar challenges in the cloud world; on-call guys just increase the capacity and mitigate the issues. I agree with your top-down approach and monitoring prod. We can't really do a prod-like load test and reproduce the issue many of the times. Great write-up, Prateek Jain 👏 👍

I can relate to this. The actual work starts after performance test execution; capacity is just one angle which can go wrong. Well written, Prateek Jain
