NullPointerException under load and Performance Engineering
Not every performance issue requires a complex solution; sometimes a simple hack can solve a complex performance issue.
A NullPointerException (NPE) is commonly observed during unit and functional tests, and its root cause is relatively easy to find because the stack trace identifies the method that caused it, with the exact line number.
But observing an NPE under load for a random data set, and diagnosing its root cause, can turn into a nightmare for a Performance Engineer.
This article is about one such experience of mine while working on a performance issue.
Application Background
The application under test (App A) was a Java-based web application running in a Tomcat container. Profile creation was one of the business-critical flows to be tested. During this flow, App A makes a call to another application (App B) for some data validation.
Issue Background
About 10% of profile creations failed under load, and the application log showed that the failures were due to a NullPointerException.
Root Cause Analysis
App A creates a session as soon as a user logs in and stores some information that is used across the multiple steps (navigations) involved in profile creation.
A HashMap was used to hold specific information, and it was stored in the session. The NPE was observed when App A code tried to get the value for one specific key during the flow, and all of the failed transactions (the ~10%) failed while getting that value.
App A was making a call to App B during the flow, and this HashMap was passed as part of the call. The value of one of the HashMap keys became "null" just after the call to App B. Debugging the App B code that receives the HashMap showed no manipulation of the HashMap, and in particular none of the key whose value was becoming "null".
In this case, the HashMap itself did not become null; only one specific key’s value became "null". The other keys of the HashMap were not affected.
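To make the symptom concrete, here is a minimal sketch of a map in which one value has become null while the map and its other keys remain intact. The key names and the `addressLength` helper are hypothetical; the article does not name the real keys or show the real App A code.

```java
import java.util.HashMap;
import java.util.Map;

class NullValueDemo {
    // Hypothetical keys, chosen only for illustration.
    static final String ADDRESS_KEY = "address";
    static final String NAME_KEY = "name";

    // Mimics App A reading one specific value and then using it.
    static int addressLength(Map<String, Object> sessionData) {
        // get() returns null for a missing key and also for a key mapped to null.
        String address = (String) sessionData.get(ADDRESS_KEY);
        // Dereferencing a null value is what throws the NullPointerException.
        return address.length();
    }

    public static void main(String[] args) {
        Map<String, Object> sessionData = new HashMap<>();
        sessionData.put(NAME_KEY, "Jane");
        sessionData.put(ADDRESS_KEY, "12 Main St");

        // Simulate the observed state: one value becomes null, the map itself does not.
        sessionData.put(ADDRESS_KEY, null);

        System.out.println("key still present: " + sessionData.containsKey(ADDRESS_KEY));
        System.out.println("other key intact: " + sessionData.get(NAME_KEY));
        try {
            addressLength(sessionData);
        } catch (NullPointerException e) {
            System.out.println("NPE when the null value is used");
        }
    }
}
```

Note that `containsKey` still reports the key as present, which matches the observation that only the value, not the key or the map, was affected.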
An interesting observation was that creating the profile manually, or rerunning the test, with the same data that had hit the NPE during a load test succeeded.
NPEs were observed only when more than 10 concurrent users were executing flows. Tests with a single user or fewer than 10 users did not reproduce the error, and a 100% pass rate was observed.
Debugging became very challenging because the issue could not be reproduced during single-user profiling or under lower load.
The only clue we had was that the failures due to NPEs were always observed just after the App A to App B call.
Solution
On hearing this problem, the first solution that comes to mind is adding a null check for the particular key. But the object that was becoming null was required to complete the profile creation. Adding the null check would have made the flow fail gracefully, but the percentage of failures would have remained the same.
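A rough sketch of why the null check alone does not help, assuming a hypothetical `createProfile` step and key name (neither comes from the real application): the NPE disappears, but the profile still cannot be created.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class NullCheckDemo {
    // Hypothetical key for the value that profile creation depends on.
    static final String REQUIRED_KEY = "requiredValue";

    // Returns a profile id, or an empty Optional when the required value is missing.
    static Optional<String> createProfile(Map<String, Object> sessionData) {
        Object required = sessionData.get(REQUIRED_KEY);
        if (required == null) {
            // No exception is thrown, but the flow still fails for this user.
            return Optional.empty();
        }
        return Optional.of("profile-for-" + required);
    }

    public static void main(String[] args) {
        Map<String, Object> sessionData = new HashMap<>();
        sessionData.put(REQUIRED_KEY, null); // the state seen after the App B call
        System.out.println("profile created: " + createProfile(sessionData).isPresent());
    }
}
```

The guard changes how the failure is reported, not how often it happens.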
The second solution that comes to mind is using a ConcurrentHashMap to make this HashMap thread-safe, on the assumption that some other thread might have modified it while the current thread was waiting for App B. Changing the regular HashMap to a ConcurrentHashMap might not have solved the issue, because once the application picks up a user request, profile creation is done sequentially and no parallelism or multi-threading is involved. So using a ConcurrentHashMap could hurt performance where no synchronization is required.
After racking my brain for some time and reviewing the App A and App B code, I decided to apply a very simple hack to fix the problem.
I found that App B, which received the HashMap, was not even reading or using the particular key whose value was becoming "null" after the call.
Repeated debugging also confirmed that the value of the problematic key was never "null" before calling App B.
So I simply made a copy of the problematic key’s value just before the App B call and restored it just after the call. This resolved the issue without breaking any functionality, and no failures were observed after that.
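The workaround can be sketched roughly as follows. This is an illustration under assumptions, not the actual fix: the key name, the `callAppBPreservingValue` helper, and the `Consumer` that stands in for the App B call are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class PreserveValueDemo {
    // Hypothetical name for the key whose value was becoming null.
    static final String PROBLEM_KEY = "problemKey";

    // Saves the value before the call and puts it back afterwards.
    static void callAppBPreservingValue(Map<String, Object> sessionData,
                                        Consumer<Map<String, Object>> appBCall) {
        Object saved = sessionData.get(PROBLEM_KEY); // copy just before the call
        appBCall.accept(sessionData);                // stand-in for the App A to App B call
        sessionData.put(PROBLEM_KEY, saved);         // restore just after the call
    }

    public static void main(String[] args) {
        Map<String, Object> sessionData = new HashMap<>();
        sessionData.put(PROBLEM_KEY, "needed-later");

        // The stand-in call nulls the value to mimic the observed symptom.
        callAppBPreservingValue(sessionData, map -> map.put(PROBLEM_KEY, null));

        System.out.println("value after call: " + sessionData.get(PROBLEM_KEY));
    }
}
```

This only copies the reference, which is enough when the value itself is not modified by the call. It is safe here because App B was confirmed not to read or use that key.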
Happy Engineering!