Enterprise Java - Virtualisation Essential Practices
The Virtualisation Problem
In most virtualisation solutions (including VMware), several memory management techniques are employed to dynamically reduce the amount of physical memory required for each virtual machine (enabling memory over-commitment in the host OS etc). Even when the host isn't over-committed, these techniques (such as ballooning) are still active. This can be a great efficiency technique for many low-to-medium server load requirements. However, it is not optimal for Java - which is very memory sensitive.
Image Source: http://developer.amd.com/resources/java-zone/java-zone-archive/optimizing-java-performance-in-a-virtual-machine-environment/
As Java runs in its own virtual machine (the JVM), it will exhibit performance problems if anything slows down its memory operations, especially garbage collection (GC), or if its memory (both the heap and the overall process) is incorrectly sized. For example, memory over-commitment causes more memory pages to move between physical RAM and disk, or between sections of memory in the host and guest OS. When this occurs, Java's performance is likely to degrade by an order of magnitude. Anything that introduces even nominal delays in how quickly Java can manage the heap will dramatically change its potential throughput. Additionally, the typical GC configuration of a JVM is designed to be as passive as possible (limited threads etc., especially with regard to minor collections), so only a fraction of the available CPU is used until the JVM is forced to repeatedly attempt full collections.
When the total of the committed heap and other memory comes close to the VM's vRAM setting, the operating system resorts to page swapping, resulting in possible performance degradation. It is, therefore, important to size the VM to accommodate the sum of the maximum committed heap size and the "other" memory.
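As a rough way to see the two parts of that sum from inside a running JVM, the following sketch reads the heap and non-heap figures from the standard `java.lang.management` API. The class and method names are illustrative only, and the result should be treated as a lower bound: the non-heap figure reported by the JVM does not include thread stacks, direct buffers or other native allocations.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryFootprint {
    static final long MIB = 1024L * 1024L;

    /** Maximum heap plus committed non-heap memory, in whole MiB (a lower bound on process size). */
    public static long lowerBoundMiB(long heapMaxBytes, long nonHeapCommittedBytes) {
        return (heapMaxBytes + nonHeapCommittedBytes) / MIB;
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        // getMax() returns -1 when no maximum is defined; fall back to the committed size.
        long heapCeiling = heap.getMax() >= 0 ? heap.getMax() : heap.getCommitted();

        System.out.printf("Heap committed:     %d MiB%n", heap.getCommitted() / MIB);
        System.out.printf("Heap maximum:       %d MiB%n", heapCeiling / MIB);
        System.out.printf("Non-heap committed: %d MiB%n", nonHeap.getCommitted() / MIB);
        System.out.printf("Lower bound on JVM footprint: %d MiB%n",
                lowerBoundMiB(heapCeiling, nonHeap.getCommitted()));
    }
}
```

Comparing the printed lower bound with the VM's vRAM setting gives a first indication of whether the guest has any headroom left for the OS and other processes.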
Java uses a different type of memory management from other typical (native) server processes, and whoever is managing the virtualised infrastructure usually has only a limited understanding of Java, so it can be very difficult to convince them that there is a performance issue caused by the VM configuration. This is understandable, as all of this activity is essentially hidden from them and they won't have any obvious metrics to indicate that there is a problem.
Common Java Performance Symptoms
Modern JVMs are fast, very fast - reaching and, in some cases, exceeding the performance of native applications written in languages like C. If your application is not fast, there is probably something wrong with your implementation.
java.lang.OutOfMemoryError: Java heap space
This is the most common memory error in Java - often just the result of an inappropriately sized (too small) heap or an application memory "leak". Obviously, neither of these issues is caused by virtualisation. Instead, we are interested in this error when the logs indicate a significant amount of GC activity - where the OOME has been caused by the JVM's inability to free up unused objects from the heap quickly enough (but where the application can otherwise start and operate normally with the current heap size).
Interestingly, this can occur before the JVM reaches the prescribed maximum heap setting. It is typically caused by the JVM being unable to obtain committed memory from the OS quickly enough to avoid filling the current heap space. This lazy commitment can be an issue for Java (and is one reason for not setting up regular restarts of your applications). VMs can also lazy-commit guest OS memory - compounding the issue.
The JVM heap can vary its current heap size between two preconfigured memory boundaries - the initial heap size (defined by the -Xms option) and the maximum heap size (defined by the -Xmx option). The operating system, however, will commit memory to the heap lazily as it is used, so this results in a number of different memory metrics that must be considered.
Quote Source: http://pubs.vmware.com/vfabric52/index.jsp?topic=/com.vmware.vfabric.em4j.1.2/em4j/conf-heap-management.html
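To make the gap between the current and maximum heap sizes visible, here is a minimal sketch using the standard `Runtime` API. The class name and the `uncommitted` helper are made up for illustration; the figures describe what the JVM has asked for, not what the OS or hypervisor has actually backed with physical pages.

```java
public class HeapBounds {
    /** Heap the JVM may still have to request from the OS: maximum minus currently committed. */
    public static long uncommitted(long committedBytes, long maxBytes) {
        return Math.max(0L, maxBytes - committedBytes);
    }

    public static void main(String[] args) {
        final long mib = 1024L * 1024L;
        Runtime rt = Runtime.getRuntime();

        long committed = rt.totalMemory();       // current heap size
        long max = rt.maxMemory();               // upper boundary (-Xmx); Long.MAX_VALUE if unlimited
        long used = committed - rt.freeMemory(); // portion of the current heap holding objects

        System.out.printf("Used:        %d MiB%n", used / mib);
        System.out.printf("Current:     %d MiB%n", committed / mib);
        System.out.printf("Maximum:     %d MiB%n", max / mib);
        System.out.printf("Still to be requested under load: %d MiB%n",
                uncommitted(committed, max) / mib);
    }
}
```

A large "still to be requested" figure on a busy system means the JVM will be growing its heap at exactly the moment it can least afford to wait for memory.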
Long GC Pause Times
If a major collection ever takes longer than a few seconds, you have big problems. If it ever reaches tens of seconds, you have very big problems. If it exceeds that, you're probably reading this because you've had a recent outage...
I've seen examples where, for a 50 GiB heap under the same load, a different GC configuration reduced the maximum pause time from over 30 seconds to less than 200 ms! The G1 collector has made it easier to get a high-performance configuration out of the box, but it's still no silver bullet.
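A quick way to check whether collections are getting expensive is to read the counters the JVM already keeps for each collector. The sketch below uses the standard `GarbageCollectorMXBean` API; the class and helper names are illustrative. Note that it reports accumulated collection time and a mean per collection, not the maximum pause, and for concurrent collectors the elapsed time is not the same thing as pause time, so GC logs remain the better source for worst-case figures.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseReport {
    /** Mean elapsed time per collection in milliseconds; 0 when nothing has been collected yet. */
    public static double meanMillis(long collections, long totalMillis) {
        return collections > 0 ? (double) totalMillis / collections : 0.0;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 if the collector does not report it
            long millis = gc.getCollectionTime();  // accumulated elapsed time, -1 if undefined
            System.out.printf("%s: %d collections, %d ms total, %.1f ms mean%n",
                    gc.getName(), count, millis, meanMillis(count, millis));
        }
    }
}
```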
Error java.lang.OutOfMemoryError: GC overhead limit exceeded
There is an important distinction between this issue and your typical OOME.
This message means that for some reason the garbage collector is taking an excessive amount of time (by default 98% of all CPU time of the process) and recovers very little memory in each run (by default 2% of the heap). This effectively means that your program stops doing any progress and is busy running only the garbage collection at all time.
Quote Source: http://stackoverflow.com/questions/1393486/error-java-lang-outofmemoryerror-gc-overhead-limit-exceeded
The easiest way to recreate this issue is to create a Java process larger than the available physical RAM, then subject the application to sustained high load - heavily utilising the entire heap allocation.
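The load half of that recipe can be approximated with a small allocation loop that keeps a fixed working set alive while constantly replacing it. This is only a sketch under assumptions: the class name, block size and default arguments are invented here, and the defaults are deliberately harmless. Reproducing the problem would mean raising the retained size (and the `-Xmx` setting) until the process no longer fits in physical RAM, on a test system only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class HeapChurn {
    /**
     * Allocates {@code iterations} blocks of {@code blockBytes}, keeping only the most
     * recent {@code retainedBlocks} reachable. Returns the total number of bytes allocated.
     */
    public static long churn(int retainedBlocks, int blockBytes, long iterations) {
        Deque<byte[]> live = new ArrayDeque<>();
        long allocated = 0;
        for (long i = 0; i < iterations; i++) {
            live.addLast(new byte[blockBytes]); // new arrays are zeroed, so the pages are touched
            allocated += blockBytes;
            if (live.size() > retainedBlocks) {
                live.removeFirst();             // oldest block becomes garbage
            }
        }
        return allocated;
    }

    public static void main(String[] args) {
        // Hypothetical defaults: 16 MiB retained, 1000 iterations of 1 MiB blocks.
        int retainedMiB = args.length > 0 ? Integer.parseInt(args[0]) : 16;
        long iterations = args.length > 1 ? Long.parseLong(args[1]) : 1_000L;
        long allocated = churn(retainedMiB, 1024 * 1024, iterations);
        System.out.printf("Allocated %d MiB while retaining %d MiB%n",
                allocated / (1024 * 1024), retainedMiB);
    }
}
```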
Recommended Solutions
On the face of it, running a (Java) virtual machine inside another virtual machine appears less than optimal. A knee-jerk reaction is often to propose that dedicated physical hardware is used in place of virtualisation (for the Java estate). In actuality, a well configured installation can provide near-physical performance, whilst taking advantage of most of the additional benefits that virtualisation offers.
VMware is very aware of the performance implications of running a misconfigured host, so has itself published several best practice guides and recommendations:
The host is only able to see the working set of the memory in a VM. Given the nature of Java workloads in using some parts of the heap very intensely and other parts of the heap sporadically, it can be difficult for the hypervisor to judge how much memory is really free in a VM running Java
Quote Source: http://pubs.vmware.com/vfabric52/index.jsp#em4j/conf-ballooning-vs-right-sizing.html
These points are the general recommendations from VMware; I've added some emphasis:
- Size the virtual machine memory to leave adequate space for the Java heap, the other memory demands of the Java virtual machine code and stack, and any other concurrently executing process that need memory from the guest operating system.
- Set the memory reservation value in the VMware Client to the size of memory for the virtual machine. As any type of Memory Swapping (physical or virtual) is detrimental to performance of JVM heap especially for Garbage Collection.
- Determine the optimal number of virtual CPUs for a virtual machine that hosts a Java application by testing the virtual machine configured with different numbers of virtual CPUs at different times with the same load.
- If you are using multiple Garbage Collector threads in your JVM, match the number of those threads to the number of virtual CPUs that are configured in the virtual machine.
- For easier monitoring and load balancing, use one JVM process per virtual machine.
- If your ESX host is over-committed, ensure that the Balloon Driver is running within the virtual machine so that memory is optimally managed.
Bullet Points: VMware Knowledge Base: Best practices for running Java in a virtual machine
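For the recommendation about matching GC threads to virtual CPUs, the guest's CPU count as seen by the JVM is available from `Runtime.availableProcessors()`. The sketch below simply turns that number into HotSpot's `-XX:ParallelGCThreads` option; the class and method names are illustrative, and whether a one-to-one match is right for a given workload is something to confirm by the load testing described in the list above.

```java
public class GcThreads {
    /** Builds the HotSpot option that pins the number of parallel GC threads to the vCPU count. */
    public static String gcThreadFlag(int vcpus) {
        return "-XX:ParallelGCThreads=" + Math.max(1, vcpus);
    }

    public static void main(String[] args) {
        // Number of processors the JVM can see inside the guest OS.
        int vcpus = Runtime.getRuntime().availableProcessors();
        System.out.println("vCPUs visible to the JVM: " + vcpus);
        System.out.println("Candidate JVM option:     " + gcThreadFlag(vcpus));
    }
}
```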
Essentially this means calculating the maximum process size of the Java application running on the guest OS (note that this is likely to be significantly higher than the maximum heap allocation - a 50% increase is a good rule of thumb), then adding the memory requirement for all the other (non-Java) processes that will be running on the OS, then reserving that amount of physical RAM from the host. In practice, this will probably result in the VM reservation matching the guest OS sizing:
Image Source: http://pubs.vmware.com/vfabric52/topic/com.vmware.vfabric.em4j.1.2/em4j/images/memdiagram.png
Sizing your JVM heap and guest OS memory requirements is absolutely crucial - and should always be validated before spending time searching for problems elsewhere.
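The sizing arithmetic described above can be written down as a small worked example. It applies the 50% rule of thumb to the maximum heap and then adds everything else running in the guest. The class name, method name and the 8 GiB / 2 GiB figures are hypothetical values chosen only to show the calculation.

```java
public class VmSizing {
    /**
     * Estimated reservation for the VM in MiB: the Java process (maximum heap plus 50%)
     * plus the memory needed by the OS and all non-Java processes.
     */
    public static long reservationMiB(long maxHeapMiB, long otherMiB) {
        long javaProcessMiB = maxHeapMiB + maxHeapMiB / 2; // heap + 50% rule of thumb
        return javaProcessMiB + otherMiB;
    }

    public static void main(String[] args) {
        long maxHeapMiB = 8192; // hypothetical -Xmx8g
        long otherMiB = 2048;   // hypothetical allowance for the OS and other processes
        System.out.printf("Reserve at least %d MiB for this VM%n",
                reservationMiB(maxHeapMiB, otherMiB));
    }
}
```

With these example inputs the Java process is estimated at 12288 MiB, giving a reservation of 14336 MiB.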
Fortunately, VMware allows us to set the reserved memory (memres) and memory limit (memlimit) configuration options. I'd recommend that the reservation matches the limit. Even if only 5% of the guest OS's memory allocation is swapped, the full extent of the problem will become apparent as soon as the guest OS pushes part of the JVM process into that area under load.
Best Practice Example
Ignore the reference to the Perm Gen space if you're using a modern JVM, but the principle applies regardless:
Image Source: VMware: Enterprise Java Applications on VMware Best Practices Guide
Essentially, this means sizing the guest memory for the entirety of the requirements from any Java processes (not just the heap), plus any other system or application processes and the memory required by the OS itself. Ideally, reserve the memory for the guest VM - don't over-commit (share) physical memory, and consider disabling memory ballooning. Also, don't forget to allow enough vCPUs for Java to perform GC efficiently. Allow Java to manage its memory elastically - don't force the containing VM to try to keep up.