VM Tuning with Quest’s PerformaSure and the Java Hotspot VM Options

In this blog entry I am going to speak about a recent tuning exercise on an application running on the [tag]Java Hotspot VM[/tag] 1.4.2. It’s great that developers are adopting and using the latest versions of the JVMs but there are still a hell of a lot of applications out in the wild running on older versions of the JVM.

The application in question is a CRM type application, serving about 200 concurrent users, increasing in bursts to 300+ users, processing approximately 4.5 million customer contacts per year. Do the math, and considering the application is running on a single JVM, the application is processing quite an amount of consistent load over the course of a working day.

Recently it was decided to determine the capacity of the current production application server. The rollout of another considerable application is imminent and the customer felt that it was prudent to determine whether or not the current infrastructure had sufficient capacity to run the new system.

To determine the current stress levels on the production application server, we installed [tag]Quest Software’s PerformaSure 4.4[/tag]; a tool that we have used successfully in the past to determine application bottlenecks and give an invaluable insight into the runtime characteristics our production applications.

Over the course of a week, we monitored the CRM application during the core business hours of 7am -> 6pm. It is worth noting that the users reported that they would routinely experience application slowdown during the course of any given day. PerformaSure provides a number of different browsers that provide different view points into the performance of an application. Normally, if you suspected that application slowdown was code related, you would focus on the Request Browser; this browser will breakdown any recorded request layer by layer and component technology by component technology. However, considering the fact that our focus is on tuning the VM and not the application code, we are going to focus on PerformaSure’s Metric Browser.

There are a number of generated graphs that we are going to look at to get a feel for how garbage collection is being managed, and its associated overhead, before tuning.

This first graph shows the percentage of free heap during the course of the day. At no point do we see the graph hit 0% available, so we can safely assume that any slowdown is not due to loitering objects:

Before Tuning - Percent Free Memory

Next we want to look at the garbage collection rate broken down by generation. Remember that Suns’ JVMs (1.3+) are generational and that the heap is broken into a ‘young generation’ and an ‘old generation’ according to object ‘age’. By default, two different GC algorithms are used to manage each of the two generations; the young generation uses a Copying Collector and the old generation uses a Mark Compact Collector. The specifics of these and other GC algorithms mentioned in this blog can be found on Sun’s excellent Turbo-charging Java HotSpot Virtual Machine article.

Before Tuning - GC Rate broken down by generation

The green line above represents the New Generation GC rate and the pink line represents the Old Generation GC rate. The graph can be read as follows; the x-axis is time and the y-axis is GC/second. In the above case we can see a rate of 300 being reached at one point in the graph. This can be a little confusing. What this means is that, during this time slice, the minor GC rate was .3/second (read as thousandths/second on the graph or 300/1000 of a GC/second). This measurement may change to count/second if your GC rate is greater than 1 minor GC/second.

Okay, what can we see from this graph? Well, remember our customer indicated that they sometimes experience slowdown in the application. During this particular day, the slowdown occurred at approximately 14:30 local time. You can use the Zonar functionality in PerformaSure to zoom into this particular time frame, but we can clearly see a radical change in Old Generation GC activity in, or around, this time period.

How can we tell how much impact this increased Old Generation GC activity had on the application? Well, if we correlate the data that we now know with another PerformaSure graph,; the GC Overhead graph, we can determine the impact of the increased GC activity:

Before Tuning GC Overhead

We can clearly see the overhead increasing in tandem with the Old Generation GC rate, eventually reaching a level of 100% overhead. Referring the the Active Session Count graph (see below) we can see that the increase in GC activity and the associated overhead actually coincides with the coming online of another user community, increasing the concurrent number of active sessions from approximately 200 to 250 and depending on circumstances and workload, sometimes 300 sessions.

Before Tuning - Active Session Count

Now that we have diagnosed the cause of the slowdown, what can we do to fix it? Let’s start by looking at the JVM settings:

-Xms512M -Xmx1408M -XX:MaxPermSize=128M

Nothing out of the ordinary here. The application is simply allocated 512M of Heap on startup, growing to a maximum of 1.4Gb with a Permanent Generation Size of 128M (the Permanent Gen is not of concern here so we won’t be tweaking this value). As we confirmed earlier, the application does not have, or exhibit the signs of, loitering objects. So for now, the amount of memory allocated is not of concern. Of course it could be tweaked, but lets get the application performing consistently before we go doing that.

What we want to focus on are the [tag]Garbage Collectors[/tag] used by the VM. I mentioned earlier that the New Gen uses a Copying Collector and the Old Gen uses a Mark Compact Collector. So lets play around with these and see what the results are. Now, it doesn’t make sense, or is it wise, to blindly make changes to a VM’s GC model, so put some thought into any settings, and have an idea of their desired impact, before you apply them. The actual impact can later be confirmed using PerformaSure.

The graphs that we have referred to are only a sample from about two weeks worth of recordings. So based on these and other recordings, I decided to change/add the following to the VM options:

-Xms1408M -Xmx1408M -XX:NewSize=448M -XX:MaxNewSize=448M -XX:SurvivorRatio=128 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=0 -XX:CMSInitiatingOccupancyFraction=60 -XX:MaxPermSize=128M

That’s quite a lot of VM options. So what have we done here?

Firstly, we’ve set the min and max values for the overall heap and New Generation heap to the same value; -Xms1408M -Xmx1408M -XX:NewSize=448M -XX:MaxNewSize=448M. This means that the VM will not consume extra system resources when it needs to scale up the amount of heap allocated. This is not strictly necessary for what we are trying to achieve but is good practice in a production system.

Next we specify our Survivor Ratio; -XX:SurvivorRatio=128. This tells the VM to allocate a very, very small amount of New Gen heap to the To and From Survivor Spaces. This option ties in with the -XX:MaxTenuringThreshold=0 option; this option controls the copying frequency of objects between the survivor spaces before being promoted to the Old Generation. Together, these have the net effect of, if after a minor GC, an object is still live, it will automatically get promoted to the Old Generation.

Next up, we specify two different collectors to be used by both the New Generation and Old Generation; -XX:+UseParNewGC -XX:+UseConcMarkSweepGC. These specify the use of a Parallel Collector in the New Gen and a Concurrent Mark-Sweep Collector in the Old Gen. The specific behaviour of these different collectors can again be found in the turbo charging article.

Our final option of interest is -XX:CMSInitiatingOccupancyFraction=60. This option basically sets a threshold on the percentage of used memory in the Old Gen that is allowed to be reached before a CMS collection is invoked.

So what has been the overall effect of our changes on the GC characteristics of our application?

In this recording we can see the usual increase in the number of active sessions, in fact in this example the session count increases to 321:

After Tuning - Active Session Count

But instead of the normal dramatic increase in Old Gen GC activity, we see a brief spike before the Old Gen GC rate settles back into the uniform pattern that we had been seeing during the recording period:

After Tuning - Garbage Collection Rate

And what about the GC overhead? Well from the below diagram we can see that the Old Gen GC activity now adds a consistent load of ~22% on the application, even during the burst of activity that we saw when the session count increased from ~200 to ~320 sessions:

After Tuning - GC Overhead

What does this mean for the application?

What we have here is a trade off. By specifying the use of the CMS Collector for the Old Gen, or collector is running mostly concurrent with our application, instead of the stop-the-world approach of the default Mark-Compact Collector where user requests would be put on hold during a collection. The average overhead has increased but not to the point where our SLAs cannot be met.

The bottom line is that we now have an application that can scale to meet additional load. Where users were previously experiencing slowdown due to increased load, they now experience consistently high response times, almost irrespective of increases in load.

I would also like to mention now that we have consistent performance and GC activity during the course of a working day, we are continuing tweak the JVM settings; more specifically the amount of Heap allocated at runtime. I feel that 1.4Gb is excessive and based on the percentage of free Heap available during the day, the expected load and GC activity, with continued analysis we can squeeze a bit more performance from the application.

So the next time your application experiences slowdown, don’t jump the gun and the go tearing at the application code. Observe your application’s GC characteristics and determine whether or not it is in fact your application code that is the cause of the slowdown.

Preliminary Results of Tuning

Following a comment by Dmitri , I thought it a good idea to post some graphs that give a more complete picture of the before and the after of this tuning effort.

Before Tuning – New Generation Size, Garbage Collection Rate and Overhead

Before Tuning - New Gen Full Picture

After Tuning – New Generation Size, Garbage Collection Rate and Overhead

After Tuning - New Gen Full Picture

Before Tuning – Old Generation Size, Garbage Collection Rate and Overhead

Before Tuning - Old Gen Full Picture

After Tuning – Old Generation Size, Garbage Collection Rate and Overhead

After Tuning - Old Gen Full Picture

  1. #1 by Dmitri Maximovich on April 18, 2007 - 5:53 pm

    So correct me if I’m wrong, but you’re pretty much ‘bypassing’ tenuring mechanism and letting object to be copied to old generation almost immediately. Have you tried first just to enable concurrent garbage collector(s), possibly adjusting CMS threshold and memory allocation? Would it give you a worse results?

    How do you come up with value 448M for NewSize?

  2. #2 by Jay on April 19, 2007 - 10:41 am

    Hi Dmitri,

    You are correct. If an object survives a single GC cycle the object is promoted to the Old Generation.

    First I should say is that we are not done tuning the VM, and the options used here are being reviewed as time and resources permit.

    Our primary objective in this tuning exercise was to eliminate the occurrence of application slowdown due to non-application related VM activity. This was affecting users on an almost daily basis, so we had to try and alleviate this.

    With the information to hand and the fact that we needed results asap, we took the path of least resistance. I did not want to change the amount of memory allocated to the application, as it was originally increased to 1408M for some reason. So really I wanted to change the behaviour of the garbage collectors. Hence enabling the Parallel Collector and the CMS Collector and to get a clearer sense for what impact these would have I used an extreme Tenuring threshold value.

    I’ve added a results section to the blog which gives a more complete picture of the impact, to the New and Old Generations, of tuning the VM.

    Now that we have a stable environment, my next goal is to reduce the sequential overhead of GC activity as much as is possible. At the moment, we are looking at a consistent combined sequential overhead of ~10% that can handle bursts in load.

    I do agree that we may get better results by further tweaking the thresholding value and memory allocation, and if and when I make these changes I will post the results here.

    How did I come up with the value of 448M for the New Size? Before I made any changes to the VM options, I was able to get a ‘Current Size’ graph for the New Gen from PerformaSure. I could see that the New Gen grew from an initial value of ~210M to ~420M before the application experienced slowdown. As I said, I did not initially want to change the memory allocation for the application, rather just the GC behaviour, hence the New Gen size being ‘kept’ at 448M.

  3. #3 by Rich Taylor on June 24, 2010 - 5:09 pm

    Excellent write up, thanks. We’re doing some tuning as well and it was helpful to hear someone else’s approach to a similar problem.

  4. #4 by laboratory clearances on June 1, 2011 - 11:06 am

    Useful information shared..Iam very happy to read this article..thanks for giving us nice info.

  5. #5 by Wilson Tennis on August 23, 2011 - 3:10 am

    ok! very good!
    you are right!
    thank you!

  6. #6 by nike mercurial superfly on August 24, 2011 - 10:48 am

    An EA representative walked us through the game at this early stage:

  7. #7 by nike superfly cleats on August 27, 2011 - 10:06 am

    a hell of a lot of applications out in the wild running on older versions of the JVM.

  8. #8 by wholesale mac makeup on August 31, 2011 - 8:20 am

    Good, yours articles is very nice!

  9. #9 by Mac Cosmetics on August 31, 2011 - 10:41 am

    quite an amount of consistent load over the course of a working day

  10. #10 by Tennis Rackets on September 3, 2011 - 10:31 am

    your style of writing, please keep up the excellent work. I look forward to reading your future posts

  11. #11 by nike Total 90 on September 6, 2011 - 12:33 pm

    articulate blog !

  12. #12 by nike Total 90 on September 7, 2011 - 8:26 am

    diligent blog ! thank you !

  1. Java Performance Tuning in Ireland at DeCare Systems Ireland Blog
  2. DSI to Present at UK Oracle User Group Conference 2007 at DeCare Systems Ireland Blog

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: