Microbenchmarking Java – Compare Algorithms

Have a look at this presentation, where this post is presented as pitfall #3!

Lesson learned. See comments🙂

There are a lot of microbenchmarking tips out on the web, and their intent differs from author to author.

Today I want to show how you could compare, e.g., different algorithms. You could simply do:

int COUNT = 1000000;
long firstMillis = System.currentTimeMillis();
for (int i = 0; i < COUNT; i++) {
  runAlgorithm();
}
// total wall-clock time divided by the number of calls
System.out.println("Mean runtime of algorithm in seconds: "
    + (System.currentTimeMillis() - firstMillis) / 1000.0 / COUNT);

There are several problems with this approach, which lead to very unpredictable results:

  1. Make sure the variable COUNT is high enough (especially if runAlgorithm() is inexpensive). Then make sure you run this code several times. Additionally, you should measure the RMS, either for each runAlgorithm() call or at the end of this code snippet, for a better comparison with other algorithms.
  2. You should turn off the JIT compiler by specifying -Xint as a JVM option; otherwise your outcome could depend on the (unpredictable) JIT and not on your algorithm.
  3. You should start with enough memory, so specify e.g. -Xms64m -Xmx64m, because JVM memory allocation can get time consuming.
  4. Quit all other applications, or at least use
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long firstNanos = mx.getCurrentThreadCpuTime();

    instead of

    long firstMillis = System.currentTimeMillis();
  5. Avoid having the garbage collector run during your measurement: -Xnoclassg.
    But make sure you have enough memory! (A sketch that puts these points together follows this list.)
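
A minimal sketch of how these points could fit together (the class name, the number of runs, and the empty runAlgorithm() are placeholders; the JVM options from points 2, 3 and 5 go on the command line):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class AlgorithmBenchmark {

  static final int COUNT = 1000000; // high enough that one run takes a measurable time
  static final int RUNS = 10;       // repeat the whole measurement to judge stability

  public static void main(String[] args) {
    // start e.g. with: java -Xint -Xms64m -Xmx64m AlgorithmBenchmark
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();

    for (int run = 0; run < RUNS; run++) {
      long firstNanos = mx.getCurrentThreadCpuTime(); // CPU time of this thread only
      for (int i = 0; i < COUNT; i++) {
        runAlgorithm();
      }
      long elapsedNanos = mx.getCurrentThreadCpuTime() - firstNanos;
      System.out.println("Run " + run + ": mean runtime "
          + elapsedNanos / (double) COUNT + " ns per call");
    }
  }

  static void runAlgorithm() {
    // placeholder for the algorithm under test
  }
}

Comparing the smallest and largest of the printed per-run means then gives the stability figure mentioned below.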

After this I got relatively stable results: the difference between the maximal and minimal value of the result is less than 4% over 10 runs.

8 thoughts on “Microbenchmarking Java – Compare Algorithms”

  1. I see the technique described in this post as ultimately broken. It also contradicts the best practices of measuring Java performance. Normally I would just post a link like this: http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java , but I want you to understand the particular flaws in this particular approach.

    1. Best practices dictate that you should have several runs of the code before actually getting into the performance measurement. That is because the JVM dynamically compiles and optimizes (!) the code based on runtime information. In your particular example these compilation and optimization overheads are incorporated into the final metric, because both the unoptimized and the optimized code for runAlgorithm() run during the measurement. There should be an outer loop for benchmark iterations, and you should take the performance metrics of only the several latest iterations (if we are talking about measuring algorithm performance, not about measuring JIT compiler performance). A sketch of such a loop follows this comment.

    2. This is the biggest concern I have here. What’s the point of comparing algorithms with the JIT compiler turned off? This does not reflect real algorithm performance, since many major code optimizations would be missed. It does not even reflect asymptotic algorithm performance, because the interpreter is working. This way you can only measure how a particular interpreter works on particular code. No one runs Java code on the interpreter these days, so this comparison would be useless anyway. Moreover, the result depends heavily not only on the interpreter, but also on the Java bytecode compiler!

    3. Most modern JVMs adapt the heap size at runtime, and do that very cleverly. But yes, specifying a static amount of Java heap should give more predictability.

    4. Are you sure about ThreadMXBean accuracy? I’m sure only about System.currentTimeMillis() and System.nanoTime(). I see why you’re trying to use it: to capture the time spent in a particular thread. Having a clean environment (e.g. no other processes running) is a must for a fair performance comparison, and that negates the benefit of using ThreadMXBean over System.nanoTime().

    5. To circumvent the GC impact during the measurement, ask for a GC between the iterations with System.gc(). Though it’s not guaranteed to run, it’s the only fair control you have in the platform. Disabling GC during the run seems unfair: managing memory is an essential part of the algorithm, right? Eliminating memory management leads to comparisons of non-realistic algorithms.

    BTW, “-Xnoclassg” does NOT disable GC; it disables the specific “class unloading” feature, which can be either beneficial or detrimental.
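
    A minimal sketch of the warmup-plus-steady-state loop described in points 1 and 5 (the iteration counts are arbitrary and runAlgorithm() is a placeholder, not a full benchmark harness):

    public class SteadyStateBenchmark {

      static final int WARMUP_ITERATIONS = 5;    // discarded, lets the JIT reach a steady state
      static final int MEASURED_ITERATIONS = 10; // only these are reported
      static final int COUNT = 1000000;

      public static void main(String[] args) {
        for (int iter = 0; iter < WARMUP_ITERATIONS + MEASURED_ITERATIONS; iter++) {
          System.gc(); // request (not force) a collection between iterations
          long start = System.nanoTime();
          for (int i = 0; i < COUNT; i++) {
            runAlgorithm();
          }
          long elapsed = System.nanoTime() - start;
          if (iter >= WARMUP_ITERATIONS) { // report steady-state iterations only
            System.out.println("Mean runtime: " + elapsed / (double) COUNT + " ns per call");
          }
        }
      }

      static void runAlgorithm() {
        // placeholder for the algorithm under test
      }
    }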

  2. Thanks Aleksey for your comments!

    > “-Xnoclassg” does NOT disable GC,

    thanks for the clarification! How can one disable GC?

    > 1.

    I do not understand this point. That is the reason I disabled JIT. Or do you mean the JVM will optimize nevertheless?

    > 2.

    Okay, here you are right.
    But if we assume equally good compilation for all my algorithms, then skipping the compilation shouldn’t matter for *comparing* algorithms. Or what do you think?

    > 3.

    sure, but for my use case I don’t want a clever JVM

    > 4. … Having the clean environment is the must for fair performance comparison …

    of course. In practice I used the ThreadMXBean to increase the stability of my results

    > 5. managing memory is the essential part of algorithm, right?

    Hmmh, yes and no. Less memory is better, but memory *management* should be done by the JVM, I think, so I excluded it from the benchmark. To compare memory consumption I would use a profiler.

    Maybe my approach is only meant to be a CPU-usage test; other tests of memory management or the JIT compiler should be done separately, and added later.

    How would you compare your algorithms if one call of runAlgorithm() takes too long to be called several times (which would be needed to use the benchmark with JIT compilation)?

  3. (1) is about measuring in the presence of various startup-time effects. The most prominent one is JIT compilation; others include class loading and verification, etc. These are better swept out of the measurement. That’s why best practice proposes making several iterations and discarding the first few, focusing on the later ones, when the so-called “steady state” is reached.

    (2) If you have assumptions about the compilation results, why not have assumptions about the algorithm performance (and stop doing the benchmarking)?😉 The reality is, you are not comparing algorithms anyway! Even if you go for interpreter mode, you are stuck with the particular performance penalties of a particular interpreter implementation, not only with the algorithmic differences. You’ll have the interpreter executing together with your code (e.g. maintaining the stack-based VM state), which IMO blurs the algorithm performance differences. Compiled code, on the other hand, not only has the optimizations applied, but also does not have the interpreter behind it, thus stressing the code itself.

    (4) I’m not trusting ThreadMXBean because I don’t understand exactly how it works (e.g. what latency/variance/etc. it has). I would stick with the well-known System.currentTimeMillis() or System.nanoTime(), and recommend you do the same😉 If you’re trying to improve the predictability of your results, do it the scientific way: gather a dataset, apply statistics, compute the mean/variance, do a T-test (a small sketch of this follows the comment). Don’t resort to an instrument that just gives you a more predictable result if you don’t know how it works. It can turn out that ThreadMXBean is simply not accurate enough, so its results only *seem* to be more predictable.

    (5) Ehrm. I’m talking about comparing real algorithms with all the costs associated with using them. That said, I often presume that memory allocation/retention is part of the algorithm, because it is essential for the algorithm to work. Ok, garbage-collected languages let you not think about memory retention, but that does NOT eliminate the memory retention at all😉 Even though the Java runtime takes care of your garbage, it does not mean that the algorithm cost is now free of memory retention costs. The point here is, the GC cost of sweeping up the garbage after the algorithm should be accounted with the algorithm cost. BTW, you incorporate the memory allocation costs into the measurement (new Object(), right?😉).

    If runAlgorithm() takes so long to execute that you don’t have the time to run at least several measurements, you’re doomed!🙂 Try ahead-of-time compilation then. But I haven’t come across such bad situations. I have worked with benchmarks running for >24h, and still we managed to make several measurements.
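
    A small sketch of the “gather a dataset, compute the mean/variance” idea from point (4); the class and method names are made up, and samples would hold the per-iteration means (in ns per call) from the steady-state iterations:

    public class BenchmarkStats {

      // samples: per-iteration mean runtimes from the steady-state iterations
      static void report(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        double mean = sum / samples.length;

        double sqDiff = 0;
        for (double s : samples) sqDiff += (s - mean) * (s - mean);
        double stdDev = Math.sqrt(sqDiff / (samples.length - 1)); // sample standard deviation

        System.out.println("mean = " + mean + " ns, stddev = " + stdDev + " ns");
      }

      public static void main(String[] args) {
        report(new double[] {101.3, 99.8, 100.5, 102.1, 100.0}); // example values only
      }
    }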

  4. Ok Aleksey! You’ve convinced me to do an all-in-one, more complicated, microbenchmark.
    So, the only point I should be aware of is the warmup phase and putting a System.gc() before my inner loop?

    > These are better swept out of the measurement

    okay I understand!

    > you incorporate the memory allocation costs into the measurement (new Object(), right?😉)

    uhm, yes. true🙂

    > (2) If you have assumptions about compilation results

    I only meant that I assume the JIT compiler optimizes my first algorithm implementation nearly as well as my second one. I am relatively sure that this is the case if I code in the same ‘manner’ for both algorithms.

    > I have worked with benchmarks running for >24h

    hmmh this is not really an option for me🙂

    Again thank you very much Aleksey!

  5. I hope Aleksey is still reading this!🙂

    Now I tried to benchmark without the -Xint option, but I do not get results as stable as with that option.

    What could be wrong if my runAlgorithm() gets slower and slower (~5% on each call)?
    I tried calling System.gc() beforehand, but it does not help, and there is enough memory: more than 60 MB are free…

    • Wow, it is interesting. Something is going wrong there😉 Can you show me the code? aleksey [dot] shipilev [doggy] gmail [dot] com
