Thursday, 11 August 2011

Java Threads on Steroids

If you're much into concurrency, then you must have stumbled upon the disruptor concurrency framework engineered and open-sourced by LMAX.
Its performance was compared against ArrayBlockingQueue, which is considered one of the most effective, if not the most effective, queue implementations. The numbers are indeed pretty impressive:


I recommend downloading the source code and running the tests for yourself. Martin Fowler recently published a nice insight into the benefits the disruptor approach brings. There is also a great series of blog posts by Trisha Gee on disruptor internals, which is very helpful in understanding not only how this pattern works but also why it is so insanely fast.
But does having this new wonder-weapon in hand mean we have reached the limits of concurrent processing in Java?
Well, not necessarily; the beauty of the disruptor's approach lies in its non-blocking behaviour. The only sections involving concurrent reads and writes are handled with memory barriers (using volatile access). This is much better than locking, but it does not eliminate the problems connected to context switching.
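To make that concrete, here is a minimal sketch of the single-writer idea (my own illustration, not the disruptor's actual code): the producer publishes its progress through a volatile sequence number, and a consumer spins on it without taking any lock.

// A minimal sketch of lock-free publication via a volatile sequence;
// illustrative only, not the disruptor's actual implementation.
public class SequencePublisher
{
    private volatile long published = -1; // the memory barrier lives here

    // Single writer: the volatile write makes everything written
    // before it visible to threads that subsequently read 'published'.
    public void publish(final long sequence)
    {
        published = sequence;
    }

    // Consumer: busy-spins until the requested sequence is available.
    // A real implementation would also consider yielding or parking.
    public long waitFor(final long sequence)
    {
        long available;
        while ((available = published) < sequence)
        {
            // spin
        }
        return available;
    }
}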
In order to eliminate the cost of context switching, we would have to eliminate the switching itself. You can force a thread or a process to run only on a specified set of CPUs, thus reducing the probability of the kernel migrating it across all the cores available to the system. This is called processor affinity. There are several tools that let you set processor affinity in a very simple manner, e.g. Linux control groups or the taskset utility. But what if you want to be able to control CPU affinity for individual Java threads?
One way would be to use the RealtimeThread class from the Real-Time Specification for Java, but that would imply using a non-standard JVM implementation. A poor man's solution could involve using JNI to make native calls to the kernel's sched_setaffinity, or pthread_setaffinity_np when using the POSIX API; a sketch of that idea follows.
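To make the poor man's variant concrete, below is a minimal sketch of such a binding. Note that it goes through the JNA library rather than hand-written JNI glue, which is my shortcut for brevity, not necessarily what was used for the measurements below; it assumes a Linux host, JNA on the classpath, and at most 64 CPUs.

import com.sun.jna.Library;
import com.sun.jna.Native;

// A sketch of pinning the calling Java thread to one CPU via libc's
// sched_setaffinity; assumes Linux, JNA on the classpath, < 64 CPUs.
public class ThreadAffinity
{
    private interface CLib extends Library
    {
        CLib INSTANCE = (CLib) Native.loadLibrary("c", CLib.class);

        // int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
        // size_t is narrowed to int here for brevity
        int sched_setaffinity(int pid, int cpusetsize, long[] mask);
    }

    /** Pins the calling thread to the given 0-based CPU index. */
    public static void pinCurrentThreadTo(final int cpu)
    {
        final long[] mask = {1L << cpu}; // one bit per CPU
        // pid 0 means "the calling thread" to sched_setaffinity
        if (CLib.INSTANCE.sched_setaffinity(0, 8, mask) != 0)
        {
            throw new IllegalStateException("sched_setaffinity failed for CPU " + cpu);
        }
    }
}

A worker thread would call ThreadAffinity.pinCurrentThreadTo(n) as the first thing in its run() method. To cut the theoretical considerations and learn the practical implications of applying this approach, let's take a look at the results.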


This screenshot shows the load on all CPUs while the tests were running with default processor affinity. You can see frequent changes in the individual CPU loads; this is due to the workload being dynamically distributed among the CPUs by the system scheduler.


This, in turn, shows how the load was distributed when the worker threads were pinned to their dedicated CPUs with fixed processor affinity.

And to illustrate the difference in terms of performance, the numbers below show the operations per second achieved with each approach:


The results not only show a significant throughput benefit from the fixed processor affinity approach, but they also exhibit virtually realtime characteristics, offering the extremely stable and predictable behaviour required by realtime systems.
Some details:

  • The test being executed was UniCast1P1CPerfTest from the disruptor performance test suite
  • There were 60 runs with 50,000,000 iterations each
  • CPUs were additionally occupied by handling IRQs, so reconfiguring IRQ load balancing with IRQBALANCE_BANNED_CPUS could render slightly better results
  • The exact number of context switches can be measured using SystemTap or by examining the ctxt value in /proc/stat (see the sketch after this list)
  • You can achieve better results by employing Linux cgroups to separate the application workload from system tasks, assigning two separate resource pools to the two groups
  • These results should not be considered a magic trick that will speed up your application in every possible scenario; it will be effective only for CPU-intensive use cases
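As referenced in the list above, here is a small sketch of sampling the system-wide context-switch counter, the "ctxt" line in /proc/stat; take a reading before and after a run and subtract (Linux only).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// A minimal sketch of reading the cumulative context-switch count
// since boot from /proc/stat; Linux only.
public class CtxtSwitches
{
    public static long read() throws IOException
    {
        final BufferedReader reader = new BufferedReader(new FileReader("/proc/stat"));
        try
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                if (line.startsWith("ctxt "))
                {
                    return Long.parseLong(line.substring(5).trim());
                }
            }
        }
        finally
        {
            reader.close();
        }
        throw new IOException("no ctxt line found in /proc/stat");
    }

    public static void main(String[] args) throws Exception
    {
        final long before = read();
        Thread.sleep(1000); // run your workload here instead
        System.out.println("context switches: " + (read() - before));
    }
}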


    Sunday, 10 April 2011

    Abstraction slows you down

    Update

    As there have been several comments about my microbenchmark, I feel obliged to give a common reply.
    Thank you for pointing out the flaws in my test implementation. This time I used a snippet from Maciej Gorączka to illustrate the pitfalls of optimization and, at the same time, to show the real impact of interface calls. Of course, it had to be modified, as it also suffered from the optimization pitfalls.
    There are several things in the area of method call optimizations one has to keep in mind when measuring performance. Some of them have been described in the performance techniques chapter of the HotSpot internals wiki.
    With my original benchmark I did not take the following into account:
    Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types).
    So there's a lot of truth in saying microbenchmarks can be really tricky.
    As an example of how forgetting about the optimizer may mess up your tests, consider the code below:
    import java.math.BigDecimal;
    import java.util.Random;
    
    public class OptimizationTest
    {
        private static final long ITERATIONS = 1000 * 1000 * 1000;
        private static final int RUNS = 1;
        private static final SomeInterface[] IFACES = new SomeInterface[] {new Impl1(), new Impl2()}; // two different classes behind the interface
        private static final Impl1[] IMPLS = new Impl1[] {new Impl1(), new Impl1()}; // two instances of the same concrete class
        private static final Random RANDOM = new Random();
        private static final int CLASS_COUNT = 2;
    
        public static void main(String[] args)
        {
            performTestIface(2); //warm up
    
            final long implTime = performTestImpl(RUNS);
            final BigDecimal runs = new BigDecimal(RUNS);
            final BigDecimal implAvg = new BigDecimal(implTime).divide(runs);
            System.out.println("Invokevirtual: " + implAvg + " ms");
    
            final long ifaceTime = performTestIface(RUNS);
            final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(runs);
            System.out.println("Invokeinterface: " + ifaceAvg + " ms");
        }
    
        private static long performTestIface(final long runs)
        {
        final SomeInterface iface = IFACES[RANDOM.nextInt(CLASS_COUNT)]; // random implementation, may differ between JVM runs
            long ifaceTime = 0;
    
            for (int run = 0; run < runs; run++)
            {
                System.gc();
                long start = System.currentTimeMillis();
                for (int i = 0; i < ITERATIONS; i++)
                {
                    iface.doSomething(i);
                }
                ifaceTime += System.currentTimeMillis() - start;
            }
            return ifaceTime;
        }
    
        private static long performTestImpl(final long runs)
        {
        final Impl1 impl = IMPLS[RANDOM.nextInt(CLASS_COUNT)]; // always an Impl1, so the call site stays monomorphic
            long implTime = 0;
    
            for (int run = 0; run < runs; run++)
            {
                System.gc();
                long start = System.currentTimeMillis();
                for (int i = 0; i < ITERATIONS; i++)
                {
                    impl.doSomething(i);
                }
                implTime += System.currentTimeMillis() - start;
            }
            return implTime;
        }
    
    
        static interface SomeInterface
        {
            void doSomething(int i);
        }
    
        static class Impl1 implements SomeInterface
        {
            private int i;
    
            public void doSomething(final int i)
            {
                this.i = i;
            }
        }
    
        static class Impl2 implements SomeInterface
        {
            private int i;
    
            public void doSomething(final int i)
            {
                this.i = i;
            }
        }
    
    }
    
    Please make sure to run it 10+ times; each execution starts a fresh JVM, and the randomly picked interface implementation differs between them. I hope you'll have fun with the results.
    Now, the explanation.
    When run with the -XX:+PrintCompilation flag, the test outputs either:
     142    9       test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
     142    1%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
     1701   2%      test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
    Invokevirtual: 745 ms
     2447   10      test.InvokeInterfaceTest2::performTestIface (77 bytes)
    Invokeinterface: 722 ms
    
    or:
     65     3       test.InvokeInterfaceTest2$Impl2::doSomething (6 bytes)
     66     1%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
     1523   4       test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
     1523   2%      test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
    Invokevirtual: 732 ms
     2255   5       test.InvokeInterfaceTest2::performTestIface (77 bytes)
     2262   1%     made not entrant  test.InvokeInterfaceTest2::performTestIface @ -2 (77 bytes)
     2268   3%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
    Invokeinterface: 1816 ms
    
    Note that HotSpot complains about the method being made not entrant, which means that the previously compiled method is no longer valid. This happens because we are using two different implementations of the same interface, just like in the faster example, but this time the class implementing the interface changed between the warm-up and the measured run, so the compiled method had to be deoptimized.
    It becomes even clearer when we look at how the optimization has been implemented by the compiler. In the fast run we get this native code for performTestIface:
    126		B16: #    N1 <- B2  Freq: 1.01326e-06
    126		movl    RSI, #-28    # int
    12b		movq    RBP, [rsp + #0]    # spill
    12f		movl    [rsp + #4], RAX    # spill
    133		call,static  wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant')
            # OptimizationTest::performTestIface @ bci:10  L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ L[5]=_ L[6]=_ L[7]=_ L[8]=_ STK[0]=rsp + #8 STK[1]=rsp + #4
            # OopMap{[8]=NarrowOop off=312}
    138		int3    # ShouldNotReachHere
    
    The code gets recompiled after the number of subsequent monomorphic calls reaches PerMethodTrapLimit or PerBytecodeRecompilationCutoff. So for the slower run the code gets recompiled to:
    14f   B16: #    N250 <- B9  Freq: 0.25578
    14f       movl    RSI, #-58    # int
    154       movq    [rsp + #16], RBX    # spill
    159       movq    [rsp + #32], R14    # spill
    15e       movq    [rsp + #40], R13    # spill
    163       call,static  wrapper for: uncommon_trap(reason='bimorphic' action='maybe_recompile')
            # OptimizationTest::performTestIface @ bci:49  L[0]=rsp + #0 L[1]=_ L[2]=rsp + #40 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP STK[0]=rsp + #40 STK[1]=RBP
            # OopMap{[40]=Oop off=360}
    168       int3    # ShouldNotReachHere
    168
    16d   B17: #    B3 B18 <- B2  Freq: 0.16983
    16d       decode_heap_oop_not_null RSI,R10
    171       movq    rdi, [RSI + (sizeof(oopDesc) + Klass::secondary_supers_offset_in_bytes())]
        movl    rcx, [rdi + arrayOopDesc::length_offset_in_bytes()]    # length to scan
    addq    rdi, arrayOopDesc::base_offset_in_bytes(T_OBJECT)    # Skip to start of data; set NZ in case count is zero
        repne   scasq    # Scan *rdi++ for a match with rax while cx-- != 0
        jne,s   miss        # Missed: flags nz
        movq    [RSI + (sizeof(oopDesc) + Klass::secondary_super_cache_offset_in_bytes())], RAX    # Hit: update cache
        miss:   
    2a7       je     B3  P=0.999999 C=-1.000000
    
    B18: #    N250 <- B17  Freq: 1.6983e-07
    2ad       movl    RSI, #-83    # int
    2b2       movq    [rsp + #8], R13    # spill
    2b7       movq    [rsp + #16], RBX    # spill
    2bc       movq    [rsp + #32], R14    # spill
    2c1       nop     # 2 bytes pad for loops and calls
    2c3       call,static  wrapper for: uncommon_trap(reason='unreached' action='reinterpret')
            # OptimizationTest::performTestIface @ bci:36  L[0]=rsp + #0 L[1]=_ L[2]=rsp + #8 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP
            # OopMap{[8]=Oop off=712}
    2c8       int3    # ShouldNotReachHere
    
    Usually in high-performance applications, interfaces reflecting patterns like observer or listener are called in a loop, with implementations changing frequently; that is why I believe this problem may have practical implications. Thank you, gents, for the remarks. I obviously failed to supply a flawless test proving my arguments the first time. I hope you'll find this one solid and entertaining.
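    For illustration, here is a hypothetical shape of the observer-style call site described above (my sketch, not something benchmarked in this post): one interface-typed invocation in a hot loop, with receivers of different concrete classes behind it.

    import java.util.ArrayList;
    import java.util.List;

    // A hypothetical observer-style dispatcher: a single call site invoking
    // listeners of different concrete classes, which keeps it polymorphic.
    public class EventDispatcher
    {
        interface Listener
        {
            void onEvent(long value);
        }

        private final List<Listener> listeners = new ArrayList<Listener>();

        public void register(final Listener listener)
        {
            listeners.add(listener);
        }

        public void dispatch(final long value)
        {
            // one call site, many receiver classes: hard to devirtualize
            for (int i = 0, n = listeners.size(); i < n; i++)
            {
                listeners.get(i).onEvent(value);
            }
        }
    }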

    End of update

    I think we all like to write nice code; we very often employ ideas such as composition, separation of concerns, and so on. Above all, we quite correctly tend to introduce various levels of abstraction by using interfaces and abstract classes. This approach is perfectly valid for the vast majority of scenarios, but it may lead to performance penalties when we're dealing with low-latency requirements. Every time we introduce another level of abstraction, we also add complexity to the procedure by which the JVM resolves the target method to be called. Methods in Java are not invoked directly, due to the polymorphic nature of the language. For instance:
    final Set set = getImpl(); // returns different implementations 
    set.add(new Object());
    
    
    will have different implications than:
    final HashSet set = getImpl(); // returns different implementations 
    set.add(new Object());
    The former would get translated into:
     ...
    NEW java/lang/Object
    DUP
    INVOKESPECIAL java/lang/Object.<init> ()V
    INVOKEINTERFACE java/util/Set.add (Ljava/lang/Object;)Z
    ...
    While the latter would look more like:
     ...
    NEW java/lang/Object
    DUP
    INVOKESPECIAL java/lang/Object.<init> ()V
    INVOKEVIRTUAL java/util/HashSet.add (Ljava/lang/Object;)Z
    ...
    The difference in the JVM's behaviour for these two cases may not be obvious even after a thorough reading of the instruction specifications. The procedure for looking up a particular method in the virtual method table is almost identical, but the open character of the INVOKEINTERFACE semantics renders some optimizations impossible. Let's consider the examples below. In the first case:
    class Parent {
       public void method1() { }
       public void method2() { }
    }
    
    class Child extends Parent {
       public void method2() { } // overridden method
       public void method3() { }
    }
    This setup will result in a virtual method table that goes something like:
    Parent
     1 Parent.method1
     2 Parent.method2
    Child
     1 Parent.method1
     2 Child.method2
     3 Child.method3
    Here we can observe how the virtual method table for the Child class preserves the order of methods from its parent class's table. It only overrides the references to the overridden methods, thus enabling monomorphic calls and optimizations that hard-link to target methods and completely eliminate the need for a method table lookup on each call. This, in turn, provides the means for inlining methods, which can lead to a nice performance boost. Now, let's look at the case for interfaces:
    interface SomeInterface {
       void someMethod();
    }
    
    class Parent {
       public void method1() { }
       public void method2() { }
    }
    
    class Child extends Parent implements SomeInterface {
       public void method2() { } // overridden method from Parent
       public void someMethod() { }
    }
    
    class OtherClass implements SomeInterface {
       public void method3() { }
       public void someMethod() { }
    }
    Virtual method table for the above would resemble this:
    Parent
     1 Parent.method1
     2 Parent.method2
    Child
     1 Parent.method1
     2 Child.method2
     3 SomeInterface.someMethod
    OtherClass
     1 OtherClass.method3
     2 SomeInterface.someMethod
    As we can see, the index of the interface method someMethod() is not preserved across implementations: for the Child class it is method #3, but for OtherClass the entry for this method is found under #2. The consequence of introducing these megamorphic calls is the additional cost of a virtual method table lookup with each method call. To illustrate the impact of megamorphic calls, we can run a simple microbenchmark:

    Edit: the code below is obviously flawed (see the update above). Use the one from the top of the article.

    import java.math.BigDecimal;
    
    public class InvokeInterfaceTest
    {
        private static final long ITERATIONS = 1000 * 1000 * 1000;
        private static final int RUNS = 10;
        private static long implTime;
        private static long ifaceTime;
    
        public static void main(String[] args)
        {
            performTest(2); //warm up
            ifaceTime = implTime = 0;
            performTest(RUNS);
            final BigDecimal implAvg = new BigDecimal(implTime).divide(new BigDecimal(RUNS));
            final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(new BigDecimal(RUNS));
            System.out.println("Invokevirtual: " + implAvg + " ms");
            System.out.println("Invokeinterface: " + ifaceAvg + " ms");
        }
    
        private static void performTest(final long runs)
        {
            final FooImpl impl = new FooImpl();
            final Foo iface = new FooImpl();
    
            for (int run = 0; run < runs; run++)
            {
                System.gc();
                long start = System.currentTimeMillis();
                for (int i = 0; i < ITERATIONS; i++)
                {
                    impl.doSomething(i);
                }
                implTime += System.currentTimeMillis() - start;
    
                System.gc();
                start = System.currentTimeMillis();
                for (int i = 0; i < ITERATIONS; i++)
                {
                    iface.doSomething(i);
                }
                ifaceTime += System.currentTimeMillis() - start;
            }
        }
    
        private static interface Foo
        {
            void doSomething(int i);
        }
    
        private static class FooImpl implements Foo
        {
            private int i;
    
            public void doSomething(final int i)
            {
                this.i = i; //avoid optimization
            }
        }
    }
    
    On my laptop, with an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz and Java 1.6.0_26-b03 for 64-bit Linux, this gives the following results:
    Invokevirtual: 735.4 ms
    Invokeinterface: 1412.4 ms
    In order to minimize the influence of other processes running on my system, I pinned the test thread to a single core, rendering that core unavailable to other processes and achieving more stable results (as described in my previous post about Java threads). So, if you have to deal with a huge number of invocations and latency is a crucial factor, then you will probably have to consider eliminating megamorphic calls at the cost of a less abstract, somewhat dodgy design.