Tuesday, 24 December 2013

How to serialise an array of doubles with a byte (binary delta encoding for low-variance monotonic sets of floating point data)

Low latency systems require high performance message processing and passing. Since in most cases data has to be transferred over the wire or serialised for persistence, encoding and decoding messages become a crucial part of the processing pipeline. The best results in high performance data encoding usually come from applying knowledge of application data specifics. The technique presented in this article is a good example of how leveraging some aspects of the data benefits encoding in both latency and space.
Usually the largest messages passed around high frequency trading systems represent some state of an exchange in the form of an excerpt from the order book. That is, a typical market data (tick) message contains information identifying an instrument and a set of values representing the top of the order book. A mandatory part of that data set is price/exchange rate information expressed as real numbers (for the buy and sell sides).

What makes that data set very interesting from optimization perspective is that it is:
a) monotonic
b) low variance
c) non-negative*

So in practice we deal with sets of floating point numbers which are not only sorted (ascending for bid, descending for ask) but also have neighbouring values relatively "close" to each other. Keeping in mind that message processing latency is crucial for most trading systems, market ticks, as one of the key entities, need to be passed around as efficiently as possible.

Let me demonstrate how all these features can be exploited to encode numeric data into a very compact binary form. First, two teaser plots demonstrating the difference between the serialisation approaches:
So far we can see that delta encoding yields better encoding times than standard ByteBuffer-based serialisation but becomes slower than Kryo (one of the most popular high-performance serialisation libraries) as soon as array lengths exceed 40. What matters here, though, is the typical use-case scenario, which for high frequency trading market messages happens to fall into the 10-40 range of array lengths.
It gets much more interesting when we examine the resulting message sizes (as a function of array length):
The benefits of applying delta encoding should become obvious at this point (the blue curve applies equally to ByteBuffer and Kryo serialisation). What happens here is that, with some knowledge of the specifics of the data being processed, it is safe to make assumptions that result in serialisation being more CPU-intensive but dramatically more efficient space-wise. The idea behind delta compression is very simple. Let's start with an arbitrary array of integral numbers:

[85103, 85111, 85122, 85129, 85142, 85144, 85150, 85165, 85177]

If these were ints, without any compression we would need 4 * 9 = 36 bytes to store the data. What is particularly interesting about this set of numbers is that the values are clustered relatively close to each other. We could easily reduce the number of bytes required to store the data by keeping the first value as a reference and then producing an array of corresponding deltas:

Reference: 85103, [8, 19, 26, 39, 41, 47, 62, 74]

Wow! Now we can downsize to an array of bytes. Let's do the calculations again. We need 4 bytes for the reference value (which is still an int) plus 8 deltas * 1 byte each = 12 bytes.
That's quite an improvement on the original 36 bytes, but there is still room for some optimisation. Instead of calculating deltas from the reference value, let's store each value's difference from its immediate predecessor:

Reference: 85103, [8, 11, 7, 13, 2, 6, 15, 12]

The result is a non-monotonic set of numbers characterised by low variance and a low standard deviation. I hope it has already become clear where this is heading. Still, it's worth elaborating a little.
What we have essentially ended up with at this point is a set that's a perfect candidate for binary encoding. For our example that means it is possible to accommodate 2 deltas in a single byte: we only need a nibble (4 bits) to hold values in the 0-15 range, so we can ultimately squeeze the original array into 4 (for the reference) + 8 * 1/2 = 8 bytes.
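Putting the two steps together, here is a minimal sketch (using hypothetical helper names, not the article's actual classes) that packs the sample array into those 8 bytes by nibble-packing the successive deltas behind a 4-byte reference:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class NibbleDeltaDemo {

    // Packs values[0] as a 4-byte reference, then each successive delta as one nibble.
    static byte[] pack(int[] values) {
        int deltas = values.length - 1;
        ByteBuffer buf = ByteBuffer.allocate(4 + (deltas + 1) / 2);
        buf.putInt(values[0]);                        // 4-byte reference value
        int nibbles = 0;
        byte current = 0;
        for (int i = 1; i < values.length; i++) {
            int delta = values[i] - values[i - 1];    // successive (rolling) delta
            if (delta < 0 || delta > 15)
                throw new IllegalArgumentException("delta does not fit a nibble: " + delta);
            if (nibbles % 2 == 0) {
                current = (byte) (delta << 4);        // high nibble
            } else {
                current |= delta;                     // low nibble, byte complete
                buf.put(current);
            }
            nibbles++;
        }
        if (nibbles % 2 == 1) buf.put(current);       // flush a trailing odd nibble
        return buf.array();
    }

    // Reverses pack(): reads the reference, then accumulates nibble deltas.
    static int[] unpack(byte[] packed, int count) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        int[] values = new int[count];
        values[0] = buf.getInt();
        for (int i = 1; i < count; i++) {
            int b = packed[4 + (i - 1) / 2] & 0xFF;
            int delta = ((i - 1) % 2 == 0) ? (b >>> 4) : (b & 0x0F);
            values[i] = values[i - 1] + delta;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] values = {85103, 85111, 85122, 85129, 85142, 85144, 85150, 85165, 85177};
        byte[] packed = pack(values);
        System.out.println(packed.length);            // prints 8: 4 reference bytes + 4 nibble bytes
        System.out.println(Arrays.equals(values, unpack(packed, values.length)));
    }
}
```

A real protocol would of course have to record the delta width and array length in a header rather than hard-coding nibbles, which is exactly what the classes discussed below do.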
Since prices are expressed as decimal numbers, applying delta compression with binary encoding involves establishing a maximum supported precision and treating decimals as integrals (multiplying them by 10 ^ precision), turning 1.12345678 at a precision of 6 into the integer 1123456. So far all of this has been purely theoretical speculation backed by the teaser plots at the beginning of this article. I guess this is the right moment to demonstrate how these ideas may be implemented in Java with 2 very simple classes.
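The decimal-to-integral transformation just described can be sketched as follows (an illustrative helper, truncating anything beyond the chosen precision; a production encoder would have to pick an explicit rounding policy):

```java
public class PrecisionScaling {

    // Scales a decimal to an integral by multiplying by 10^precision (truncating).
    static long toIntegral(double value, int precision) {
        return (long) (value * Math.pow(10, precision));
    }

    // Recovers the decimal, up to the chosen precision.
    static double toDecimal(long scaled, int precision) {
        return scaled / Math.pow(10, precision);
    }

    public static void main(String[] args) {
        System.out.println(toIntegral(1.12345678, 6)); // prints 1123456
        System.out.println(toDecimal(1123456L, 6));    // recovers 1.123456 (up to precision)
    }
}
```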
We'll start with the encoding side:

A few words of introductory explanation before I go into details. This code is not a complete, fully-blown solution; its sole purpose is to demonstrate how easy it is to improve some bits of an application's serialisation protocol. Since it is subject to microbenchmarking, it also avoids inducing GC pressure, because the impact of even the quickest minor GC could severely skew the final results; hence the ugly API. The implementation is also highly sub-optimal, especially CPU-wise, but demonstrating micro-optimisations is not the goal of this article. Having said that, let's see what it does (line numbers in curly brackets).
The encode() method first does some fundamental sanity checks {17,18}, calculates the multiplier used to transform decimals to integrals {20} and then delegates to calculateDeltaVector(). This in turn has two effects.
1) It calculates rolling deltas for the whole set by transforming decimals to integrals, subtracting each value's predecessor from it and finally storing the results in a temporary array {33}
2) As a side effect, it works out the maximum number of bits required to represent a delta {34,36}

The encode() method then stores some metadata required for correct deserialisation. It packs the precision, the delta size in bits and the array length into the first 64 bits {24}. It then stores the reference value {25} and initiates binary encoding {27}.
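One possible shape for that 64-bit header is sketched below. The field widths here are illustrative assumptions, not the layout of the original listing:

```java
public class DeltaHeader {

    // Packs precision (16 bits), delta size in bits (16 bits) and array
    // length (32 bits) into a single long. Widths are an illustrative choice.
    static long pack(int precision, int deltaBits, int length) {
        return ((long) precision << 48)
             | ((long) deltaBits << 32)
             | (length & 0xFFFFFFFFL);
    }

    static int precision(long header) { return (int) (header >>> 48); }
    static int deltaBits(long header) { return (int) ((header >>> 32) & 0xFFFF); }
    static int length(long header)    { return (int) header; }
}
```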

Encoding deltas does the following:
1) Checks if it has already processed all array entries and exits if so {40}
2) Calculates the number of remaining bits to encode {41}
3) Chooses the most appropriate type (given its size in bits), encodes the remaining bits, and writes the bits to the buffer {43-46}
4) Recurses {47}

One last bit probably requiring some elaboration is the encodeBits() method itself. Based on the type size (in bits) passed in as an argument, it loops over a temporary long whose sole purpose is to serve as a bitset, and writes the bits representing consecutive deltas, moving from the most to the least significant part of the long value (scoped by the type size).
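The bit-packing idea behind encodeBits() can be sketched as follows (hypothetical names; deltas are assumed to already fit in `bits` bits, and slots are filled from the most to the least significant end of each long):

```java
public class BitPacker {

    // Packs each delta into a fixed-width slot of `bits` bits, MSB-first within each long.
    static long[] packBits(long[] deltas, int bits) {
        int perLong = 64 / bits;                            // slots per 64-bit word
        long[] out = new long[(deltas.length + perLong - 1) / perLong];
        for (int i = 0; i < deltas.length; i++) {
            int word = i / perLong;
            int shift = 64 - bits * (i % perLong + 1);      // most significant slot first
            out[word] |= deltas[i] << shift;
        }
        return out;
    }

    // Reverses packBits(): extracts `count` fixed-width slots.
    static long[] unpackBits(long[] words, int bits, int count) {
        int perLong = 64 / bits;
        long mask = (bits == 64) ? -1L : (1L << bits) - 1;
        long[] deltas = new long[count];
        for (int i = 0; i < count; i++) {
            int shift = 64 - bits * (i % perLong + 1);
            deltas[i] = (words[i / perLong] >>> shift) & mask;
        }
        return deltas;
    }
}
```

With the worked example's deltas and a 4-bit width, all eight values fit into a single long, matching the nibble-packing arithmetic above.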
Decoding is, as expected, the exact opposite of encoding and is mostly about using the metadata to reconstruct the original array of doubles up to the specified precision:

The source code with some test classes can be found here. Please bear in mind that even though it's proven to work, this code is certainly not production-ready. You could definitely make it work without the temporary array, replace the full array scan used to calculate the maximum delta size with something cleverer, or avoid that heavyweight division with a reciprocal approximation. Feel free to pick those hints up or apply different micro-optimisations and build your own proprietary delta encoding protocol. It makes a vast difference for latency-sensitive trading applications, reducing market data message size for liquid instruments by 20-30x. Of course you have to figure out for yourself whether switching to delta-compression binary encoding brings any value to your application ecosystem. Feel free to post comments with your findings!

Sunday, 8 September 2013


Finally, after years of reading Java and technology blogs I decided to start my own.
First few posts are just re-published articles from DZone's Javalobby from some time ago. All new entries will be posted exclusively here and published as links elsewhere.
I hope the contents will satisfy all Java performance enthusiasts.
If you like my articles, you are welcome to follow me on Twitter.
Have fun!

Saturday, 7 September 2013

Want to get faster with AtomicLong? Make it wait.

I often hear that Java atomic types (java.util.concurrent.atomic) are super-fast and play nicely with highly concurrent code. Most of the time, atomics do their job in a robust and efficient manner. However, there are scenarios in which the hidden cost of unmanaged contention on atomic types becomes a serious performance issue. Let's take a look at how java.util.concurrent.atomic.Atomic* types are implemented and what that design implies.
All atomic types, such as AtomicLong, AtomicBoolean, AtomicReference, etc., are essentially wrappers around volatile values. The added value comes from the internal use of sun.misc.Unsafe, which delivers CAS capabilities to those types.
CAS (compare-and-swap) is in essence an atomic instruction implemented by modern CPU hardware that allows for non-blocking, multi-threaded data manipulation in a safe and efficient manner. The huge advantage of CAS over locking is that CAS does not incur any kernel-level overhead, as no kernel arbitration takes place. Instead, the compiler emits CPU instructions such as lock cmpxchg, lock xadd, lock addq, etc. From a JVM perspective, this is as fast as invoking instructions gets.
In many cases, low-cost CAS provides an effective alternative to locking primitives, but there is a rapidly growing cost to using CAS in contended scenarios.
This issue has been examined in a very interesting paper by Dave Dice, Danny Hendler and Ilya Mirsky. I highly recommend reading the whole paper, as it contains a lot more valuable information than this short article.
I reproduced some concepts from the paper and put them under test. Many Java programmers should find the results quite revealing since there is a common misconception about atomics (CAS) performance.
The code implementing back-off contention management is fairly simple: instead of spinning over failed compare-and-swaps, it backs off for a very short period, letting other threads try their updates.
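A minimal sketch of such a back-off wrapper (illustrative, not the exact class that was benchmarked) could look like this:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Back-off AtomicLong sketch: on a failed CAS, park briefly instead of retrying
// immediately, which reduces cache-line ping-pong under heavy store contention.
public class BackOffAtomicLong {

    private static final long BACK_OFF_NANOS = 1L; // tuning knob; hardware-dependent

    private final AtomicLong value = new AtomicLong();

    public long get() {
        return value.get();
    }

    public long incrementAndGet() {
        for (;;) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
            LockSupport.parkNanos(BACK_OFF_NANOS); // back off before the next attempt
        }
    }
}
```

The back-off period is the interesting parameter here: too short and the threads still hammer the same cache line, too long and latency suffers under light contention.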
The tests were executed on a 64-bit Linux 3.5.0 (x86_64) with Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz (8 logical cores) using 64-bit Hotspot Java 1.7.0_25-b15.
As expected, for high load contention there was no drastic difference between the two implementations:

However, it gets far more interesting at high store contention. This scenario exposes the weakness of the optimistic retry approach employed by the AtomicLong implementation in Hotspot.

Similarly, with mixed reader/writer contention the benefits of lightweight access management are also clearly observable.

The results differ significantly when inter-socket communication is involved, but unfortunately I somehow lost the output from testing against the Intel Xeon-based hardware. Feel free to post results for different architectures/JVMs.