Number Encoding

A while back I came across this very interesting presentation by Jeffrey Dean from Google titled “Challenges in Building Large-Scale Information Retrieval Systems” (video, slides) where he talks about the various challenges that Google ran into over time as they scaled up and how they solved them.

This abstract sets the stage pretty well:

Building and operating large-scale information retrieval systems used by hundreds of millions of people around the world provides a number of interesting challenges. Designing such systems requires making complex design tradeoffs in a number of dimensions, including (a) the number of user queries that must be handled per second and the response latency to these requests, (b) the number and size of various corpora that are searched, (c) the latency and frequency with which documents are updated or added to the corpora, and (d) the quality and cost of the ranking algorithms that are used for retrieval.

In this talk I’ll discuss the evolution of Google’s hardware infrastructure and information retrieval systems and some of the design challenges that arise from ever-increasing demands in all of these dimensions. I’ll also describe how we use various pieces of distributed systems infrastructure when building these retrieval systems.

Finally, I’ll describe some future challenges and open research problems in this area.

What particularly appealed to my (gnarly) mind were the various algorithms they tried for encoding numbers in their index (pages 54 to 63 of the slides).

Having written a search engine, I was particularly interested in how they approached that. I actually settled on what they call “Byte-Aligned Variable-length Encodings” (slide 56) ten years ago, after doing a bunch of benchmarking (which I repeated every few years to see how well it fared on newer generations of processors). The encoding is very compact and does very well when you have powerful CPUs and weak I/O. I made some revisions over time to the C code that drives it to steer the optimizer, but it has held up well. I used this encoding for the Feedster search engine. As an aside, one of the pluses of this encoding is that it is endian independent.
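For reference, a byte-aligned variable-length encoding along these lines can be sketched as follows. This is a minimal illustration using the common seven-bits-per-byte layout with a continuation bit, not my actual code; since whole bytes are written in a fixed order, the encoding is endian independent:

```c
#include <assert.h>

/* Minimal sketch of a byte-aligned variable-length ("varint") encoding:
   each byte carries 7 bits of payload, and the high bit signals that
   another byte follows. Function names are illustrative. */

/* Returns the number of bytes written. */
static int varint_write(unsigned int value, unsigned char *buf) {
    int len = 0;
    while (value >= 0x80) {
        buf[len++] = (unsigned char)(value | 0x80); /* set continuation bit */
        value >>= 7;
    }
    buf[len++] = (unsigned char)value;  /* final byte, high bit clear */
    return len;
}

/* Returns the number of bytes read; the per-byte branch in this loop is
   what makes the encoding slower to decode than Group Varint. */
static int varint_read(const unsigned char *buf, unsigned int *value) {
    unsigned int result = 0;
    int shift = 0, len = 0;
    while (buf[len] & 0x80) {
        result |= (unsigned int)(buf[len++] & 0x7F) << shift;
        shift += 7;
    }
    result |= (unsigned int)buf[len++] << shift;
    *value = result;
    return len;
}
```

Small numbers take one byte, larger numbers take more, and the decoder has to test every byte before it knows whether to continue.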

So I was very interested in the “Group Varint Encoding” (slide 63), specifically the claim that it decodes faster.

Google was able to achieve a decode speed of ~180M numbers/second for the Byte-Aligned Variable-length Encodings and ~400M numbers/second for the Group Varint Encoding.

So I set out to replicate the results and ran into some interesting things.

The machines used for this were an Intel DX58SO motherboard with a Core i7 920 CPU at 2.66GHz, 6GB of RAM, Fedora 11, and gcc 4.1.1; and a Mac Pro with two Dual-Core Xeons at 2.66GHz, 13GB of RAM, Mac OS X 10.5.8, and gcc 4.2.1.

The test code is pretty simple: one test writes and reads numbers from the same area of memory, and the second test writes and reads a sequence of numbers over 1GB of memory. I added verification code to check that each number read was the same as the number written, to simulate doing something with the data in between reads.

My results for the Byte-Aligned Variable-length Encodings were better: I was able to get ~226M numbers/second on the DX58SO and ~190M numbers/second on the Mac Pro. Removing the verification code, I was able to get ~426M and ~405M numbers/second respectively.

And the results for the Group Varint Encoding were much better: I was able to get ~680M numbers/second on the DX58SO and ~500M numbers/second on the Mac Pro. Removing the verification code made reading pretty much instantaneous (less than 1 microsecond to read 4.8B numbers).

UPDATED October 9th, 2009 – Obviously my own BS meter was off when I wrote that last paragraph. As I point out in the comments, the optimizer was doing away with most of the code because the results were not used. Preventing it from doing that gave me the same results as the ones I got with the verification code in place, and it was certainly not instantaneous.


11 Responses to Number Encoding

  1. cardman03 says:

    Perhaps some implementation (or hints) would be very useful.

    I cannot get more than 400 M ints/sec with Group Varint, a rate lower even than that of Variable Byte. During decompression, I use only the masks from the 256-entry table that Jeff mentioned in his presentation. I cannot figure out why number offsets are also stored.

    Regarding your last sentence, about transferring 4.8 billion integers in less than 1 microsecond: this means transferring some 19.2 GB within a microsecond, which far exceeds the Front Side Bus capacity.

  2. I would be happy to give you any suggestions I can as well as share code.

    I made a small modification to the 256-entry table: I actually store byte counts for each number in the group, which makes the table a lot smaller than it would be if you stored masks. The masks themselves are stored in a second table and are accessed using the byte count as an offset. Better to have two small tables than one large table containing both masks and byte counts.
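    The two-table layout can be sketched like this (illustrative names, not my actual code; it assumes the usual Group Varint tag layout where each two-bit field stores the length minus one):

```c
#include <assert.h>

/* A 256-entry table of per-number byte counts, decoded once from the
   two-bit fields of the tag byte, plus a tiny mask table indexed by
   byte count -- instead of one 256-entry table that stores four 32-bit
   masks per tag. */

static const unsigned int masks[5] = {
    0x00000000, 0x000000FF, 0x0000FFFF, 0x00FFFFFF, 0xFFFFFFFF
};

struct lengths { unsigned char len1, len2, len3, len4; };
static struct lengths lengths_table[256];

static void init_lengths(void) {
    int tag;
    for (tag = 0; tag < 256; tag++) {
        /* two bits per number; 0..3 maps to 1..4 bytes */
        lengths_table[tag].len1 = (unsigned char)(((tag >> 6) & 3) + 1);
        lengths_table[tag].len2 = (unsigned char)(((tag >> 4) & 3) + 1);
        lengths_table[tag].len3 = (unsigned char)(((tag >> 2) & 3) + 1);
        lengths_table[tag].len4 = (unsigned char)((tag & 3) + 1);
    }
}
```

    At decode time you look up the four byte counts from the tag, and use each byte count as the offset into the mask table.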

    I will recheck the numbers where I get all that speed. I did go over it a number of times and I still got great speed.

    The key to speed here is to avoid any branching, i.e., avoid decision points in the code. The “Byte-Aligned Variable-length Encoding” is “slow” because we need to decide whether to stop or carry on decoding at every byte. The “Group Varint Encoding” does not need that; all you are doing is offset-based lookups, type coercion, and masking.

    Does that help?

  3. I went back over the code and ran some more tests; what is obviously going on is that the optimizer is optimizing the loop away. Getting it not to do that has been a challenge because the loop runs a fixed number of iterations. When the loop is not optimized away, I get the same results as when the checks are in place.
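    One simple way to keep the compiler from eliminating the loop (a sketch, not the exact fix I used) is to fold the decoded values into a checksum that is used after the loop:

```c
#include <assert.h>

/* Fold every decoded value into a checksum; because the result is
   returned (and should be printed or checked by the caller), the
   compiler cannot treat the decode loop as dead code. */
static unsigned int fold_values(const unsigned int *values, int n) {
    unsigned int sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        sum ^= values[i];   /* result depends on every value */
    }
    return sum;
}
```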

  4. cardman03 says:

    Thank you very much.

    This was very helpful. Indeed, I knew that the existence of branches leads modern CPUs to mis-predict some of them. Hence I managed to remove all conditional statements and loops.

    In fact, here is what I do:

    int GetGVInt (unsigned char *buf, unsigned int *out) {
    	unsigned int tag = buf[0];
    	int l1 = LENGTHS[tag].len1;
    	int l2 = LENGTHS[tag].len2;
    	int l3 = LENGTHS[tag].len3;
    	int l4 = LENGTHS[tag].len4;
    	int tlen = l1 + l2 + l3 + l4 + 1;
    	int val1 = 0, val2 = 0, val3 = 0, val4 = 0;
    	val1 |= buf[1];
    	val1 |= buf[2] << 8;
    	val1 |= buf[3] << 16;
    	val1 |= buf[4] << 24;
    	val1 = val1 & MASKS[tag].mask1;
    	val2 |= buf[l1 + 1];
    	val2 |= buf[l1 + 2] << 8;
    	val2 |= buf[l1 + 3] << 16;
    	val2 |= buf[l1 + 4] << 24;
    	val2 = val2 & MASKS[tag].mask2;
    	val3 |= buf[l1 + l2 + 1];
    	val3 |= buf[l1 + l2 + 2] << 8;
    	val3 |= buf[l1 + l2 + 3] << 16;
    	val3 |= buf[l1 + l2 + 4] << 24;
    	val3 = val3 & MASKS[tag].mask3;
    	val4 |= buf[l1 + l2 + l3 + 1];
    	val4 |= buf[l1 + l2 + l3 + 2] << 8;
    	val4 |= buf[l1 + l2 + l3 + 3] << 16;
    	val4 |= buf[l1 + l2 + l3 + 4] << 24;
    	val4 = val4 & MASKS[tag].mask4;
    	*out++ = val1;
    	*out++ = val2;
    	*out++ = val3;
    	*out++ = val4;
    	return tlen;
    }

    LENGTHS is the first 256-entry table, holding the byte counts, whereas MASKS is where the masks are stored. Both are globals. The leading byte is called tag, buf is the compressed byte sequence, and out is an array where the decompressed values are stored.
    So for each integer, I get four bytes from the sequence (starting from the appropriate point) and then apply the corresponding mask.

    When these four lines are not present, the code runs rapidly.

    *out++ = val1;
    *out++ = val2;
    *out++ = val3;
    *out++ = val4;

    However, I need a way to return the decompressed values and everything I’ve tried decelerates decompression dramatically.

    You’ve been very helpful to me, and I would be very grateful if you could help me once more.

    Note: Loops also create branches; you need to check the conditional statement that controls the loop at each iteration.

  5. Here are some suggestions, in no particular order:

    – I would favor implementing this as a macro rather than a function; there is overhead to calling functions, and a macro makes sure the code is inlined rather than leaving it up to the optimizer.

    – It looks like you are using 4 bytes to store each number regardless of how long the number is, which will have two side effects: one, your index will take up a lot of space; and two, decoding will take more time because you have to read more stuff.

    – Bit shifting is expensive, and if you don’t need to transfer your data across different endian platforms it is not useful. I would treat an integer as a chunk of memory and mask against that. I know this is ugly and won’t work on a non-Intel platform, but you can do this to store an int:

    #define UTL_NUM_WRITE_VARINT(uiMacroValue, uiMacroSize, pucMacroPtr) { \
    unsigned char *pucMacroValuePtr = (unsigned char *)&uiMacroValue; \
    switch ( uiMacroSize ) { \
    case 4: *pucMacroPtr = *pucMacroValuePtr; pucMacroPtr++; pucMacroValuePtr++; \
    case 3: *pucMacroPtr = *pucMacroValuePtr; pucMacroPtr++; pucMacroValuePtr++; \
    case 2: *pucMacroPtr = *pucMacroValuePtr; pucMacroPtr++; pucMacroValuePtr++; \
    case 1: *pucMacroPtr = *pucMacroValuePtr; pucMacroPtr++; pucMacroValuePtr++; \
    } \
    }

    and this to decode it:

    #define UTL_NUM_READ_VARINT(uiMacroValue, uiMacroSize, pucMacroPtr) { \
    uiMacroValue = (unsigned int)*((unsigned int *)(pucMacroPtr)); \
    uiMacroValue &= uiVarintMaskGlobal[uiMacroSize]; \
    pucMacroPtr += uiMacroSize; \
    }

    to read the four varints:

    #define UTL_NUM_READ_VARINT_QUARTET(uiMacroValue1, uiMacroValue2, uiMacroValue3, uiMacroValue4, pucMacroPtr) { \
    struct varintSize *pvsVarintSizesGlobalPtr = pvsVarintSizesGlobal + pucMacroPtr[0]; \
    pucMacroPtr++; /* skip the tag byte */ \
    UTL_NUM_READ_VARINT(uiMacroValue1, pvsVarintSizesGlobalPtr->ucSize1, pucMacroPtr) \
    UTL_NUM_READ_VARINT(uiMacroValue2, pvsVarintSizesGlobalPtr->ucSize2, pucMacroPtr) \
    UTL_NUM_READ_VARINT(uiMacroValue3, pvsVarintSizesGlobalPtr->ucSize3, pucMacroPtr) \
    UTL_NUM_READ_VARINT(uiMacroValue4, pvsVarintSizesGlobalPtr->ucSize4, pucMacroPtr) \
    }

  6. cardman03 says:

    Thank you very much. Your code is a masterpiece. I’ll check it out tomorrow.

    I agree with your first and third points. However, my compression function does not use 4 bytes per integer. If it did, we would not be talking about compression! Instead, it allocates a number of bytes proportional to the integer’s value.

    I may be reading 4 bytes per integer from the compressed sequence during decompression, but applying the correct mask guarantees that the decoded value is correct. However, compared to your code, it seems slower since

    (i) I read more stuff.
    (ii) I make heavy use of bitwise shifting.

  7. I fully agree with you about my second point; I was wrong, I just did not read the code well enough.

    Thanks, I am not sure the code is a masterpiece but it works. The main thing I don’t like about it is that the integer cast assumes it is running on a little-endian machine (Intel x86). And the casting might make some purists wince, but I am OK with that; to me, bytes are bytes whatever they wind up looking like.
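    As a side note, the unaligned integer cast can be expressed with memcpy instead (a sketch; on x86 compilers typically turn this into the same single load, while sidestepping strict-aliasing and alignment pitfalls, though the result still depends on the machine's byte order):

```c
#include <assert.h>
#include <string.h>

/* Read 4 bytes as an unsigned int without casting through an
   unaligned pointer; memcpy is the aliasing-safe way to do this. */
static unsigned int read_u32(const unsigned char *p) {
    unsigned int v;
    memcpy(&v, p, sizeof(v));   /* compiles to a single load on x86 */
    return v;                   /* byte order is the host's */
}
```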

    The “Byte-Aligned Variable-length Encodings” is a much cleaner way of handling this because it is architecture neutral and does not impose any limits on the size of the numbers. It is slower, however; no getting away from that. I have redone the C decode code a number of times and have not gotten it to be as fast as the Varint. I have reduced the decode to a single statement-less “for” loop though.

    Let me know how it goes.

  8. cardman03 says:

    Hello again. As I said, your code is truly spiritual.

    I followed your guidelines and achieved almost instant decompression with this macro:

    #define GET_GVINT(cmpr_ptr, val1, val2, val3, val4) { \
    val1 = 0; val2 = 0; val3 = 0; val4 = 0; \
    unsigned int tag = cmpr_ptr[0]; \
    cmpr_ptr++; \
    val1 = (unsigned int)*((unsigned int *)(cmpr_ptr)); \
    val1 &= MASKS[tag].mask1; \
    cmpr_ptr += LENGTHS[tag].len1; \
    val2 = (unsigned int)*((unsigned int *)(cmpr_ptr)); \
    val2 &= MASKS[tag].mask2; \
    cmpr_ptr += LENGTHS[tag].len2; \
    val3 = (unsigned int)*((unsigned int *)(cmpr_ptr)); \
    val3 &= MASKS[tag].mask3; \
    cmpr_ptr += LENGTHS[tag].len3; \
    val4 = (unsigned int)*((unsigned int *)(cmpr_ptr)); \
    val4 &= MASKS[tag].mask4; \
    cmpr_ptr += LENGTHS[tag].len4; \
    }

    This code decompresses 500 million integers in less than 45 nanoseconds. Therefore, when decompressing into 4 integer variables, Group Varint achieves excellent rates. However, when decompressing into an integer array, this rate decreases dramatically. Of course, we need to decompress into arrays, because during query processing we need to keep portions (blocks) of an inverted list in memory. We need to store the values we have just decompressed somewhere for many reasons (intersection, scoring, caching, etc.).

    #define GET_GVINT(cmpr_ptr, dcmpr_ptr) { \
    unsigned int tag = cmpr_ptr[0]; \
    cmpr_ptr++; \
    *dcmpr_ptr = (unsigned int)*((unsigned int *)(cmpr_ptr)); \
    *dcmpr_ptr &= MASKS[tag].mask1; \
    *dcmpr_ptr++; \
    cmpr_ptr += LENGTHS[tag].len1; \
    *dcmpr_ptr = (unsigned int)*((unsigned int *)(cmpr_ptr)); \
    *dcmpr_ptr &= MASKS[tag].mask2; \
    *dcmpr_ptr++; \
    cmpr_ptr += LENGTHS[tag].len2; \
    *dcmpr_ptr = (unsigned int)*((unsigned int *)(cmpr_ptr)); \
    *dcmpr_ptr &= MASKS[tag].mask3; \
    *dcmpr_ptr++; \
    cmpr_ptr += LENGTHS[tag].len3; \
    *dcmpr_ptr = (unsigned int)*((unsigned int *)(cmpr_ptr)); \
    *dcmpr_ptr &= MASKS[tag].mask4; \
    *dcmpr_ptr++; \
    cmpr_ptr += LENGTHS[tag].len4; \
    }

    The code above decompresses the integers into an array at rates lower than 550M integers/second. Notice that the values of the integers are random numbers between 256 and 512. My CPU is a Core i7 920.

    The decompression loop is small and clear:

    for (i = 0; i < INTEGERS; i += 4) {
    GET_GVINT(cmpr_ptr, dcmpr_ptr);
    }
    I cannot say if this is the most efficient way. I have tested many alternatives, but this seems to be the fastest. Unless you have another rabbit in your hat.

  9. I am not sure about the 500 million integers in less than 45 nanoseconds; that sounds way too fast to me, maybe it is a typo and should be 1 second. If not, I think you are falling into the same trap I fell into, where the compiler optimized the code away. The way to test for that is to compile with -g and then with -O3; you should get a 5x-10x speedup, no more. I had to put in some code to check that the numbers had been read properly.

    I am not sure why decompressing to an array would be an issue unless this created some lookup issues. Personally I never use array references to point to things but use what I call roving pointers just like ‘dcmpr_ptr’ in your code, much easier to work with when walking over structures.

    In my code, I assigned the values to fields in a structure which is fast enough.

    I did notice something about your code: ‘*dcmpr_ptr++;’ does not need the dereference, ‘dcmpr_ptr++;’ would suffice. I suspect the optimizer is doing away with that anyway since there is no assignment.

    Something you might want to explore, if you are not doing it already, is to store deltas where possible; I do for document IDs and for term positions. This allows you to compress numbers even more and is alluded to in the Google presentation. You will see a drop in speed, but that will be counter-balanced by having to read less data from memory.
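    Delta (gap) coding for a sorted postings list can be sketched as follows (illustrative code, not my index format): store differences between consecutive document IDs instead of the IDs themselves, so the values to encode are smaller and fit into fewer varint bytes.

```c
#include <assert.h>

/* Encode a sorted ID list as gaps from the previous ID. */
static void delta_encode(const unsigned int *ids, unsigned int *deltas, int n) {
    int i;
    unsigned int prev = 0;
    for (i = 0; i < n; i++) {
        deltas[i] = ids[i] - prev;  /* gap from the previous ID */
        prev = ids[i];
    }
}

/* Restore the IDs with a running sum over the gaps. */
static void delta_decode(const unsigned int *deltas, unsigned int *ids, int n) {
    int i;
    unsigned int sum = 0;
    for (i = 0; i < n; i++) {
        sum += deltas[i];           /* running sum restores the ID */
        ids[i] = sum;
    }
}
```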

    I ran a number of real-life tests on a large index and found that the varint encoding increased the size of the index (versus variable-length encoding) by about 20%-30%, which had a negative impact on speed. In fact, I found that my implementation of variable-length encoding was faster than varint because less data was read. This meant that any speedup I got from varint was lost because of the increased data size.

  10. There are a couple of other “rabbits” I have been testing here. One is obvious, the other less so.

    The obvious one is to use deltas on document IDs and on term positions (assuming you store term positions). The four bytes in the varint suggest document ID, term position, field ID and relative weight to me, so I assume that you are storing similar data.

    The less obvious one is this. Two bits are used in the header for each varint to indicate its size in bytes, so 0, 1, 2 or 3 indicate whether your varint is 1, 2, 3 or 4 bytes long respectively. However if you store deltas a lot of the numbers you store will be 0, and using a byte to store 0 seems wasteful to me. So I changed this so that the two bits indicate the actual number of bytes in the varint, and 0 bytes means 0. This way I don’t actually allocate space unless there is a value other than 0 to store. This saves about 10% in my overall index size, and a lot more if you only take the term postings into account. Of course this means that you can’t store a number greater than 16,777,215, which won’t happen unless you are creating huge indices with more than 16,777,216 documents in them or have documents longer than 16,777,216 terms.
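    The size calculation for this modified scheme can be sketched as follows (an illustrative function, not my actual code): the two tag bits store the actual byte count, so a zero value occupies no payload bytes at all, at the cost of capping values at three bytes.

```c
#include <assert.h>

/* Byte count for a value under the zero-based length scheme:
   0 bytes for 0, else 1..3 bytes. Values must be below 2^24. */
static int varint_size_zero_based(unsigned int value) {
    if (value == 0)         return 0;  /* no payload bytes at all */
    if (value < (1u << 8))  return 1;
    if (value < (1u << 16)) return 2;
    return 3;                          /* 3 bytes is the maximum */
}
```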

  11. Pingback: Number Encoding II « François Schiettecatte’s Blog
