Number Encoding II

To conclude my little foray into number encoding (see the presentation by Jeffrey Dean from Google titled “Challenges in Building Large-Scale Information Retrieval Systems” (video, slides)), here are a few conclusions:

  • In terms of raw performance the “Varint Encoding” is much faster than the “Byte-Aligned Variable-length Encodings”, and I was able to get better numbers than Google reported, most likely because I am using a different machine. It would be interesting to know what kind of machine/OS they used for their timings so I could do a direct comparison. My lookup array structure is different from (and more compact than) Google’s, assuming I understood the lookup array structure described in the presentation.
  • The “Byte-Aligned Variable-length Encodings” is faster if you are storing three numbers per posting, namely a document ID, a term position and a field ID. The “Group Varint Encoding” is faster if you are storing four numbers per posting, namely a document ID, a term position, a field ID and a weight.
  • As I described in the last comment on the original post, two bits are used in the header for each varint to indicate its size in bytes, so 0, 1, 2 or 3 indicate that your varint is 1, 2, 3 or 4 bytes long respectively. However, if you store deltas, a lot of the numbers you store will be 0, and using a whole byte to store 0 seems wasteful to me. So I changed this so that the two bits indicate the actual number of bytes in the varint, with 0 bytes meaning the value 0. This way I don’t allocate any space unless there is a value other than 0 to store. This saves about 10% in my overall index size, and a lot more if you only take the term postings into account, because I also store some amount of document metadata in my index. Of course this means that you can’t store a number greater than 16,777,215 (the largest three-byte value), which won’t happen unless you are creating huge indices with more than 16,777,215 documents in them or have documents longer than 16,777,215 terms.
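To make the modified scheme above concrete, here is a minimal C sketch of a Group Varint encoder/decoder where the 2-bit selector holds the actual byte count (0–3), so a value of 0 costs no data bytes at all. This is my own illustration, not the code from the post; function names and layout are assumptions, and values must stay below 2^24.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Bytes needed for v under the modified scheme: 0 bytes for the value 0,
   up to 3 bytes otherwise. Caller must guarantee v < 2^24. */
static size_t gv_byte_len(uint32_t v) {
    if (v == 0) return 0;
    if (v < (1u << 8)) return 1;
    if (v < (1u << 16)) return 2;
    return 3;
}

/* Encode a group of 4 values (1 tag byte + 0..12 data bytes, little-endian).
   Returns the total number of bytes written. */
static size_t gv_encode_group(const uint32_t in[4], uint8_t *out) {
    uint8_t *p = out + 1;
    uint8_t tag = 0;
    for (int i = 0; i < 4; i++) {
        size_t n = gv_byte_len(in[i]);
        tag |= (uint8_t)(n << (i * 2));  /* 2-bit selector = actual byte count */
        uint32_t v = in[i];
        for (size_t b = 0; b < n; b++) { *p++ = (uint8_t)(v & 0xFF); v >>= 8; }
    }
    out[0] = tag;
    return (size_t)(p - out);
}

/* Decode a group of 4 values; returns the number of bytes consumed. */
static size_t gv_decode_group(const uint8_t *in, uint32_t out[4]) {
    const uint8_t *p = in + 1;
    uint8_t tag = in[0];
    for (int i = 0; i < 4; i++) {
        size_t n = (tag >> (i * 2)) & 0x3;
        uint32_t v = 0;
        for (size_t b = 0; b < n; b++) v |= (uint32_t)(*p++) << (b * 8);
        out[i] = v;
    }
    return (size_t)(p - in);
}
```

With delta-encoded postings a group like {0, 5, 300, 70000} encodes in 7 bytes (1 tag + 0 + 1 + 2 + 3 data bytes), which is where the space saving for zero deltas comes from.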

Basically it comes down to trade-offs: index compactness vs. decode speed, and measuring speed both in test code (usually a contrived example) and on a real data set. For the latter I used the Wikipedia data along with 200 relatively complex searches designed to read lots of postings lists.


4 Responses to Number Encoding II

  1. Pingback: Number Encoding III « François Schiettecatte’s Blog

  2. sbuettcher says:

    François, thanks a lot for your in-depth analysis. I recently implemented Group VarInt and, to my surprise, found that a simple bit-shift approach for decoding the selector values (i.e., the 2-bit length indicators at the beginning of each chunk) is just as fast, maybe even a little faster, than using a 256-way lookup table. I measured between 1 and 2 ns per posting for an index created from the TREC GOV2 collection (using an Opteron 154 with 2.8 GHz).
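The two decoding approaches the comment compares can be sketched as follows: extracting the 2-bit selectors with plain shifts and masks versus precomputing a 256-way lookup table indexed by the tag byte. This is my own illustration of the idea (standard Group Varint here, where selector s means s+1 bytes), not Stefan’s actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes in a group (tag byte + data) computed with shifts and masks. */
static unsigned gv_group_len_shift(uint8_t tag) {
    unsigned n = 1;  /* the tag byte itself */
    for (int i = 0; i < 4; i++)
        n += ((tag >> (i * 2)) & 0x3) + 1;  /* selector s => s+1 data bytes */
    return n;
}

/* The same quantity precomputed once into a 256-entry lookup table. */
static uint8_t gv_group_len_table[256];

static void gv_init_table(void) {
    for (int t = 0; t < 256; t++)
        gv_group_len_table[t] = (uint8_t)gv_group_len_shift((uint8_t)t);
}
```

The table avoids the loop at decode time at the cost of cache footprint; the comment’s point is that on some CPUs the handful of shift/mask instructions is just as fast, or faster, than the table probe.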

    Detailed numbers are here:

    Regarding your question of why your numbers are better than Google’s: I think you use a different index format. According to Jeff’s talk, their lists contain “flat” word positions without any reference to docids (this is known as a “schema-independent index”).



  3. François Schiettecatte says:

    Stefan, thanks, I will take a look at the numbers. Interesting that the bit shift was faster on the Opteron; it could be a CPU-specific thing. I have also found that applying different chip-specific compiler optimizations can have an effect.

    If you are curious about my code, I did make it available here:

    Comments/feedback very welcome.



  4. cruppstahl says:

    I know that your post is a bit old, but if you’re interested in an up-to-date benchmark of various codecs (including VarintGB) then here’s a good overview:
