Number Encoding II
To conclude my little foray into number encoding (see the presentation by Jeffrey Dean from Google titled “Challenges in Building Large-Scale Information Retrieval Systems” (video, slides)), here are a few conclusions:
- In terms of raw performance the “Varint Encoding” is much faster than “Byte-Aligned Variable-length Encodings” and I was able to get better numbers than Google got, most likely because I am using a different machine. It would be interesting to know what kind of machine/OS they used for their timings so I could do a direct comparison. My lookup array structure is different (and more compact) than Google’s, assuming I understood Google’s lookup array structure in the presentation.
- The “Byte-Aligned Variable-length Encodings” is faster if you are storing three numbers per posting, namely a document ID, a term position and a field ID. The “Group Varint Encoding” is faster if you are storing four number per posting, namely a document ID, a term position, a field ID and a weight.
- As I described in the last comment in the original post, two bits are used in the header for each varint to indicate its size in bytes, so 0, 1, 2 or 3 indicate whether your varint is 1, 2, 3 or 4 bytes long respectively. However if you store deltas a lot of the numbers you store will be 0, using a byte to store 0 seems wasteful to me. So I changed this so that the two bits indicate the actual number of bytes in the varint, and 0 bytes means 0. This way I don’t actually allocate space unless there is a value other than 0 to store. This saves about 10% in my overall index size, and a lot more if you only take the term postings into account because I store some amount of document metadata in my index. Of course this means that you can’t store a number greater than 16,777,216 which won’t happen unless you are creating huge indices with more than 16,777,216 documents in them or have documents longer that 16,777,216 terms.
Basically it comes down to trade-offs, index compactness vs. decode speed, and looking at speed both in test code (usually a contrived example) and performance on a real data set. I used the Wikipedia data for that along with 200 relatively complex searches designed to read lots of postings lists.
Number Encoding
A while back I came across this very interesting presentation by Jeffrey Dean from Google titled “Challenges in Building Large-Scale Information Retrieval Systems” (video, slides) where he talks about the various challenges that Google ran into over time as they scaled up and how they solved them.
This abstract sets the stage pretty well
Building and operating large-scale information retrieval systems used by hundreds of millions of people around the world provides a number of interesting challenges. Designing such systems requires making complex design tradeoffs in a number of dimensions, including (a) the number of user queries that must be handled per second and the response latency to these requests, (b) the number and size of various corpora that are searched, (c) the latency and frequency with which documents are updated or added to the corpora, and (d) the quality and cost of the ranking algorithms that are used for retrieval.
In this talk I’ll discuss the evolution of Google’s hardware infrastructure and information retrieval systems and some of the design challenges that arise from ever-increasing demands in all of these dimensions. I’ll also describe how we use various pieces of distributed systems infrastructure when building these retrieval systems.
Finally, I’ll describe some future challenges and open research problems in this area.
What particularly appealed to my (gnarly) mind were the various encoding algorithms they tried to encode numbers in their index (pages 54 to 63 of the slides).
Having written a search engine I was particularly interested in how they approached that. I actually settled for what they call “Byte-Aligned Variable-length Encodings” (slide 56) ten years ago having done a bunch of benchmarking (which I repeated every few years to see how well it fared on newer generations of processors.) The encoding is very compact and does very well when you have powerful CPUs and weak IO. I did some revisions over time in the C code which drives it to steer the optimizer, but it has held up well. I used this encoding for the Feedster search engine. As an aside one of the pluses of this encoding is that it is endian independent.
So I was very interested by the “Group Varint Encoding” (slide 63) specifically the claim that the decode time was faster.
Google was able to achieve a decode speed of ~180M numbers/second for the Byte-Aligned Variable-length Encodings and ~400M numbers/second for the Group Varint Encoding.
So I set out to replicate the results and ran into some interesting things.
The machines used for this are an Intel DX58SO motherboard with a Core i7 920 CPU, 2.66GHz, 6GB of RAM, Fedora 11, gcc 4.1.1, and a Mac Pro with 2 Dual Core Xeons, 2.66GHz, 13GB of RAM, Mac OS X 10.5.8, gcc 4.2.1.
The test code is pretty simple, one test writes and reads number from the same area of memory and the second test writes and reads a sequence of numbers over 1GB of memory. I added in verification code to check that the number read was the same as the number written to simulate some in between reading the data.
My results for Byte-Aligned Variable-length Encodings were better, I was able to get ~226M numbers/second on the DX58SO and ~190M numbers/second on the Mac Pro. Removing the verification code I was able to get ~426M and ~405M numbers/second respectively.
And the results for the Group Varint Encoding were much better, I was able to get ~680M numbers/second on the DX58SO and ~500M numbers/second on the Mac Pro. Removing the verification code made reading pretty much instantaneous (less than 1 microsecond to read 4.8B numbers.)
UPDATED October 9th, 2009 – Obviously my own BS meter was off when I wrote that last paragraphs, as I point out in the comments, the optimizer was doing away with most of the code because the results were not used. Preventing it from doing that gave me the same results as the ones I got with the verification code in place and was certainly not instantaneous.
Search User Interfaces
Great post by Greg Linden on a new book from Marti Hearst called “Search User Interfaces“.
The book is available online for free, or you can get it from your favorite bookseller.
The real Google killer
I was having a conversation this weekend with a colleague about WolframAlpha (and tangentially Cuil and Bing) and how it was (is?) billed as a Google killer.
My thought was that the real Google killer will not be search but better ad targeting.
Interviewed by Steve Arnold
I was lucky to be interviewed by Steve Arnold for his “Search Wizards Speak” feature. The full interview is here.
I have known Steve for 10+ years and he and I worked together on a project for a client in Rochester, NY.
“The Triumph of the Wisdom of the Mob.”
By way of Greg Linden, Nicholas Carr writes:
What we seem to have here is evidence of a fundamental failure of the Web as an information-delivery service. Three things have happened, in a blink of history’s eye: (1) a single medium, the Web, has come to dominate the storage and supply of information, (2) a single search engine, Google, has come to dominate the navigation of that medium, and (3) a single information source, Wikipedia, has come to dominate the results served up by that search engine.
Even if you adore the Web, Google, and Wikipedia – and I admit there’s much to adore – you have to wonder if the transformation of the Net from a radically heterogeneous information source to a radically homogeneous one is a good thing. Is culture best served by an information triumvirate?
It’s hard to imagine that Wikipedia articles are actually the very best source of information for all of the many thousands of topics on which they now appear as the top Google search result.
What’s much more likely is that the Web, through its links, and Google, through its search algorithms, have inadvertently set into motion a very strong feedback loop that amplifies popularity and, in the end, leads us all, lemminglike, down the same well-trod path – the path of least resistance. You might call this the triumph of the wisdom of the crowd. I would suggest that it would be more accurately described as the triumph of the wisdom of the mob.
I feel like we have seen this before with blogs where for a while they dominated the top ranking of search results, and at some point this was fixed. This is a little more complicated politically since this is about a single site rather than lots of smaller sites and Google does have a competitor to Wikipedia, so making any changes would necessarily be sensitive.
Clearly both Google and Wikipedia need better competition in the marketplace.
Lucene Gets Support
I have known for a while that Lucid Imagination was in the works, I guess this article makes it official:
“Lucene and Solr are probably some of the most underestimated players in this market,” said CEO Eric Gries. The Solr search server provides a front end to the core Lucene search engine library.
Lucene is in use at more than 4,000 organizations worldwide, according to Lucid Imagination. But companies that are starting to use the search engine for mission-critical applications can’t rely solely “on the good graces of the [open-source] community” for support and quick bug fixes, Gries said.
Demo Work
I have been spending most of this long weekend working on a demo which required me to use SOLR/Lucene. I have not looked at it in any depth for about a year and looks like things have moved along nicely since.
Two things which still irritate me, it seemed to be pretty hard to set up multiple indices on a server, maybe I am missing something, and there is no support for distributed searching, at least not built into the product.
Still, it is really easy to work with.
Full Text Indices in MySQL
Interesting post about full text indices in MySQL:
I rarely use MySQL Fulltext indexes. Their performance is just not good enough, so often its better to just stick with “LIKE” or move to something else like Sphinx, Lucene etc. The only nice thing about them is the ability to compute a match “rank”. Well anyways I had to write a new search plugin for a project that is based around MySQL Fulltext indexes and a match rank and all as well .. except that for some reasons some words just would not produce any results. As I was trying to find a pattern I finally noticed that in my test data some words were used in most rows and exactly those were not matching. Obviously it makes sense to exclude automatically any words that have a very high hit ratio. And indeed the documentation states that by default all words with a hit ratio of over 50% are excluded. Doh!
The performance is indeed not very good. Way back in time Feedster started of using the MySQL full text index and performance really sucked, we gave up after having added about 1 million posts to the database (and handling about 5000 searches a day), and moved to the full text search engine I had written for the task (in all fairness Scott and I were still merging the two systems we had developed, we kept his U/I and database schema, and kept my crawler and full text search engine.)
We moved to a system where the indexer would pull recently added data from the database, index that and make it available for searching, effectively making MySQL a repository. Of course it makes much more sense to have the crawler queue up data for the indexer and bypass the repository completely.





