Number Encoding

A while back I came across this very interesting presentation by Jeffrey Dean from Google titled “Challenges in Building Large-Scale Information Retrieval Systems” (video, slides) where he talks about the various challenges that Google ran into over time as they scaled up and how they solved them.

This abstract sets the stage pretty well

Building and operating large-scale information retrieval systems used by hundreds of millions of people around the world provides a number of interesting challenges. Designing such systems requires making complex design tradeoffs in a number of dimensions, including (a) the number of user queries that must be handled per second and the response latency to these requests, (b) the number and size of various corpora that are searched, (c) the latency and frequency with which documents are updated or added to the corpora, and (d) the quality and cost of the ranking algorithms that are used for retrieval.

In this talk I’ll discuss the evolution of Google’s hardware infrastructure and information retrieval systems and some of the design challenges that arise from ever-increasing demands in all of these dimensions. I’ll also describe how we use various pieces of distributed systems infrastructure when building these retrieval systems.

Finally, I’ll describe some future challenges and open research problems in this area.

What particularly appealed to my (gnarly) mind were the various encoding algorithms they tried to encode numbers in their index (pages 54 to 63 of the slides).

Having written a search engine I was particularly interested in how they approached that. I actually settled for what they call “Byte-Aligned Variable-length Encodings” (slide 56) ten years ago having done a bunch of benchmarking (which I repeated every few years to see how well it fared on newer generations of processors.) The encoding is very compact and does very well when you have powerful CPUs and weak IO. I did some revisions over time in the C code which drives it to steer the optimizer, but it has held up well. I used this encoding for the Feedster search engine. As an aside one of the pluses of this encoding is that it is endian independent.

So I was very interested by the “Group Varint Encoding” (slide 63) specifically the claim that the decode time was faster.

Google was able to achieve a decode speed of ~180M numbers/second for the Byte-Aligned Variable-length Encodings and ~400M numbers/second for the Group Varint Encoding.

So I set out to replicate the results and ran into some interesting things.

The machines used for this are an Intel DX58SO motherboard with a Core i7 920 CPU, 2.66GHz, 6GB of RAM, Fedora 11, gcc 4.1.1, and a Mac Pro with 2 Dual Core Xeons, 2.66GHz, 13GB of RAM, Mac OS X 10.5.8, gcc 4.2.1.

The test code is pretty simple, one test writes and reads number from the same area of memory and the second test writes and reads a sequence of numbers over 1GB of memory. I added in verification code to check that the number read was the same as the number written to simulate some in between reading the data.

My results for Byte-Aligned Variable-length Encodings were better, I was able to get ~226M numbers/second on the DX58SO and ~190M numbers/second on the Mac Pro. Removing the verification code I was able to get ~426M and ~405M numbers/second respectively.

And the results for the Group Varint Encoding were much better, I was able to get ~680M numbers/second on the DX58SO and ~500M numbers/second on the Mac Pro. Removing the verification code made reading pretty much instantaneous (less than 1 microsecond to read 4.8B numbers.)

UPDATED October 9th, 2009 – Obviously my own BS meter was off when I wrote that last paragraphs, as I point out in the comments, the optimizer was doing away with most of the code because the results were not used. Preventing it from doing that gave me the same results as the ones I got with the verification code in place and was certainly not instantaneous.

Light Peak

I came across Light Peak this morning on Engadget and MacRumors.

To me the really interesting thing is that this has the potential to replace all the connections on the computer, connections to external (and internal) drives, mouse, displays, whatever with a single cable and connector type, which would be really neat. No reason why you should not be able to replace network connections either.

I think this has real potential, can’t wait to see how this develops.

Caribbean Reef Shark

This is a nice shot of a Caribbean Reef Shark, we did see them on most of our dives (including night dives) and I got lots of shots of them. Most of the shots I got were not that good (I will blame the gear here as opposed to the photographer but I think it is fair, these shark move quite fast and the camera is very slow to focus and meter the shot.)

Two interesting things about this shot. The first is that I noticed that a lot of sharks were passing over me as I was lying on the sand below, passing quite close in fact, so I decided to lie on my back and take shots like that. I did feel quite vulnerable though. The second thing is that you can see the hull of the boat above shark just to the left of it.

Cute Turtle

I came across this cute turtle on one of our last dives, depth of maybe 50 feet, it was munching on a coral head, moving from one side to the other, not really paying too much attention to me and I was able to get quite close. In fact I just lay there for a while just watching it, letting it get comfortable with me and taking shots. At one point it turned towards me and I managed to take this shot. I was really close to it, a foot away at most.

Snow Leopard Bugs

I have run into three really irritating Snow Leopard bugs and one feature change which is only mildly irritating.

The three really irritating bugs are as follows:

  • The main display (I have three attached to my computer) will go black for no apparent reason, at least it did until I decided to do a clean install. This is a documented problem and seems to have gone away since I did a clean install but we shall see. If it does start up again I will back down to 10.5.8
  • Sometimes the video on the main display will “wobble” when it comes back on after being turned off for a while, not a show stopper because it only lasts 10-15 seconds but still.
  • The main display will “forget” its color calibration profile after being off for a certain period of time, this is probably linked with the “wobbly” display bug mentioned above.

Finally the mildly irritating feature change is that Snow Leopard now does not respect the creator code when opening a documents, it only respects the file name extension.

Search User Interfaces

Great post by Greg Linden on a new book from Marti Hearst called “Search User Interfaces“.

The book is available online for free, or you can get it from your favorite bookseller.

New iPods

Well the new iPod line-up is underwhelming, the shuffle got colors, the nano got new colors, a new finish (meh), a video camera (meh), an FM radio (double meh), I would have preferred more storage, and the touch got, well nothing except for more storage.