Ad Retrieval
I have been catching up on Daniel Tunkelang’s weblog and came across a write up about the SIGIR 2009 presentation by Vanja Josifovski.
The crux of the presentation is that treating ads as a document in IR works well if you evaluate the search over the ad corpus.
What is interesting is that I spent a weekend testing just that when I was at Feedster. I pulled the ads from our ad provider, did a little bit of cleanup and indexed them into an index. I think there were about 80,000 ads so this step was very quick, on the order of seconds.
I then tried would run sample searches on the index to see what ads were retrieved and it seemed to work pretty well.
I also tested a scenario which took the text from posts in a weblog and ran that text as feedback over the ads index. By feedback I mean that there was no search but just a ‘bag of terms’ used to rank the ads. Again this worked pretty well. I pulled a popular feed (which shall remain nameless) and good ads that were very relevant to the individual posts in the weblog. Even using a large number of terms for feedback was not a problem because the ads index was so small, searches would run on the order of milliseconds.
Unfortunately we did not use this which was unfortunate, but it did show me that it could be done.
Chromulated Thoughts
So a friend and I have been batting a few emails back and forth about ChromeOS after I downloaded a disc image and played around with it for about 30 minutes. I also listened to some of Paul Thurrott’s thoughts on it on Windows Weekly and went back to play with it a little more. Generally I found it to be pretty good if a little slow, but this was running on a virtual machine and it is a year away from release.
So here are some chromulated thoughts about why this is a big deal:
- This will open up new competition around the OS at the netbook end of the market. The first netbooks shipped with various versions of linux but Microsoft sensed that this was an issue and made sure that Windows XP and then Windows 7 became the dominant OS on those machines by bringing down licensing costs for netbook vendors. ChromeOS will re-open that conversation because it is free and it has Google’s backing. One possible outcome is that Windows will become free on netbooks.
- This is one of those rare moments when Microsoft is up against a competitor (Google) that is as motivated and has deep pockets as it does. Motivation and money will enable Google to keep developing and pushing ChromeOS past the point where other might give up. Microsoft is not a stranger to this strategy having been well served by it in the past where products did not hit their stride until the third or fourth version.
- Google is being very careful to present ChromeOS as something a user might run on a secondary computer, thereby avoiding directly challenging Microsoft (we all know what happened to the last company who did that!)
- Apple will probably not be affected by this since it is not (yet) present at that end of the market. An Apple notebook is not something people looking to buy a netbook will consider.
Longer Search Strings
I have been involved in search for a while (on quite a while), and first heard about search string lengths at the 1993 SIGIR conference at Pittsburgh. I believe it was Doug Cutting who was presenting some data on that (my memory is hazy here) and how we were slowing getting up to 2 search terms per search. I think this was an ad-hoc presentation so I have no reference for that, but I know that Amanda Spinks has done a number of studies on search lengths (Google search).
Steve Arnold believes the current number to be around 2.5 which I think is on target, and the trend is up.
Steve points to this study on Hitwise which show those trends, though I would have liked more depth to the figures so we could really see the trend.
Which is, in fact, hard to see, because the change is so slowwww.
At Feedster we were seeing around 2 terms per search, and I put in code to boost the relevance of documents where terms were very close together (ie a phrase) to get better results.
Circumvallation
I am have been reading Steve Arnold’s weblog on search, I have known Steve for over ten years now and he likes to challenge the status-quo and pushing people to see beyond the status-quo.
So it was very interesting to read his post about how Google is challenging Reed Elsevier and Thomson by indexing legal texts:
Google has added the full text of US federal cases and state cases. The coverage of the federal cases, district and appellate, is from 1924 to the present. US state cases cover 1950 to the present. Additional content will be added; for example, I have one source that suggested that the Commonwealth of Virginia Supreme Court will provide Google with CD ROMs of cases back to 1924. Google, according to this source, is talking with other sources of US legal information and may provide access to additional legal information as well.
His thesis is that the incumbents are like Vercingetorix stuck in Alesia (1) and that Google is like Ceasar who built two sets of wall around Alesia, one to keep the Gauls in, and the other to keep any relieving force out.
I like the analogy though it is not quite there, Google is not exactly laying siege and they don’t have to defend themselves. On the other hand the incumbents probably feel very much like the Gauls stuck in Alesia.
I was catching up with a long time friend earlier this week (a much smarter person than me), we were talking about lots of thing and one of those things was how particular species will move fluidly from one niche to another as they evolve. My feeling was that sometimes this happens in a fluid fashion without much struggle, but sometimes it can be quite violent resulting in the decimation of one or the other species (2). I wonder if this is closer to what is happening here. Google in moving in on an established market, though not in an explicitly deliberate fashion, and causing discomfort to the incumbents.
Now that a good portion of the data these incumbents charge for is available for free (it always was available for free, but access was difficult), it will likely force them to change their business model if they are to stay relevant. Steve makes that point very explicitly at he end of his post, for example:
Finally, what will be vulnerable to Google disruption will be difficult to use, expensive, and incomplete services. Maybe Reed Elsevier, Thomson Reuters, and Wolters Kluwer should merge. That will give the present crop of senior managers time to cash out. I don’t see an easy, quick, inexpensive, or painless way to prevent the lessons of Alesia being writ large in tomorrow’s digital headlines.
1 – I learned all about the siege of Alesia at when I was at school.
2 – For example, the introduction of Lionfish in the Caribbean is resulting in major population reduction in some indigenous species on the reefs.
Number Encoding II
To conclude my little foray into number encoding (see the presentation by Jeffrey Dean from Google titled “Challenges in Building Large-Scale Information Retrieval Systems” (video, slides)), here are a few conclusions:
- In terms of raw performance the “Varint Encoding” is much faster than “Byte-Aligned Variable-length Encodings” and I was able to get better numbers than Google got, most likely because I am using a different machine. It would be interesting to know what kind of machine/OS they used for their timings so I could do a direct comparison. My lookup array structure is different (and more compact) than Google’s, assuming I understood Google’s lookup array structure in the presentation.
- The “Byte-Aligned Variable-length Encodings” is faster if you are storing three numbers per posting, namely a document ID, a term position and a field ID. The “Group Varint Encoding” is faster if you are storing four number per posting, namely a document ID, a term position, a field ID and a weight.
- As I described in the last comment in the original post, two bits are used in the header for each varint to indicate its size in bytes, so 0, 1, 2 or 3 indicate whether your varint is 1, 2, 3 or 4 bytes long respectively. However if you store deltas a lot of the numbers you store will be 0, using a byte to store 0 seems wasteful to me. So I changed this so that the two bits indicate the actual number of bytes in the varint, and 0 bytes means 0. This way I don’t actually allocate space unless there is a value other than 0 to store. This saves about 10% in my overall index size, and a lot more if you only take the term postings into account because I store some amount of document metadata in my index. Of course this means that you can’t store a number greater than 16,777,216 which won’t happen unless you are creating huge indices with more than 16,777,216 documents in them or have documents longer that 16,777,216 terms.
Basically it comes down to trade-offs, index compactness vs. decode speed, and looking at speed both in test code (usually a contrived example) and performance on a real data set. I used the Wikipedia data for that along with 200 relatively complex searches designed to read lots of postings lists.
Number Encoding
A while back I came across this very interesting presentation by Jeffrey Dean from Google titled “Challenges in Building Large-Scale Information Retrieval Systems” (video, slides) where he talks about the various challenges that Google ran into over time as they scaled up and how they solved them.
This abstract sets the stage pretty well
Building and operating large-scale information retrieval systems used by hundreds of millions of people around the world provides a number of interesting challenges. Designing such systems requires making complex design tradeoffs in a number of dimensions, including (a) the number of user queries that must be handled per second and the response latency to these requests, (b) the number and size of various corpora that are searched, (c) the latency and frequency with which documents are updated or added to the corpora, and (d) the quality and cost of the ranking algorithms that are used for retrieval.
In this talk I’ll discuss the evolution of Google’s hardware infrastructure and information retrieval systems and some of the design challenges that arise from ever-increasing demands in all of these dimensions. I’ll also describe how we use various pieces of distributed systems infrastructure when building these retrieval systems.
Finally, I’ll describe some future challenges and open research problems in this area.
What particularly appealed to my (gnarly) mind were the various encoding algorithms they tried to encode numbers in their index (pages 54 to 63 of the slides).
Having written a search engine I was particularly interested in how they approached that. I actually settled for what they call “Byte-Aligned Variable-length Encodings” (slide 56) ten years ago having done a bunch of benchmarking (which I repeated every few years to see how well it fared on newer generations of processors.) The encoding is very compact and does very well when you have powerful CPUs and weak IO. I did some revisions over time in the C code which drives it to steer the optimizer, but it has held up well. I used this encoding for the Feedster search engine. As an aside one of the pluses of this encoding is that it is endian independent.
So I was very interested by the “Group Varint Encoding” (slide 63) specifically the claim that the decode time was faster.
Google was able to achieve a decode speed of ~180M numbers/second for the Byte-Aligned Variable-length Encodings and ~400M numbers/second for the Group Varint Encoding.
So I set out to replicate the results and ran into some interesting things.
The machines used for this are an Intel DX58SO motherboard with a Core i7 920 CPU, 2.66GHz, 6GB of RAM, Fedora 11, gcc 4.1.1, and a Mac Pro with 2 Dual Core Xeons, 2.66GHz, 13GB of RAM, Mac OS X 10.5.8, gcc 4.2.1.
The test code is pretty simple, one test writes and reads number from the same area of memory and the second test writes and reads a sequence of numbers over 1GB of memory. I added in verification code to check that the number read was the same as the number written to simulate some in between reading the data.
My results for Byte-Aligned Variable-length Encodings were better, I was able to get ~226M numbers/second on the DX58SO and ~190M numbers/second on the Mac Pro. Removing the verification code I was able to get ~426M and ~405M numbers/second respectively.
And the results for the Group Varint Encoding were much better, I was able to get ~680M numbers/second on the DX58SO and ~500M numbers/second on the Mac Pro. Removing the verification code made reading pretty much instantaneous (less than 1 microsecond to read 4.8B numbers.)
UPDATED October 9th, 2009 – Obviously my own BS meter was off when I wrote that last paragraphs, as I point out in the comments, the optimizer was doing away with most of the code because the results were not used. Preventing it from doing that gave me the same results as the ones I got with the verification code in place and was certainly not instantaneous.
Search User Interfaces
Great post by Greg Linden on a new book from Marti Hearst called “Search User Interfaces“.
The book is available online for free, or you can get it from your favorite bookseller.
The real Google killer
I was having a conversation this weekend with a colleague about WolframAlpha (and tangentially Cuil and Bing) and how it was (is?) billed as a Google killer.
My thought was that the real Google killer will not be search but better ad targeting.
Interviewed by Steve Arnold
I was lucky to be interviewed by Steve Arnold for his “Search Wizards Speak” feature. The full interview is here.
I have known Steve for 10+ years and he and I worked together on a project for a client in Rochester, NY.






leave a comment