Greg Linden mentions an interesting paper out of Yahoo Research, presented at SIGIR 2008: “ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines”.
The excerpt that Greg republishes in his post talks about how using a pruned index undermined having a results cache on the main index, since only singleton searches would ever reach the main index.

That got me thinking about the index we had at Feedster. I implemented the search engine there and managed its day-to-day running, so I had quite a bit of say in how it was organized.
The index was broken down into segments (shards, if you like) of 1 million entries each. Entries were added to a segment in the order they were crawled, with new index data being generated every 10 minutes; in other words, we would add new entries to a segment until it reached 1 million entries and then start a new one. We had only a limited number of machines on which to run the search engines, so we organized the overall index into a pipeline: new segments were added at the top, and older segments were removed from the bottom and placed into an archive.

While the overall index was quite big (about 750GB if you must know), we usually had only about 45 days’ worth of posts searchable, which was fine because our default sort order was reverse chronological and we could satisfy most searches within that window. The 45 days of indices (about 66 index segments) were spread across five machines, and this whole setup was replicated for redundancy and load balancing.

In front of those there were five search gateways which would handle searches coming from the front end. The search gateways would autoconfigure themselves, searching the local net for search engines, figuring out what was there, and putting together a map of the index segments, presenting that as a single index to the front end. So search machines could go down and come back up and stuff would just work, and I did not need to tell the search gateways where the indices were located.
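The rolling pipeline of fixed-size segments can be sketched in a few lines. This is a minimal illustration, not Feedster's actual code; the `SegmentPipeline` class, its names, and the toy capacities are my own assumptions.

```python
from collections import deque

SEGMENT_CAPACITY = 1_000_000  # entries per index segment

class SegmentPipeline:
    """Rolling pipeline of index segments: new entries fill the newest
    segment; when it fills up a fresh segment is started at the top, and
    segments beyond the retention window fall off the bottom into an
    archive. (Illustrative sketch only.)"""

    def __init__(self, max_segments, capacity=SEGMENT_CAPACITY):
        self.max_segments = max_segments
        self.capacity = capacity
        self.segments = deque()   # newest segment at index 0
        self.archived = []        # segments removed from the bottom

    def add_entry(self, entry):
        # Start a new segment when there is none yet or the newest is full.
        if not self.segments or len(self.segments[0]) >= self.capacity:
            self.segments.appendleft([])
            # Evict the oldest segment once the window is exceeded.
            if len(self.segments) > self.max_segments:
                self.archived.append(self.segments.pop())
        self.segments[0].append(entry)
```

With a 45-day window of roughly 66 segments, the live deque stays bounded while crawling continues indefinitely; everything older simply moves to the archive.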
The search engines themselves would create local caches of search results for each index segment; any cache file older than seven days would be deleted to make space for new cache files. One feature I implemented in the cache was a switch which allowed me to cache either the whole search or the components of the search. For example, searches could be restricted to a set of blogs, and that restriction was itself implemented as a search, so it made sense to cache the search and the restriction separately so that each could be reused by other searches containing either component.
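The component-level caching idea can be sketched as caching each sub-search under its own key and intersecting the cached result sets. Again, this is a hypothetical illustration under my own naming (`ComponentCache`, `run_search`); the real cache worked on per-segment cache files.

```python
import hashlib
import time

CACHE_TTL = 7 * 24 * 3600  # cache entries older than seven days are evicted

class ComponentCache:
    """Cache search components independently so, e.g., a blog-restriction
    result set can be reused by any other search carrying the same
    restriction. (Illustrative sketch only.)"""

    def __init__(self):
        self._store = {}  # key -> (timestamp, result set)

    @staticmethod
    def key(component):
        return hashlib.sha1(component.encode()).hexdigest()

    def get(self, component):
        hit = self._store.get(self.key(component))
        if hit is None or time.time() - hit[0] > CACHE_TTL:
            return None
        return hit[1]

    def put(self, component, results):
        self._store[self.key(component)] = (time.time(), results)

def run_search(query, restriction, cache, execute):
    """Look up the query and the restriction separately, executing and
    caching each component on a miss, then intersect the result sets."""
    q = cache.get(query)
    if q is None:
        q = execute(query)
        cache.put(query, q)
    r = cache.get(restriction)
    if r is None:
        r = execute(restriction)
        cache.put(restriction, r)
    return q & r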
In addition I implemented automatic index pruning, where the search engine (and the search gateway) would know the order of the index, i.e. reverse chronological, and would search each index segment in turn until the search was satisfied. Users could also control this index pruning from within the search through search modifiers. In fact I implemented a long list of search modifiers, most of which we did not document on the site for one reason or another.
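Because the segments are ordered newest-first and results are sorted reverse chronologically, the pruning logic amounts to a simple early exit. A minimal sketch, assuming a hypothetical `matches` function that runs the query against one segment:

```python
def pruned_search(segments, matches, wanted):
    """Search segments newest-first and stop as soon as enough hits are
    found; older segments are never touched. (Illustrative sketch only.)"""
    hits = []
    for segment in segments:   # segments ordered newest first
        hits.extend(matches(segment))
        if len(hits) >= wanted:
            break              # prune: skip the remaining older segments
    return hits[:wanted]
```

For common queries this means only the first segment or two are ever searched, which is exactly why pruning pairs so well with a reverse-chronological default sort.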
At peak times the search engine was handling about 1.5 million searches a day; since the searches were spread across pairs of machines, each machine was handling 750,000 searches a day over an index of about 14 million entries.