Federated Search

I was prompted to write about federated search after reading Greg Linden’s article on why Google might be rejecting federated search, and after taking a closer look at OpenSearch while adding it to a project I am currently working on. Also, about three years ago, I built a federated search product for a client, and while the theory behind that project was sound, the implementation turned out to be tricky for reasons I will explain below.

To quote from Greg Linden’s article:

Federated search (or metasearch) is when a search query is sent out to many other search engines, then the results merged and reranked.

I would add to this that the search engines must be different, i.e. they must implement different search syntaxes, different ranking algorithms and different schemas. These challenges are solved from the get-go if all the search engines are the same; you will still need to deal with search engine selection, latency and reliability, but those are less challenging to solve.

If federated searching is so challenging, why would you even want to do it? The main reason is that no single search engine has access to all the data you want to search: the data is spread across search engines, and amalgamating it all in one place is not possible for economic or political reasons. You may also want to find all the data available on a particular subject, hence the need to search multiple search engines.

The first issue to confront is the protocol. A number of protocols are currently available. OpenSearch, coming out of A9, extends RSS, making it very easy to support since producing and consuming that format is well understood. SRW/SRU and MXG are two others; I am not as familiar with them, but this article from Thom Hickey talks about them.

Going back in history, we also find STARTS, WAIS and Z39.50. STARTS is very interesting because it returns term and document metadata along with the search results, enough metadata in fact for the federated search engine to rank the documents using whatever flavor of tf.idf it implements. Unfortunately, when the STARTS spec came out, a patent awarded to Steve Kirsch at Infoseek covered this method, making it difficult to use. WAIS is based on Z39.50, and both go back a long way. WAIS only supported normalized rankings for documents, so these could not be compared across search engines. I think there may be support for ranking in Z39.50 but I am not sure; I am fairly sure SRW/SRU, which comes out of the same community that produced Z39.50, does. Lastly, there is good ol’ screen scraping, which is used when a search engine does not present an API. This is by far the ugliest approach and should be avoided at all costs: it is very brittle since it relies on the search engine generating the same HTML all the time.
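
To make the OpenSearch case concrete, here is a minimal sketch in Python of filling in an OpenSearch URL template and parsing the RSS response. The template string and the endpoint behind it are hypothetical, and I am assuming the OpenSearch 1.0 RSS namespace for the extra elements:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# OpenSearch 1.0 responses are RSS 2.0 with a few extra elements in the
# opensearchrss namespace (totalResults, startIndex, itemsPerPage).
OS_NS = "{http://a9.com/-/spec/opensearchrss/1.0/}"

def opensearch_query(template, terms):
    # `template` comes from the engine's OpenSearch description document,
    # e.g. "http://example.com/search?q={searchTerms}" (hypothetical).
    url = template.replace("{searchTerms}", urllib.parse.quote(terms))
    with urllib.request.urlopen(url, timeout=10) as response:
        channel = ET.parse(response).getroot().find("channel")
    total = channel.findtext(OS_NS + "totalResults")
    results = [(item.findtext("title"), item.findtext("link"))
               for item in channel.findall("item")]
    return total, results
```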

The second issue to confront is search syntax. If you are very, very lucky, all the search engines support the same search syntax, the Google search syntax being a popular one. More likely, the search engines will support different search syntaxes and you will need to map your own search syntax to the native search syntax of each engine. While this may sound tractable at first glance, it quickly becomes a quagmire because support for search syntax features is very uneven (trust me, been there, lots of grey hair, etc…). The only way to deal with this issue is to create your own reference search syntax and map it to each search engine’s syntax. For what it’s worth, I have found that using XSLT is a good way to map searches from one search syntax to another.
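
To illustrate the reference-syntax idea (in Python rather than XSLT), here is a toy sketch: the query is parsed once into a small tree, then serialized per engine. The two target syntaxes below are made up for the example:

```python
from dataclasses import dataclass

@dataclass
class Term:
    field: str   # reference field name, e.g. "title"
    value: str

@dataclass
class And:
    left: object
    right: object

def to_google_style(node):
    # Google-like syntax: intitle: prefix, AND is implicit.
    if isinstance(node, Term):
        return f"intitle:{node.value}" if node.field == "title" else node.value
    return f"{to_google_style(node.left)} {to_google_style(node.right)}"

def to_boolean_style(node):
    # A fielded boolean syntax, e.g. title=(foo) AND text=(bar).
    if isinstance(node, Term):
        return f"{node.field}=({node.value})"
    return f"{to_boolean_style(node.left)} AND {to_boolean_style(node.right)}"

query = And(Term("title", "federated"), Term("text", "search"))
print(to_google_style(query))   # intitle:federated search
print(to_boolean_style(query))  # title=(federated) AND text=(search)
```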

This dovetails nicely into the third issue, namely schema. By schema, I mean the search fields supported by a search engine, such as ‘title’, ‘text’, ‘url’, etc… Search engines will support a wide variety of fields, but there is a (small) core set which most support. Again, mapping from your own reference schema to each search engine’s schema is the way to go.
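
In practice this can be as simple as a per-engine field map; the engine names and field names below are hypothetical, with None marking a field the engine has no equivalent for:

```python
# Map reference field names to each engine's native field names.
FIELD_MAPS = {
    "engine_a": {"title": "dc.title", "text": "any", "url": "dc.identifier"},
    "engine_b": {"title": "ti", "text": "tx", "url": None},
}

def map_field(engine, field):
    # Returns the engine's native field name, or None if unsupported,
    # in which case the clause is dropped or falls back to full text.
    return FIELD_MAPS[engine].get(field)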

The fourth issue is that of search routing. If you have 100+ search engines in your list of searchable engines, it makes sense to limit the search to those engines where you are likely to find documents. In a system where you have no information about what each search engine contains, you will need to search all of them; but if you have an idea of what they contain, you can choose whether to search each one, which alleviates the search load on both the individual search engines and the federated search engine.
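
A sketch of that routing decision, under the assumption that we keep a coarse term or topic profile per engine and fall back to searching everything when no profile matches:

```python
def select_engines(query_terms, profiles, all_engines):
    """profiles maps engine name -> set of terms/topics it is known to cover;
    engines with no profile are always candidates."""
    selected = [
        engine for engine in all_engines
        if engine not in profiles or profiles[engine] & set(query_terms)
    ]
    # If nothing matched, we have no basis to exclude anyone: search all.
    return selected or list(all_engines)
```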

Which brings up the fifth issue, search engine metadata. A search engine may share metadata about the documents it contains; STARTS supported this by allowing the term dictionary to be shared, so you could see whether a term existed in a search engine before actually searching it. I have also come across research in this area where hashes and signature files are used to share metadata. Issues with this approach include the currency of the metadata and the size of the metadata shared.
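
As an illustration of the hashes-and-signatures idea (not any particular protocol), here is a toy Bloom-filter-style signature of a term dictionary: the engine publishes the bit array, and the federator tests query terms against it before deciding to search. False positives are possible; false negatives are not, which is what makes it safe for routing:

```python
import hashlib

class TermSignature:
    def __init__(self, size_bits=1 << 20, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, term):
        # Derive k bit positions from salted hashes of the term.
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(term))
```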

The sixth issue is ranking. When you get sets of search results from various sources, they may need to be merged and ranked for presentation to the user. Unless the search engines share enough metadata, this will be difficult. Clustering is an option, but it has not proven popular with consumers, though that will likely not be the case with professional searchers.
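
Absent shared metadata, a common fallback is to min-max normalize each engine’s scores to [0, 1] and then sort globally. This is only a heuristic, not real score comparability; a sketch:

```python
def merge_results(result_sets):
    """result_sets: list of [(doc, score), ...], one list per engine."""
    merged = []
    for results in result_sets:
        if not results:
            continue
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid divide-by-zero on uniform scores
        merged.extend((doc, (score - lo) / span) for doc, score in results)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```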

The seventh issue is latency. You will need to decide how long the searches are allowed to run so that slow search engines don’t slow down the entire federated search. The best way to handle this is to either set some sort of reasonable timeout, say 5-10 seconds, or allow the user to set that timeout when they issue the search. And, if possible, report on the searches that failed, so the user gets a sense of the success or failure of the search.
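
One way to implement this is to fan the query out concurrently with a per-search timeout and collect the names of the engines that failed. In this sketch, `engine.search` and `engine.name` are stand-ins for whatever the real protocol client exposes:

```python
import asyncio

async def federated_search(engines, query, timeout=10.0):
    async def one(engine):
        try:
            return engine.name, await asyncio.wait_for(engine.search(query),
                                                       timeout)
        except (asyncio.TimeoutError, Exception):
            # Timeout or protocol error: record the failure, keep going.
            return engine.name, None

    outcomes = await asyncio.gather(*(one(e) for e in engines))
    results = {name: r for name, r in outcomes if r is not None}
    failed = [name for name, r in outcomes if r is None]
    return results, failed  # report `failed` back to the user
```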

Finally, the eighth issue is reliability: if a search engine is consistently slow and/or unreliable, it should be taken out of the list of available search engines.
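
One way to implement that drop rule is to keep a rolling failure rate per engine and exclude engines above a threshold; the window size and threshold here are arbitrary assumptions:

```python
from collections import deque

class EngineHealth:
    def __init__(self, window=50, max_failure_rate=0.5):
        self.outcomes = deque(maxlen=window)  # True = success
        self.max_failure_rate = max_failure_rate

    def record(self, success):
        self.outcomes.append(success)

    def available(self):
        # No history yet: give the engine the benefit of the doubt.
        if not self.outcomes:
            return True
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) <= self.max_failure_rate
```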

In conclusion, this is not an easy problem to tackle. As I mentioned above, I have built a federated search engine, and only had to deal with issues one, two, three, seven and eight. It still took 5 months to implement. The biggest challenge was that the federated search engine had to do everything using screen scraping, which turned out to be a maintenance nightmare.
