Search engine pet peeves

Every day I come across sites with a search box where you can enter a query and get some results. And every day I am disappointed by the functionality (or rather the lack thereof) behind it.

So here is a list of pet peeves:

Don’t use stop lists. Stop lists are bad because they throw away important information. At some point they were useful because computers had very limited resources in terms of CPU cycles and disk space, but no more. I want to be able to retrieve documents which contain “to be or not to be” or “Vitamin A”, neither of which is possible if stop lists are in effect.
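
To make that concrete, here is a minimal sketch of what a stop list does to those two queries. The stop list and the index_terms helper are made up for illustration and are not taken from any particular engine.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "be", "not", "or", "the", "to"})


def index_terms(text, stop_words=frozenset()):
    """Lowercase the text, split on whitespace and drop any stop words."""
    return [term for term in text.lower().split() if term not in stop_words]


# With the stop list in effect the Hamlet query vanishes entirely,
# and "Vitamin A" can no longer be told apart from plain "vitamin".
hamlet_terms = index_terms("To be or not to be", STOP_WORDS)   # []
vitamin_terms = index_terms("Vitamin A", STOP_WORDS)           # ['vitamin']

# Without a stop list every term survives and can be matched.
all_terms = index_terms("Vitamin A")                           # ['vitamin', 'a']
```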

The default operator between terms should be “AND”. OK, this one is conditional. For a search engine which relies on links to calculate document relevance (such as Google, and others) this is not needed, and in fact it may even hurt. But for search engines which search a batch of documents and rank them according to a tf.idf measure, ANDing the terms is a good thing because the search results get more precise as the user adds more terms. Which makes sense. I have seen search engines out there which OR the search terms, so that the search becomes less and less precise as terms are added.
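
The difference is easy to see on a toy inverted index. This is only a sketch: the three documents and the build_index and search helpers are invented for the example, and there is no ranking at all, just the matching step.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query, operator="AND"):
    """Return the ids matching all terms (AND) or any term (OR)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


# Made-up documents, for illustration only.
docs = {
    1: "cheap flights to paris",
    2: "cheap hotels in rome",
    3: "flights to rome",
}
index = build_index(docs)

one_term = search(index, "cheap")                   # {1, 2}
two_terms_and = search(index, "cheap flights")      # {1}
two_terms_or = search(index, "cheap flights", "OR") # {1, 2, 3}
```

Adding “flights” to the query narrows the AND result from two documents to one, while the OR result grows to all three.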

Detecting phrases in documents and ranking those documents higher is a very good idea. Going back to our phrase “to be or not to be”, these are all terms which occur just about anywhere in a document, but in that sequence they are very important. Well, they are very important to Hamlet.
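
Here is a rough sketch of the idea, assuming a plain term-count score with a fixed bonus when the query terms appear as a contiguous phrase. The rank function, the bonus value of 10 and the two documents are all assumptions made for the example; a real engine would use term positions stored in the index rather than rescanning the text.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def rank(docs, query, phrase_boost=10.0):
    """Order document ids by term count plus a bonus for an exact phrase."""
    query_tokens = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        score = float(sum(tokens.count(t) for t in set(query_tokens)))
        if contains_phrase(tokens, query_tokens):
            score += phrase_boost  # arbitrary bonus, an assumption
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Made-up documents: "memo" repeats the terms more often,
# but only "hamlet" has them in sequence.
docs = {
    "hamlet": "to be or not to be that is the question",
    "memo": "to to to be be be or or not not",
}

with_boost = rank(docs, "to be or not to be")          # ['hamlet', 'memo']
without_boost = rank(docs, "to be or not to be", 0.0)  # ['memo', 'hamlet']
```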

Google has conditioned us to a set of operators to control the search, such as the use of quotes to indicate phrases, ‘+’ to indicate a required term, and ‘-’ to indicate an unwanted term. Your search engine may (and probably does) support lots of operators, but it should support the Google ones, even if this is implemented as a search translation layer.
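
A translation layer can start out very small. The sketch below only parses the three operators mentioned above into buckets; the parse_query name and the dictionary layout are my own invention, and mapping the result onto your engine’s native query syntax is left out.

```python
import re

# An optional +/- sign followed by either a "quoted phrase" or a bare word.
TOKEN_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a Google-style query into required, excluded and optional parts."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN_RE.findall(query):
        text = phrase if phrase else word
        if not text:
            continue  # skip empty quotes such as ""
        bucket = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[bucket].append(text.lower())
    return parsed


parsed = parse_query('+hamlet "to be or not to be" -macbeth')
# {'required': ['hamlet'],
#  'excluded': ['macbeth'],
#  'optional': ['to be or not to be']}
```

A hyphen inside a word (as in well-known) is left alone, because the sign is only recognized at the start of a token.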

Tokenization is also important, and needs to be more than just breaking on spaces and punctuation. Good examples of this include such terms as ‘.net’ or ‘C++’, or even better ‘asp.net’. A more complicated one would be ‘foo@bar.com’: I might want to be able to search on the whole term, as well as on ‘foo’ or ‘bar.com’, and get that document in all cases. And I would not even bother using a dictionary to check whether ‘asp.net’ is one term or should be broken into two tokens; that quickly becomes a maintenance nightmare.
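
One dictionary-free way to get there is to index the whole token and its parts side by side. The tokenize function below is a sketch under that assumption; the exact punctuation it strips and the choice to split on ‘@’ before ‘.’ are my own simplifications.

```python
def tokenize(text):
    """Emit each whole token, plus its parts when it contains '@' or '.'."""
    tokens = []
    for raw in text.lower().split():
        # Strip surrounding quotes and brackets and trailing sentence
        # punctuation, but keep a leading '.' (.net) and trailing '+' (c++).
        token = raw.lstrip("\"'([{<").rstrip("\"')]}>.,;:!?")
        if not token:
            continue
        tokens.append(token)
        if "@" in token:
            parts = [p for p in token.split("@") if p]
        else:
            parts = [p for p in token.split(".") if p]
        # Only add parts when the token really breaks into several pieces.
        if len(parts) > 1:
            tokens.extend(parts)
    return tokens


tokens = tokenize("Email foo@bar.com about asp.net, .net and C++.")
# ['email', 'foo@bar.com', 'foo', 'bar.com', 'about',
#  'asp.net', 'asp', 'net', '.net', 'and', 'c++']
```

Since ‘asp.net’ is indexed both whole and in parts, there is no need to decide up front which reading is the right one.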

Stemming is important, and there are lots of good stemmers out there. I would just use a plural stemmer, anything more just generates more recall than you need.
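
For reference, a plural-only stemmer fits in a dozen lines. The rules below are an assumption on my part, a handful of suffix checks in the style of the simple “s” stemmers, and they will get some words wrong (‘boxes’ becomes ‘boxe’). That matters less than it looks, as long as the same function is applied to both the documents and the query.

```python
def plural_stem(word):
    """Reduce common English plurals to a singular-like form."""
    w = word.lower()
    if len(w) <= 3:
        return w  # too short to strip safely (gas, bus, is)
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"   # queries -> query
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-1]         # engines -> engine
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]         # documents -> document
    return w                  # class, status, searching are left alone


stems = [plural_stem(w) for w in ["queries", "engines", "documents", "class"]]
# ['query', 'engine', 'document', 'class']
```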

Term highlighting and keywords in context are very important for the user to determine whether the document is relevant or not without having to check the document itself. The less work you impose on your users, the more they will use your system.
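
A bare-bones version of keywords in context might look like the sketch below. The keyword_in_context function, the window width and the ‘**’ markers are all placeholders I picked for the example, and the word-boundary matching would need more care for terms like ‘C++’.

```python
import re


def keyword_in_context(text, terms, width=30, mark="**"):
    """Return a snippet around the first hit, with every hit in it marked."""
    if not terms:
        return None
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    first = pattern.search(text)
    if first is None:
        return None
    start = max(0, first.start() - width)
    end = min(len(text), first.end() + width)
    # Rebuild the window piece by piece, wrapping each hit that fits in it.
    pieces, pos = [], start
    for m in pattern.finditer(text):
        if m.start() < start or m.end() > end:
            continue
        pieces.append(text[pos:m.start()])
        pieces.append(mark + m.group(0) + mark)
        pos = m.end()
    pieces.append(text[pos:end])
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + "".join(pieces) + suffix


snippet = keyword_in_context(
    "The quick brown fox jumps over the lazy dog", ["fox"], width=6
)
# '...brown **fox** jumps...'
```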

Finally, speed. Take too long to return search results and your users will vote with their feet. Two seconds is OK, less is ideal and more is death. Which brings us back to “to be or not to be”.
