Stop words and minimum term length

This post on stop words and minimum term length by Peter Zaitsev reminded me of some search engine do’s and don’ts that I posted back in August last year.

To summarize:

Stop lists are evil; don’t use them. Modern machines have enough capacity to index, store and search very large quantities of text. Typically I have found that adding a stop word list makes only about a 5% difference in index size.

There should be no minimum term length; you want to be able to search for “Vitamin A”.

Case is important. The approach I take is to index all terms in lowercase, and also index mixed case terms as they are. Search is always done in the case supplied by the user, so “New York Times” would only find documents which contain capitalized terms, and “new york times” would find all documents which contain the terms regardless of case and capitalization.
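As a rough illustration of the case handling described above, here is a minimal sketch. The names build_index and search, the whitespace tokenizer and the sample documents are all assumptions made for the example, and it does a simple AND match across terms rather than a true phrase match.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document IDs that contain it.

    Every token is indexed in lowercase; tokens that contain uppercase
    letters are also indexed exactly as written.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():  # naive whitespace tokenizer (assumption)
            index[token.lower()].add(doc_id)
            if token != token.lower():
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term, in the case given."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample documents.
docs = {
    1: "The New York Times reported it",
    2: "a new york times article",
}
index = build_index(docs)
```

With these two documents, searching for "New York Times" matches only document 1, while "new york times" matches both, because the capitalized forms are stored in addition to the lowercase ones.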

Tokenization is important; check my original post on that.

Plural stemming is the way to go; anything more aggressive (like Porter or Lovins) will just increase the ‘noise’ in the search results.
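To make "plural stemming" concrete, here is a minimal sketch of a stemmer that only strips common English plural endings. The function name plural_stem and the exact rule set are assumptions chosen for the example, not the rules used in any particular engine, and the stemmer is deliberately conservative (it will turn "boxes" into "boxe", for instance).

```python
def plural_stem(term):
    """Strip common English plural endings; leave everything else alone."""
    if len(term) <= 3:
        return term  # keep short terms such as "gas" or "is" intact
    if term.endswith("ies") and not term.endswith(("eies", "aies")):
        return term[:-3] + "y"  # queries -> query
    if term.endswith("es") and not term.endswith(("aes", "ees", "oes")):
        return term[:-1]  # engines -> engine
    if term.endswith("s") and not term.endswith(("us", "ss")):
        return term[:-1]  # vitamins -> vitamin
    return term
```

Unlike Porter or Lovins, this leaves forms such as "searching" or "searched" untouched, so unrelated words are much less likely to be conflated.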

There is more in the post and I should revise it sometime, maybe this weekend.


3 Responses to Stop words and minimum term length

  1. You’re right, stop words generally do not increase index size dramatically. The problem is they can slow down execution significantly.

    Think about intersecting lists of 50 million document IDs for the words “a” and “the”.

    Unless the search engine has special treatment for high-frequency words, this can slow things down a lot.

  2. I agree with you: intersecting 50 million document IDs would not be feasible, but there are a number of approaches you can take to deal with this issue.

    Architectural approaches would include sharding data across a number of machines and making sure that indices can be searched in memory without hitting storage. Within a machine you can also shard data further, taking advantage of threading and multi-CPU machines.

    Algorithmic approaches would include document ID list truncation. For example, a search for ‘Vitamin A’ would likely match fewer documents for ‘Vitamin’ than for ‘A’, so you could limit the range of document IDs you check for the latter term based on the document IDs you get for the former term.

    You could also apply some smarts whereby you drop the search for ‘A’ if you searched for ‘A Vitamin’. This is somewhat more complex to implement for obvious reasons, and I would steer clear of that one.

    In the past I have experimented with automatically turning a term into a stop term based on frequency and distribution and that worked well, so with a search like ‘sitting in the laps of the gods’, you could easily drop ‘in’, ‘the’ and ‘of’. Of course you could not do this for phrases.

  3. noel says:

    I’m not an expert like you guys, but I am a search engine user and here’s my opinion on stop words. I think they should definitely be indexed but discarded during the search unless quoted. I remember typing dr who into feedster and being extremely annoyed that even “dr who” would not yield the results I was after. Stop words aren’t always stop words, as you say. Context is crucial. Filter the query, not the index.
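
The document ID list truncation idea from the second response can be sketched roughly as follows. This is only an illustration under stated assumptions: each term's document IDs are held as a sorted list of unique integers, and the function name intersect and the sample lists are made up for the example. The shorter list drives the lookup, so only the slice of the longer list that falls within its ID range is examined, using binary search rather than a full scan.

```python
from bisect import bisect_left


def intersect(rare, common):
    """Intersect two sorted document ID lists, driven by the shorter one."""
    if len(rare) > len(common):
        rare, common = common, rare
    if not rare:
        return []
    result = []
    lo = bisect_left(common, rare[0])  # skip IDs below the first rare ID
    for doc_id in rare:
        # Search only the part of the long list not yet ruled out.
        lo = bisect_left(common, doc_id, lo)
        if lo == len(common):
            break  # no IDs remain at or above this one
        if common[lo] == doc_id:
            result.append(doc_id)
    return result


# Hypothetical posting lists: 'Vitamin' is rare, 'A' is very common.
vitamin = [12, 40, 97]
a = list(range(0, 100, 2))
matches = intersect(vitamin, a)
```

The cost here grows with the length of the short list times the logarithm of the long one, which is why a high-frequency term like ‘A’ hurts far less when it is paired with a rarer term.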
