Stop words and minimum term length
February 9, 2008
Stop lists are evil; don’t use them. Modern machines have enough capacity to index, store and search over very large quantities of text. In my experience, adding a stop word list typically changes the index size by only about 5%.
There should be no minimum term length: you want to be able to search for “Vitamin A”.
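To make the “Vitamin A” problem concrete, here is a minimal Python sketch. The `tokenize` helper, its `min_length` and `stop_words` parameters, and the tiny stop list are all hypothetical, made up for illustration; they are not taken from any particular search engine.

```python
import re

# Hypothetical, deliberately tiny stop list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "to"})

def tokenize(text, min_length=1, stop_words=frozenset()):
    """Split text into lowercase terms, optionally dropping short terms and stop words."""
    terms = re.findall(r"\w+", text.lower())
    return [t for t in terms if len(t) >= min_length and t not in stop_words]

# No filtering: both terms survive, so the query can match on "a".
print(tokenize("Vitamin A"))                         # ['vitamin', 'a']
# A minimum term length of 2 silently drops "a".
print(tokenize("Vitamin A", min_length=2))           # ['vitamin']
# A stop list containing "a" has the same effect.
print(tokenize("Vitamin A", stop_words=STOP_WORDS))  # ['vitamin']
```

With either filter switched on, a search for “Vitamin A” degrades into a search for “vitamin”, which is presumably not what the user wanted.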
Case is important. The approach I take is to index every term in lowercase and, in addition, to index mixed-case terms as they are written. Search is always done in the case supplied by the user, so “New York Times” would only find documents which contain the capitalized terms, while “new york times” would find all documents which contain the terms, regardless of capitalization.
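A rough Python sketch of that case-handling scheme follows. It is only an illustration under some assumptions: the function names (`index_terms`, `build_index`, `search`), the simple `\w+` tokenizer, the two sample documents, and the choice of a plain AND query rather than a true phrase query are all mine, not a description of a real engine.

```python
import re
from collections import defaultdict

def index_terms(token):
    """Terms to index for one token: always the lowercase form,
    plus the token as written if it contains any capitals."""
    lowered = token.lower()
    return {lowered} if token == lowered else {lowered, token}

def build_index(docs):
    """Map each indexed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text):
            for term in index_terms(token):
                index[term].add(doc_id)
    return index

def search(index, query):
    """AND query using the terms exactly as the user typed them (no case folding)."""
    terms = re.findall(r"\w+", query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

# Hypothetical sample documents.
docs = {
    1: "I read the New York Times today",
    2: "new york times are changing",
}
index = build_index(docs)
print(search(index, "New York Times"))  # {1}
print(search(index, "new york times"))  # {1, 2}
```

The cost is an extra index entry for each capitalized form, and in return the user can choose between a strict and a relaxed match simply by how they type the query.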
Tokenization is important; check my original post on that.
There is more in the post and I should revise it sometime, maybe this weekend.