<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: Stop words and minimum term length</title>
	<atom:link href="http://fschiettecatte.wordpress.com/2008/02/09/stop-words-and-minimum-term-length/feed/" rel="self" type="application/rss+xml" />
	<link>http://fschiettecatte.wordpress.com/2008/02/09/stop-words-and-minimum-term-length/</link>
	<description>Thoughts from the edge of the 'net</description>
	<pubDate>Mon, 13 Oct 2008 11:56:50 +0000</pubDate>
	<generator>http://wordpress.org/?v=MU</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: noel</title>
		<link>http://fschiettecatte.wordpress.com/2008/02/09/stop-words-and-minimum-term-length/#comment-4107</link>
		<dc:creator>noel</dc:creator>
		<pubDate>Thu, 21 Feb 2008 19:03:33 +0000</pubDate>
		<guid isPermaLink="false">http://fschiettecatte.wordpress.com/?p=404#comment-4107</guid>
		<description>I'm not an expert like you guys, but I am a search engine user, and here's my opinion on stop words. I think they should definitely be indexed, but discarded at search time unless quoted. I remember typing dr who into Feedster and being extremely annoyed that even "dr who" would not yield the results I was after. Stop words aren't always stop words, as you say. Context is crucial. Filter the query, not the index.</description>
		<content:encoded><![CDATA[<p>I&#8217;m not an expert like you guys, but I am a search engine user, and here&#8217;s my opinion on stop words. I think they should definitely be indexed, but discarded at search time unless quoted. I remember typing dr who into Feedster and being extremely annoyed that even &#8220;dr who&#8221; would not yield the results I was after. Stop words aren&#8217;t always stop words, as you say. Context is crucial. Filter the query, not the index.</p>
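<p>A minimal sketch of "filter the query, not the index", assuming the index holds every word and that quoted phrases are matched verbatim. The stop word list and the function name filter_query are hypothetical, chosen only for illustration.</p>

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "who"}

def filter_query(query):
    """Split a query into quoted phrases (kept verbatim) and loose terms
    (stop words removed). The index is assumed to contain every word."""
    # Quoted phrases are kept whole, stop words included.
    phrases = re.findall(r'"([^"]+)"', query)
    # Everything outside the quotes is treated as loose terms.
    loose = re.sub(r'"[^"]*"', " ", query).split()
    terms = [t for t in loose if t.lower() not in STOP_WORDS]
    return phrases, terms

print(filter_query('dr who'))            # ([], ['dr'])
print(filter_query('"dr who" tardis'))   # (['dr who'], ['tardis'])
```

<p>Because the stop words stay in the index, the quoted form can still be answered as a phrase search.</p>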
]]></content:encoded>
	</item>
	<item>
		<title>By: François Schiettecatte</title>
		<link>http://fschiettecatte.wordpress.com/2008/02/09/stop-words-and-minimum-term-length/#comment-4102</link>
		<dc:creator>François Schiettecatte</dc:creator>
		<pubDate>Sat, 09 Feb 2008 18:06:14 +0000</pubDate>
		<guid isPermaLink="false">http://fschiettecatte.wordpress.com/?p=404#comment-4102</guid>
		<description>I agree with you, intersecting 50 million document IDs would not be feasible, but there are a number of approaches you can take to deal with this issue.

Architectural approaches would include sharding data across a number of machines and making sure that indices can be searched in memory without hitting storage. Within a machine, you can shard the data further to take advantage of threading and multiple CPUs.

Algorithmic approaches would include document ID list truncation. For example, a search for 'Vitamin A' would likely generate fewer documents for 'Vitamin' than for 'A', so you could limit the range of document IDs you check for the latter term based on the document IDs you get for the former term.

You could also apply some smarts and drop the search for 'A' if the query was 'A Vitamin'. This is somewhat more complex to implement for obvious reasons, and I would steer clear of that one.

In the past I have experimented with automatically turning a term into a stop term based on its frequency and distribution, and that worked well. With a search like 'sitting in the laps of the gods', you could easily drop 'in', 'the' and 'of'. Of course, you could not do this for phrases.</description>
		<content:encoded><![CDATA[<p>I agree with you, intersecting 50 million document IDs would not be feasible, but there are a number of approaches you can take to deal with this issue.</p>
<p>Architectural approaches would include sharding data across a number of machines and making sure that indices can be searched in memory without hitting storage. Within a machine, you can shard the data further to take advantage of threading and multiple CPUs.</p>
<p>Algorithmic approaches would include document ID list truncation. For example, a search for &#8216;Vitamin A&#8217; would likely generate fewer documents for &#8216;Vitamin&#8217; than for &#8216;A&#8217;, so you could limit the range of document IDs you check for the latter term based on the document IDs you get for the former term.</p>
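<p>A rough sketch of that truncation idea, assuming each term's posting list is a sorted Python list of integer document IDs. The function name intersect_truncated and the sample postings are hypothetical, and a real engine would likely use compressed lists with skip pointers rather than plain lists.</p>

```python
from bisect import bisect_left, bisect_right

def intersect_truncated(rare, frequent):
    """Intersect two sorted doc ID lists, searching only the part of
    `frequent` that falls inside the ID range covered by `rare`."""
    if not rare or not frequent:
        return []
    # Truncate the frequent list to the window [rare[0], rare[-1]].
    lo = bisect_left(frequent, rare[0])
    hi = bisect_right(frequent, rare[-1])
    result = []
    for doc_id in rare:
        # Binary search inside the remaining window only.
        pos = bisect_left(frequent, doc_id, lo, hi)
        if pos < hi and frequent[pos] == doc_id:
            result.append(doc_id)
        lo = pos  # later IDs in `rare` are larger, so never look back
    return result

vitamin = [120, 4500, 4507, 9000]   # hypothetical postings for 'Vitamin'
a = list(range(0, 20000, 3))        # hypothetical postings for 'A'
print(intersect_truncated(vitamin, a))  # [120, 4500, 9000]
```

<p>The work is driven by the shorter list, so the cost grows with the number of 'Vitamin' documents rather than with the number of 'A' documents.</p>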
<p>You could also apply some smarts and drop the search for &#8216;A&#8217; if the query was &#8216;A Vitamin&#8217;. This is somewhat more complex to implement for obvious reasons, and I would steer clear of that one.</p>
<p>In the past I have experimented with automatically turning a term into a stop term based on its frequency and distribution, and that worked well. With a search like &#8216;sitting in the laps of the gods&#8217;, you could easily drop &#8216;in&#8217;, &#8216;the&#8217; and &#8216;of&#8217;. Of course, you could not do this for phrases.</p>
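<p>One way this could look, as a sketch only: it models frequency alone (not distribution), and the 0.5 cutoff, the document counts and the function name drop_frequent_terms are all made-up assumptions. It applies to loose terms, not to quoted phrases.</p>

```python
def drop_frequent_terms(terms, doc_freq, total_docs, max_ratio=0.5):
    """Keep query terms that appear in at most max_ratio of all documents.
    Falls back to the original terms if every term would be dropped."""
    kept = [t for t in terms if doc_freq.get(t, 0) / total_docs <= max_ratio]
    return kept or list(terms)

# Hypothetical document frequencies in a 1,000,000 document index.
doc_freq = {"sitting": 40_000, "in": 900_000, "the": 990_000,
            "laps": 1_200, "of": 950_000, "gods": 8_000}
query = "sitting in the laps of the gods".split()
print(drop_frequent_terms(query, doc_freq, total_docs=1_000_000))
# ['sitting', 'laps', 'gods']
```

<p>The fallback matters for queries made up entirely of frequent words, where dropping every term would leave nothing to search for.</p>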
]]></content:encoded>
	</item>
	<item>
		<title>By: Peter Zaitsev</title>
		<link>http://fschiettecatte.wordpress.com/2008/02/09/stop-words-and-minimum-term-length/#comment-4098</link>
		<dc:creator>Peter Zaitsev</dc:creator>
		<pubDate>Sat, 09 Feb 2008 15:51:41 +0000</pubDate>
		<guid isPermaLink="false">http://fschiettecatte.wordpress.com/?p=404#comment-4098</guid>
		<description>You're right, stop words generally do not increase index size dramatically.  The problem is they can slow down execution significantly.

Think about intersecting 50-million-entry document ID lists for the words "a" and "the".

Unless the search engine has special treatment for high-frequency words, this can slow things down a lot.</description>
		<content:encoded><![CDATA[<p>You&#8217;re right, stop words generally do not increase index size dramatically.  The problem is they can slow down execution significantly.</p>
<p>Think about intersecting 50-million-entry document ID lists for the words &#8220;a&#8221; and &#8220;the&#8221;.</p>
<p>Unless the search engine has special treatment for high-frequency words, this can slow things down a lot.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
