François Schiettecatte’s Blog

August 27, 2007

Search engine pet peeves

Filed under: Search — François Schiettecatte @ 6:27 pm

Every day I come across sites which have a search box where you can enter a search and get some results. And every day I am disappointed by the functionality (rather the lack thereof) implemented within.

So here is a list of pet peeves:

Don’t use stop lists. Stop lists are bad because they throw away important information. At some point they were useful because computers had very limited resources in terms of CPU cycles and disk space, but no more. I want to be able to retrieve documents which contain “to be or not to be” or “Vitamin A”, neither of which is possible if stop lists are in effect.
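To make the point concrete, here is a minimal sketch in Python. The stop list and the whitespace tokenizer are made up for illustration and do not come from any particular search engine. It shows what is left of the two example searches once stop words are thrown away.

```python
# Hypothetical stop list, for illustration only; real lists vary by engine.
STOP_WORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop every token which appears in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]


for query in ("to be or not to be", "Vitamin A"):
    kept = remove_stop_words(tokenize(query))
    print(repr(query), "->", kept)
```

With this list the first search is reduced to nothing at all, and the second becomes a search for “vitamin” alone.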

The default operator between terms should be “AND”. OK, this one is conditional. For a search engine which relies on links to calculate document relevance (such as Google, and others) this is not needed; in fact it may even hurt. But for search engines which search a batch of documents and rank them according to a tf.idf measure, ANDing the terms is a good thing because the search results will get more precise as the user adds more terms. Which makes sense. I have seen search engines out there which OR the search terms, so that the search became less and less precise as terms were added to the search.
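A small sketch of the difference, assuming a toy collection of four documents and a bare-bones inverted index (all of it invented for this example, with no tf.idf ranking involved). It only shows how the result set shrinks or grows as terms are added.

```python
# Toy collection, made up for illustration.
DOCS = {
    1: "cheap flights to paris",
    2: "cheap hotels in paris",
    3: "cheap flights to rome",
    4: "paris travel guide",
}


def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, operator="AND"):
    """Combine the posting sets of the query terms with AND or OR."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


INDEX = build_index(DOCS)
for query in ("cheap", "cheap flights", "cheap flights paris"):
    print(query,
          "| AND:", sorted(search(INDEX, query, "AND")),
          "| OR:", sorted(search(INDEX, query, "OR")))
```

With AND the three searches match three, two and then one document. With OR the last search matches all four, which is the opposite of what a user adding terms expects.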

Detecting phrases in documents and ranking those documents higher is a very good idea. Going back to our phrase “to be or not to be”, these are all terms which occur anywhere in a document, but in that sequence are very important. Well they are very important to Hamlet.
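One way this could work, as a rough sketch: score each document by counting matching terms, then multiply the score when the search terms also occur as a contiguous phrase. The term-count score, the boost value of 2.0 and the two sample documents are all assumptions made for this example; a real engine would use a positional index and a proper ranking function.

```python
def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    return [token.strip(".,;:!?") for token in text.lower().split()]


def contains_phrase(doc_tokens, query_tokens):
    """Return True if query_tokens occur as a contiguous run in doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


def score(doc, query, phrase_boost=2.0):
    """Count matching term occurrences, boosted when the phrase is present."""
    doc_tokens = tokenize(doc)
    query_tokens = tokenize(query)
    query_terms = set(query_tokens)
    base = sum(1 for token in doc_tokens if token in query_terms)
    if contains_phrase(doc_tokens, query_tokens):
        base *= phrase_boost  # assumed boost factor, not a tuned value
    return base


HAMLET = "To be, or not to be, that is the question."
OTHER = "Not to worry, it is going to be sunny or it is going to be wet."
QUERY = "to be or not to be"

print(score(HAMLET, QUERY), score(OTHER, QUERY))
```

Without the boost the second document would win on raw term counts (seven matches against six), even though only the first one contains the phrase.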

Google has conditioned us to a set of operators to control the search, such as the use of quotes to indicate phrases, ‘+’ to indicate a required term, and ‘-’ to indicate an unwanted term. Your search engine may (and probably does) support lots of operators, but it should support the Google ones, even if this is implemented as a search translation layer.
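A translation layer of this kind can be quite small. The sketch below parses quotes, ‘+’ and ‘-’ and rewrites the search into a native syntax. The `MUST(...)` and `NOT(...)` forms are a hypothetical target syntax invented for this example; a real layer would emit whatever the underlying engine expects.

```python
import re

# A quoted phrase or a run of non-space characters, with an optional
# '+' or '-' prefix.
TOKEN_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a Google-style search into required, excluded and optional parts."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for prefix, phrase, term in TOKEN_RE.findall(query):
        text = phrase if phrase else term
        if not text:
            continue
        if prefix == "+":
            parsed["required"].append(text)
        elif prefix == "-":
            parsed["excluded"].append(text)
        else:
            parsed["optional"].append(text)
    return parsed


def quote(text):
    """Put phrases back in double quotes, leave single terms alone."""
    return f'"{text}"' if " " in text else text


def to_native(parsed):
    """Render the parsed search in a hypothetical MUST/NOT syntax."""
    parts = [quote(text) for text in parsed["optional"]]
    parts += [f"MUST({quote(text)})" for text in parsed["required"]]
    parts += [f"NOT({quote(text)})" for text in parsed["excluded"]]
    return " ".join(parts)


print(to_native(parse_query('"to be or not to be" +hamlet -macbeth shakespeare')))
```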

Tokenization is also important, and needs to be more than just breaking on spaces and punctuation. Good examples of this include such terms as ‘.net’ or ‘C++’, even better ‘asp.net’. A more complicated one would be ‘foo@bar.com’: I might want to be able to search on the whole term, as well as ‘foo’ or ‘bar.com’, and get that document in all cases. And I would not even bother using a dictionary to check whether ‘asp.net’ is one term or should be broken into two tokens; that quickly becomes a maintenance nightmare.
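A minimal sketch of a tokenizer along these lines, with no dictionary involved. The rules are assumptions chosen for this example: split on whitespace, trim only the punctuation that wraps a token, keep the inner ‘.’ and ‘+’ characters, and index an email address both whole and in parts.

```python
def tokenize(text):
    """Tokenize text, keeping terms such as '.net', 'c++' and 'asp.net' whole."""
    tokens = []
    for raw in text.lower().split():
        # Trim wrapping punctuation, then a trailing full stop; a leading
        # '.' (as in '.net') and a trailing '+' (as in 'c++') are kept.
        token = raw.strip("\"'()[]{},;:!?").rstrip(".")
        if not token:
            continue
        tokens.append(token)
        if "@" in token:
            # Also index the parts, so 'foo' and 'bar.com' both match.
            tokens.extend(part for part in token.split("@") if part)
    return tokens


print(tokenize("Mail foo@bar.com about C++ and asp.net, or .net."))
```

A document containing ‘foo@bar.com’ then matches a search on the whole address, on ‘foo’ or on ‘bar.com’.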

Stemming is important, and there are lots of good stemmers out there. I would just use a plural stemmer, anything more just generates more recall than you need.
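As an illustration of how little a plural stemmer needs to do, here is a rough sketch. The suffix rules are my own simplification for this example, not those of any published stemmer, and they get irregular words such as ‘series’ wrong.

```python
def stem_plural(word):
    """Reduce a plural English word to an approximate singular form."""
    word = word.lower()
    if len(word) <= 3:
        return word  # too short to stem safely
    if word.endswith("ies"):
        return word[:-3] + "y"  # queries -> query
    if word.endswith(("sses", "shes", "ches", "xes", "zes")):
        return word[:-2]  # searches -> search
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]  # engines -> engine
    return word


for word in ("queries", "searches", "engines", "status", "class"):
    print(word, "->", stem_plural(word))
```

The same function has to be applied to both the documents at indexing time and the search terms at search time for the two to match.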

Term highlighting and keywords in context are very important for the user to determine whether the document is relevant or not without having to check the document itself. The less work you impose on your users, the more they will use your system.
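A bare-bones version of keywords in context might look like the sketch below. The window width and the `**` marker are arbitrary choices for this example, and the word-boundary matching is an assumption which would not cope with terms such as ‘C++’.

```python
import re


def keyword_in_context(text, terms, width=30, marker="**"):
    """Return a snippet around the first matching term, with terms marked."""
    if not terms:
        return None
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return None
    # Cut a window of 'width' characters either side of the first match.
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    snippet = text[start:end]
    highlighted = pattern.sub(lambda m: marker + m.group(0) + marker, snippet)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + highlighted + suffix


TEXT = "Stop lists are bad because they throw away important information."
print(keyword_in_context(TEXT, ["stop", "lists"], width=10))
```

The window is cut on characters rather than words, so a snippet can start or end in the middle of a word; a production version would snap to word or sentence boundaries.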

Finally speed. Take too long to return search results and your users will vote with their feet. Two seconds is ok, less is ideal and more is death. Which brings us back to “to be or not to be”.

August 17, 2007

Technorati stumbles

Filed under: Feedster, Search — François Schiettecatte @ 5:38 am

It looks like Technorati has just stumbled.

I am sorry to hear that things are not going well for them, but they are in a tough market, especially since Google got into it with their blogsearch.

Disclaimer - I am a stockholder in Feedster, and still consult with them from time to time.

August 13, 2007

Search engine regulation

Filed under: Search — François Schiettecatte @ 12:37 pm

I think we are seeing only the start of a very interesting debate on search engine regulation. This article is a good place to start from IMHO.

The article frames the debate as two major issues, one around search results, and another around privacy, stating its position as follows:

I haven’t been able to read the underlying paper advocating regulation and so can’t comment on the substantive arguments. But regulation, as a general matter, of natural search results would be ill advised. Consumer privacy is a different matter, where regulation is more justified.

It is hard to disagree with this position as it makes a lot of sense, but there are plenty of things a search engine can do to doctor search results. Remember back in web 1.0 days, search engines would frequently doctor the search results, peppering the top hits with ads disguised as search results. Now that search engines have become such a fundamental part of the net, both in terms of infrastructure and economics, the question will be raised whether to regulate them.

But this debate could quickly become a quagmire:

General search is not a market without competition and required government involvement/approval of algorithms or changes in algorithms would potentially impede advancements and technological development. Would all search engines (regardless of market share) be subject to similar regulation? Would any site with an algorithm that displays search results? While fairness in search results sounds good to some in the abstract, the practical implementation of such a regulatory scheme is where it all might break down and wreck havoc.

We should also not forget that policy and regulation decisions sometimes have unintended consequences.

August 7, 2007

Google indexing speed

Filed under: Search — François Schiettecatte @ 4:12 pm

Matt Cutts made the comment that the Google indexing speed has improved of late, something that was also picked up by SearchEngineWatch.

I have certainly noticed that: now when I publish an article on my blog it usually takes very little time to appear in the Google index (I run a vanity search, doesn’t everyone?). Though rarely it can take up to 12 hours for an article to appear, and once newer articles were indexed before older ones.

I had a very short conversation about indexing speed with a Googler at the Seattle Scalability Conference which went something like this:

Me: Why does it sometimes take a while for new posts on my blog to appear in the index, I mean that sometimes they appear very fast and other times it can take a while.

Googler: Well it can take a while for the rankings to converge while indexing, there are over a hundred signals which are taken into account.

Me: Well why not just index the new posts with a ranking extrapolated from previous posts for that feed, and work out a correct ranking at a later date.

Googler: Interesting idea.

Not sure if they took me up on that, but I like the fact that indexing is much faster.

August 5, 2007

Tugging the lion’s whiskers

Filed under: General, Search — François Schiettecatte @ 2:13 pm

I particularly liked Greg Linden’s post titled “Google teasing too many lions?”. I had read the original post by Robert Cringely titled “Is Google on crack?”, but read it in such a hurry that I missed its points and implications, which Greg caught.

I am wondering if Google is indeed teasing too many lions?

On one hand, Google has an estimated 10,000 employees and if they can’t handle all they are doing with that kind of staff then I would question their management’s abilities to run the business. I think it is good for them to diversify to spread the risks, seeing what sticks and what doesn’t (remember Froogle). That only makes sense to me. The only quibbles I have are that they don’t seem to cull bad ideas quickly enough, some things seem to stay in beta well past their 1.0 date, and some businesses they buy just wither away.

On the other hand, maybe there is too much diversification going on and Google is going after too many established businesses, creating a lot of enemies along the way. All these things are going to be significant distractions from their core business, making them vulnerable in the long run.

I think there are good arguments on both sides, but on balance I think it is good for Google to test new business ideas and challenge existing businesses. If anything that will wake up existing businesses which are too entrenched to generate competition and create value, both of which benefit the consumers. If Google succeeds, we as consumers stand to benefit, and if they fail, it will probably be because they tugged the incumbents’ whiskers, getting them to wake up and react.

The one thing I would caution Google about here is not to get arrogant; this is usually a very effective way to lose customers. A degree of humility is a good thing.

Disclaimer - I own Google stock.

August 1, 2007

Sphinx full text indexing with MySQL

Filed under: Search — François Schiettecatte @ 10:57 am

IBM DeveloperWorks has a good primer article on Sphinx, a full text search engine which integrates with MySQL.

I read the documentation on Sphinx a few weeks ago and it appears to be a reasonably good search engine, though I have not tested it yet.

There was one thing in the article that caught my attention. While the author says that Lucene does not have a PHP API, Solr which front-ends Lucene has a number of output options, one of which is specific to PHP.

Lucene itself requires a fair amount of work to get going since it is a toolkit. Solr provides a very flexible front-end to it, making it easy to create new indices, add and delete documents, and provides a powerful search interface. There is no integration with MySQL as such, but it would be pretty easy to write an extractor to pull data out of a MySQL database, convert the data to XML (which is what Solr expects) and pipe that data into Solr. I did that for MySQL dump files and it took me about 30 minutes to write, debug and document it.
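The conversion step is the easy part. The sketch below is not the extractor I wrote for dump files; it is a minimal Python illustration which assumes the rows have already been fetched from MySQL as dictionaries, and that the column names (`id`, `title`, `body` here, all hypothetical) match fields defined in the Solr schema. It builds the `<add><doc><field name="...">` message which Solr accepts for XML updates.

```python
import xml.etree.ElementTree as ET


def rows_to_solr_xml(rows):
    """Build a Solr <add> update message from an iterable of row dicts."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for column, value in row.items():
            if value is None:
                continue  # skip NULL columns rather than index empty fields
            field = ET.SubElement(doc, "field", name=column)
            field.text = str(value)
    # ElementTree escapes characters such as '&' and '<' in the field text.
    return ET.tostring(add, encoding="unicode")


# Hypothetical rows, standing in for the result of a SELECT.
ROWS = [{"id": 1, "title": "Search & retrieval", "body": None}]
print(rows_to_solr_xml(ROWS))
```

The resulting message would then be posted to Solr’s update handler, followed by a commit.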

July 24, 2007

OpenSearch

Filed under: Scaling, Search, Software Development — François Schiettecatte @ 9:24 pm

I have written about OpenSearch before, and came across this article on it on xml.com.

The article doesn’t break any new ground but is just a quick overview of the protocol.

July 23, 2007

Privacy, are we having the right debate?

Filed under: General, Search — François Schiettecatte @ 7:57 am

It seems like all the major search engines are falling over themselves announcing new privacy initiatives. All this is very laudable, I think it is important to have clearly defined privacy policies but I am wondering if we are actually having the right debate?

I think there are four key questions we need to look at:

  • The first is what data is being stored. Currently consumers generate a lot of data as they browse the web: search histories, pages viewed, email, documents, etc. A lot of that data can be aggregated too, providing a wealth of data. I think we understand that a search engine collects that data, but I am most interested in the intersection of Google and DoubleClick data.
  • The second is what that data is being used for. This flows out naturally from the first question. Looking at search histories and pages viewed, a search engine will be able to detect trends and recommend pages we might not have otherwise found, eventually personalizing the search results. Better ad targeting is a no-brainer too. I am also very interested to know what cross-purposes the data is being put to, for example my search history being used to provide additional signals for ad targeting when I am reading my email online.
  • The third is what the data retention policy is. This is where all the action seems to be these days: how long the data is stored for, how long cookies remain active, when data is anonymised and how. Shortening cookie expirations is privacy theater. And it has also been shown that anonymised logs are far from anonymous. Also there may be legal requirements to store data for certain lengths of time.
  • The fourth is under what circumstances data is disclosed to law enforcement agencies. This does not seem to have been all that well addressed. For example when the FBI asked the major search engines for data, all but Google rolled over and gave up the data requested. What was interesting about this is that the FBI did not press their case with Google which suggests they were on shaky legal grounds in the first place, yet everyone except Google complied.

I think it is a given that data about our browsing habits will be stored and used. This is the principal manner in which service providers learn about us and have the means to provide a better browsing experience (personalization is a big factor here).

What is important for us consumers to understand is how this data is used, aggregated, disseminated, retained and purged. At which point it will be easier to determine whether the loss of privacy is worth it.

And so far I have yet to see comprehensive information from any service providers about that.

July 22, 2007

Search, personalization & privacy

Filed under: General, Search — François Schiettecatte @ 12:32 pm

I just finished reading an interesting article about search and personalization written by Gord Hotchkiss.

Which got me thinking about search and personalization, and specifically about privacy.

First on search and personalization, I think the article put it very succinctly:

Personalization, in its simplest form, is simply knowing more about you as an individual and using that knowledge to better connect you to content and functionality on the Web.

Which tells us that the more you know about a person, the more you can personalize search results to match what they are looking for.

The article goes on to say:

We’re trying to paint personalization into a corner based on Google’s current implementation of it. And that’s absolutely the wrong thing to do. Personalization is not a currently implemented algorithm, or even some future version of the same algorithm. It’s is an area of development that will encompass many new technologies, some of which are under development right now in some corner of Google’s labs.

I think this makes two very important points.

The first is that current personalization implementations are pretty poor, I don’t think many people would disagree that they have been pretty disappointing to date.

The second is that personalization will get better over time, but that two things will need to happen, one technological and the other social. On the technology side, new personalization implementations will have to pull in research from other areas; one obvious one is data mining and there are plenty of others. On the social side, we as users will have to get much more comfortable sharing data about ourselves with whatever personalization tools are created. Currently we share very little data, namely short searches, pages viewed and search history. For any system to be truly personalized, we are going to have to share more data than that, a lot more.

And this is where things will get interesting. There will be the usual outcry about privacy, but consumers have shown themselves again and again to be willing to part with privacy in return for convenience.

So the onus is on these new personalization technologies to really deliver.

Updated, of course I should have linked to Sepandar Kamvar who is the technical lead of personalization at Google.

July 18, 2007

The Importance of being cached

Filed under: Feedster, Search — François Schiettecatte @ 8:37 pm

By way of Greg Linden, I read this very interesting paper from Yahoo Research about caching called “The Impact of Caching on Search Engines”.

I liked the discussion on term versus search caching. My experience is that term caching does not really buy you much if all you are doing is caching a posting list since that is what is stored in the index. Caching terms would make more sense if there is a field restriction on the term, but most terms don’t have field restrictions. Caching a search makes a lot more sense, and caching portions of searches also makes a lot of sense. In the search engine I developed for Feedster, I implemented both. The searches were cached, and the filters in searches were also cached. By filters I mean that we had a number of searches which were restricted to a reduced set of weblogs and these restrictions were implemented using a filter expression which was separate from the actual user search. This is pretty standard stuff, and I found that caching the filter results improved performance.

I am not sure where I stand on dynamic versus static caching though. I am not sure I make much of a distinction. I implemented a dynamic cache, i.e. I would cache the results if they were not already cached, but I did not set a limit to the cache, and I did not ‘warm’ the cache from search logs.
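To illustrate the idea (this is a sketch, not the Feedster code), here is what a dynamic cache with separate search and filter caches could look like. The class name and the two callables, `evaluate_query` and `evaluate_filter`, are hypothetical stand-ins for the real engine, and like the cache described above it has no size limit and is never warmed.

```python
class CachingSearcher:
    """Cache whole search results and, separately, filter results."""

    def __init__(self, evaluate_query, evaluate_filter):
        # Both callables are hypothetical stand-ins for the engine; each
        # takes a string and returns a set of matching document ids.
        self._evaluate_query = evaluate_query
        self._evaluate_filter = evaluate_filter
        self._search_cache = {}  # (query, filter expression) -> ids
        self._filter_cache = {}  # filter expression -> ids

    def _filtered_ids(self, filter_expr):
        """Evaluate a filter once, then serve it from the cache."""
        if filter_expr not in self._filter_cache:
            ids = frozenset(self._evaluate_filter(filter_expr))
            self._filter_cache[filter_expr] = ids
        return self._filter_cache[filter_expr]

    def search(self, query, filter_expr=None):
        """Return the ids matching the query, restricted by the filter."""
        key = (query, filter_expr)
        if key not in self._search_cache:
            ids = frozenset(self._evaluate_query(query))
            if filter_expr is not None:
                ids &= self._filtered_ids(filter_expr)
            self._search_cache[key] = ids  # unbounded: nothing is evicted
        return self._search_cache[key]
```

Two different searches sharing the same filter expression only evaluate the filter once, which is where the gain comes from when many searches are restricted to the same set of weblogs.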

Chad Walters also has some interesting thoughts on this.
