Google on the iPhone
Google has been doing some interesting work on its iPhone user interface. I have not used it, but the screenshots look pretty slick and are worth checking out.
Update - Ars Technica has a good post about this.
I have written about this before (I think): GigaOM talks about Google’s real competitive advantage, namely the sheer size of its infrastructure.
Very interesting paper entitled “The End of an Architectural Era (It’s Time for a Complete Rewrite)” (by way of the High Scalability weblog) comparing traditional RDBMSs (based on System R) with a newer system called H-Store.
Once you get past the sensational title, the paper does a good job of arguing that traditional RDBMSs (read ‘currently shipping from major vendors’) have not evolved to keep up with the changing hardware and research landscapes. It is hard to disagree with this: I was using Sybase SQL Server in 1989 on a Sun SPARCserver 490, and while some things have changed since then, it is surprising how many things are still the same.
But the article fails to answer a very important question for me: If the traditional RDBMSs are truly that poor, why are so many people using them?
Some information about Google Reader was leaked to the web (by way of Greg Linden).
The three posts in question are here, here and here.
I am not sure I believe all the numbers, but this is interesting stuff nonetheless.
By way of Greg Linden, an article from Microsoft Research titled “HITS on the Web: How does it Compare?” by Marc Najork, Hugo Zaragoza, and Michael Taylor. Finally, a large-scale study of several ranking algorithms (HITS, BM25F) using web crawl data.
You should read Greg’s post as well as the comments on it; I am in general agreement with them.
There are a couple of points I would make though:
* I think Google went beyond PageRank pretty quickly, since it is a well-documented and well-understood algorithm and can therefore easily be gamed.
By way of Greg Linden, a research article out of Penn State, “The Effect of Brand Awareness on the Evaluation of Search Engine Results”, tests the value of a brand when it comes to search results.
The test took four sets of search results from Google and modified the look of the page to resemble other search engines, in this case Yahoo, Microsoft Live Search, and AI2RS. Read the paper; it is short and well written.
What was curious to me was that Yahoo ranks higher in terms of precision than everyone else, which would suggest that people found results from Yahoo to be better, even though a majority of the participants regularly used Google.
To me this is an odd contradiction and, as Greg Linden points out, the study needs to be expanded to shed more light on those results.
What it does suggest though is that while Yahoo is trusted to provide better results, clearly that is not enough to cause them to prevail against Google.
I see that there is a new edition of the book “High Performance MySQL” in the works.
That is good to see and I hope that they add a lot to the book. Frankly that would not be difficult because the first edition sucked. I had been using MySQL for all of 6 months when I got it and I did not learn anything new.
On the other hand the book is useful if you don’t know anything about MySQL, but there are a number of better books out there than that one.
Two (1, 2) good articles in the Economist about Google.
Google will face more and more scrutiny as time goes by and will need to deal with it:
One obvious strategy is to allay concerns over Google’s trustworthiness by becoming more transparent and opening up more of its processes and plans to scrutiny. But it also needs a deeper change of heart. Pretending that, just because your founders are nice young men and you give away lots of services, society has no right to question your motives no longer seems sensible. Google is a capitalist tool—and a useful one. Better, surely, to face the coming storm on that foundation, than on a trite slogan that could be your undoing.
Every day I come across sites which have a search box where you can enter a search and get some results. And every day I am disappointed by the functionality (or rather the lack thereof) implemented within.
So here is a list of pet peeves:
Don’t use stop lists. Stop lists are bad because they throw away important information. At some point they were useful because computers had very limited resources in terms of CPU cycles and disk space, but no more. I want to be able to retrieve documents which contain “to be or not to be” or “Vitamin A”, neither of which is possible if stop lists are in effect.
The default operator between terms should be “AND”. OK, this one is conditional. For a search engine which relies on links to calculate document relevance (such as Google, and others) this is not needed; in fact it may even hurt. But for search engines which search a batch of documents and rank them according to a tf.idf measure, ANDing the terms is a good thing because the search results will get more precise as the user adds more terms. Which makes sense. I have seen search engines out there which OR the search terms, so that the search becomes less and less precise as terms are added to it.
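To make the AND-versus-OR point concrete, here is a minimal sketch of a tf.idf ranker over a toy in-memory index. The corpus, the function names and the exact idf formula are all assumptions made up for illustration; a real engine would do far more.

```python
import math
from collections import Counter

# Hypothetical toy corpus; ids and texts are invented for this example.
DOCS = {
    "d1": "vitamin a deficiency in children",
    "d2": "vitamin c and the common cold",
    "d3": "a cold winter in vermont",
}

def tokenize(text):
    # Naive whitespace tokenizer, and deliberately no stop list.
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index.setdefault(term, {})[doc_id] = tf
    return index

def search(query, index, n_docs, operator="AND"):
    """Return doc ids ranked by a simple tf.idf sum (ties broken by id)."""
    terms = tokenize(query)
    postings = [set(index.get(t, {})) for t in terms]
    if not postings:
        return []
    if operator == "AND":
        candidates = set.intersection(*postings)  # every term must match
    else:
        candidates = set.union(*postings)         # any term may match
    scores = {}
    for doc_id in candidates:
        score = 0.0
        for t in terms:
            tf = index.get(t, {}).get(doc_id, 0)
            if tf:
                # The "+ 1.0" smoothing is an arbitrary choice for this sketch.
                idf = math.log(n_docs / len(index[t])) + 1.0
                score += tf * idf
        scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(DOCS)
narrow = search("vitamin cold", index, len(DOCS))                 # AND
wide = search("vitamin cold", index, len(DOCS), operator="OR")    # OR
```

With AND, adding “cold” to “vitamin” shrinks the result list; with OR, the same extra term makes the list grow, which is exactly the behaviour I am complaining about.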
Detecting phrases in documents and ranking those documents higher is a very good idea. Going back to our phrase “to be or not to be”, these are all terms which can occur anywhere in a document, but in that sequence they are very important. Well, they are very important to Hamlet.
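One way to do this, sketched below under heavy simplifying assumptions, is to keep term positions in the index and boost documents where the query terms appear in sequence. The function names, the raw-count scoring and the boost factor are all hypothetical choices for the example.

```python
def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; no stop list is applied."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_docs(phrase, index):
    """Return the ids of documents that contain the terms as an exact sequence."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term i of the phrase must sit at position start + i.
            if all(start + i in index[t].get(doc_id, ())
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

def rank_with_phrase_boost(query, index, boost=2.0):
    """Score by raw term counts, then multiply phrase matches by `boost`."""
    scores = {}
    for t in query.lower().split():
        for doc_id, positions in index.get(t, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + len(positions)
    for doc_id in phrase_docs(query, index):
        scores[doc_id] *= boost
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical documents: the second has more matching terms, but never in order.
docs = {
    "hamlet": "to be or not to be that is the question",
    "scrambled": "to be to be to be or not or not",
}
index = build_positional_index(docs)
ranking = rank_with_phrase_boost("to be or not to be", index)
```

Without the boost the scrambled document wins on sheer term count; with it, the document that actually contains the phrase comes first.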
Google has conditioned us to a set of operators to control the search, such as the use of quotes to indicate phrases, ‘+’ to indicate a required term, and ‘-’ to indicate an unwanted term. Your search engine may (and probably does) support lots of operators, but it should support the Google ones, even if this is implemented as a search translation layer.
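A translation layer of this kind can be quite small. The sketch below parses quotes, ‘+’ and ‘-’ into clauses and then maps them onto a made-up backend query structure; the `must` / `must_not` / `should` keys are an assumption for illustration, not the format of any particular engine.

```python
import re

# One clause: an optional +/- prefix, then either a quoted phrase or a bare word.
CLAUSE_RE = re.compile(r'([+-]?)(?:"([^"]+)"|([^\s"]+))')

def parse_query(query):
    """Return a list of (modifier, kind, text) clauses."""
    clauses = []
    for sign, phrase, word in CLAUSE_RE.findall(query):
        modifier = {"+": "required", "-": "excluded"}.get(sign, "optional")
        if phrase:
            clauses.append((modifier, "phrase", phrase))
        else:
            clauses.append((modifier, "term", word))
    return clauses

def translate(clauses):
    """Map parsed clauses onto a hypothetical backend query structure."""
    key = {"required": "must", "excluded": "must_not", "optional": "should"}
    out = {"must": [], "must_not": [], "should": []}
    for modifier, kind, text in clauses:
        out[key[modifier]].append({kind: text})
    return out

backend_query = translate(parse_query('"to be or not to be" +hamlet -movie'))
```

Only the `translate` step needs to know about the native query syntax, so the user-facing operators can stay the same whatever engine sits underneath.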
Tokenization is also important, and needs to be more than just breaking on spaces and punctuation. Good examples of this include such terms as ‘.net’ or ‘C++’, or even better ‘asp.net’. A more complicated one would be ‘foo@bar.com’: I might want to be able to search on the whole term, as well as ‘foo’ or ‘bar.com’, and get that document in all cases. And I would not even bother using a dictionary to check whether ‘asp.net’ is one term or should be broken into two tokens; that quickly becomes a maintenance nightmare.
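As a rough sketch of what I mean, the tokenizer below splits on whitespace only, trims sentence punctuation from the edges, and indexes an address under the whole term as well as its two halves. The punctuation sets and the ‘@’ rule are assumptions picked for this example, and there is no dictionary involved.

```python
# Characters trimmed from the edges of a token. Note that '.' is only trimmed
# from the end, so '.net' keeps its leading dot, and '+' is never trimmed.
LEADING = '("\''
TRAILING = '.,;:!?)"\''

def tokenize(text):
    """Split on whitespace, trim edge punctuation, expand 'x@y' into parts."""
    tokens = []
    for raw in text.lower().split():
        whole = raw.lstrip(LEADING).rstrip(TRAILING)
        if not whole:
            continue
        tokens.append(whole)
        if "@" in whole:
            # Also index the local part and the domain on their own.
            tokens.extend(part for part in whole.split("@") if part)
    return tokens

tokens = tokenize("Mail foo@bar.com about ASP.NET, .NET and C++.")
```

Here ‘asp.net’, ‘.net’ and ‘c++’ survive as single tokens, while ‘foo@bar.com’ is indexed three ways.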
Stemming is important, and there are lots of good stemmers out there. I would just use a plural stemmer, anything more just generates more recall than you need.
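A plural stemmer does not need to be much more than a handful of suffix rules. The rules below are hand-picked for illustration (an assumption, not a specific published stemmer), and they will get some words wrong.

```python
def plural_stem(word):
    """Strip common English plural endings; leave everything else alone."""
    w = word.lower()
    if len(w) <= 3:
        return w                      # too short to stem safely
    if w.endswith("ies"):
        return w[:-3] + "y"           # queries -> query
    if w.endswith(("ches", "shes", "sses", "xes", "zes")):
        return w[:-2]                 # searches -> search, boxes -> box
    if w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]                 # documents -> document
    return w                          # class, virus, analysis stay as they are

stems = [plural_stem(w) for w in ("Queries", "searches", "documents", "class")]
```

Words such as ‘movies’ or ‘series’ are mangled by the ‘ies’ rule, so a real implementation would want an exception list; the point is only that verb forms like ‘running’ are left untouched, which keeps recall in check.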
Term highlighting and keywords in context are very important for the user to determine whether the document is relevant or not without having to check the document itself. The less work you impose on your users, the more they will use your system.
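A keyword-in-context snippet with highlighting can be sketched in a few lines. The function name, the bracket markers and the fixed character window are assumptions for the example, and the matching is plain case-insensitive substring matching rather than anything token-aware.

```python
import re

def snippet(text, terms, width=30, mark=("[", "]")):
    """Return a window of text around the first hit, with all hits marked."""
    terms = [t for t in terms if t]
    if not terms:
        return text[: 2 * width]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text[: 2 * width]      # no hit: fall back to the document start
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    # Mark every hit that falls inside the window, not just the first one.
    highlighted = pattern.sub(lambda m: mark[0] + m.group(0) + mark[1],
                              text[start:end])
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + highlighted + suffix

line = snippet("The quick brown fox jumps over the lazy dog", ["fox"], width=6)
```

In a web page the markers would be something like a bold tag instead of brackets, and the window would ideally snap to word or sentence boundaries.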
Finally, speed. Take too long to return search results and your users will vote with their feet. Two seconds is OK, less is ideal and more is death. Which brings us back to “to be or not to be”.