“The Triumph of the Wisdom of the Mob.”
By way of Greg Linden, Nicholas Carr writes:
What we seem to have here is evidence of a fundamental failure of the Web as an information-delivery service. Three things have happened, in a blink of history’s eye: (1) a single medium, the Web, has come to dominate the storage and supply of information, (2) a single search engine, Google, has come to dominate the navigation of that medium, and (3) a single information source, Wikipedia, has come to dominate the results served up by that search engine.
Even if you adore the Web, Google, and Wikipedia – and I admit there’s much to adore – you have to wonder if the transformation of the Net from a radically heterogeneous information source to a radically homogeneous one is a good thing. Is culture best served by an information triumvirate?
It’s hard to imagine that Wikipedia articles are actually the very best source of information for all of the many thousands of topics on which they now appear as the top Google search result.
What’s much more likely is that the Web, through its links, and Google, through its search algorithms, have inadvertently set into motion a very strong feedback loop that amplifies popularity and, in the end, leads us all, lemminglike, down the same well-trod path – the path of least resistance. You might call this the triumph of the wisdom of the crowd. I would suggest that it would be more accurately described as the triumph of the wisdom of the mob.
I feel like we have seen this before with blogs: for a while they dominated the top search results, and at some point this was fixed. This case is a little more complicated politically, since it concerns a single site rather than lots of smaller sites, and Google does have a competitor to Wikipedia, so making any changes would necessarily be sensitive.
Clearly both Google and Wikipedia need better competition in the marketplace.
Lucene Gets Support
I have known for a while that Lucid Imagination was in the works. I guess this article makes it official:
“Lucene and Solr are probably some of the most underestimated players in this market,” said CEO Eric Gries. The Solr search server provides a front end to the core Lucene search engine library.
Lucene is in use at more than 4,000 organizations worldwide, according to Lucid Imagination. But companies that are starting to use the search engine for mission-critical applications can’t rely solely “on the good graces of the [open-source] community” for support and quick bug fixes, Gries said.
Demo Work
I have been spending most of this long weekend working on a demo which required me to use SOLR/Lucene. I have not looked at it in any depth for about a year, and it looks like things have moved along nicely since.
Two things still irritate me: it seems to be pretty hard to set up multiple indices on a server (maybe I am missing something), and there is no support for distributed searching, at least not built into the product.
Still, it is really easy to work with.
Full Text Indices in MySQL
Interesting post about full text indices in MySQL:
I rarely use MySQL Fulltext indexes. Their performance is just not good enough, so often its better to just stick with “LIKE” or move to something else like Sphinx, Lucene etc. The only nice thing about them is the ability to compute a match “rank”. Well anyways I had to write a new search plugin for a project that is based around MySQL Fulltext indexes and a match rank and all as well .. except that for some reasons some words just would not produce any results. As I was trying to find a pattern I finally noticed that in my test data some words were used in most rows and exactly those were not matching. Obviously it makes sense to exclude automatically any words that have a very high hit ratio. And indeed the documentation states that by default all words with a hit ratio of over 50% are excluded. Doh!
The performance is indeed not very good. Way back in time Feedster started off using the MySQL full text index and performance really sucked. We gave up after having added about 1 million posts to the database (and handling about 5000 searches a day), and moved to the full text search engine I had written for the task. (In all fairness, Scott and I were still merging the two systems we had developed: we kept his UI and database schema, and my crawler and full text search engine.)
We moved to a system where the indexer would pull recently added data from the database, index that and make it available for searching, effectively making MySQL a repository. Of course it makes much more sense to have the crawler queue up data for the indexer and bypass the repository completely.
State of the Blogosphere
Technorati has published a “State of the Blogosphere” for 2008 (via SearchEngineWatch.)
Worth taking a quick look.
Caching and system optimization
Greg Linden mentions an interesting paper out of Yahoo Research and presented at SIGIR 2008 “ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines”.
The excerpt that Greg republishes in his post talks about how using a pruned index worked against having a cache on the main index since only singleton searches would reach the main index.
Which got me thinking about the index we had at Feedster. I actually implemented the search engine there, and managed its day to day running, so I had quite a bit to say on how it was organized.
The index was broken down into segments (shards if you like) of 1 million entries. The entries were added to the index segments in the order they were crawled, with the current index segment being regenerated every 10 minutes. In other words, we would keep adding new entries to an index segment until it reached 1 million entries and then we would start a new one. We had only a limited number of machines on which to run the search engines, so we organized the overall index into a pipeline: new segments were added to the top, and older segments were removed from the bottom and placed into an archive. While the overall index was quite big (about 750GB if you must know), we usually only had about 45 days’ worth of posts searchable, which was fine because our default sort order was reverse chronological and we could satisfy most searches with that.
The 45 days of indices (about 66 index segments) were spread across five machines, and this whole setup was replicated for redundancy and load balancing. In front of those there were five search gateways which would handle searches coming from the front end. The search gateways would autoconfigure themselves, searching the local network for search engines, figuring out what was there, putting together a map of the index segments, and presenting that as a single index to the front end. So search machines could go down and come back up and stuff would just work, and I did not need to tell the search gateways where the indices were located.
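The pipeline described above can be sketched roughly as follows. This is a minimal model and not the Feedster code: the class name, its methods and the use of in-memory lists are all assumptions made for illustration, and only the two constants come from the figures in this post.

```python
from collections import deque

SEGMENT_CAPACITY = 1_000_000  # entries per segment (figure from the post)
LIVE_SEGMENTS = 66            # roughly 45 days' worth (figure from the post)


class SegmentPipeline:
    """Hypothetical model: newest segment at the top, oldest archived."""

    def __init__(self, capacity=SEGMENT_CAPACITY, max_live=LIVE_SEGMENTS):
        self.capacity = capacity
        self.max_live = max_live
        self.live = deque()  # live[0] is the newest segment
        self.archive = []    # segments retired from the bottom

    def add(self, entry):
        # Start a new segment when there is none or the newest one is full.
        if not self.live or len(self.live[0]) >= self.capacity:
            self.live.appendleft([])
            # Retire the oldest segment once the live window is exceeded.
            if len(self.live) > self.max_live:
                self.archive.append(self.live.pop())
        self.live[0].append(entry)
```

With a capacity of 2 and a window of 2 live segments, adding five entries leaves the two newest segments searchable and moves the oldest one to the archive.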
The search engines themselves would create local caches of search results for each index segment, and any cache file older than seven days would be deleted to make space for new cache files. One feature I implemented in the cache was a switch which would allow me to cache either the whole search or components of the search. For example, searches could be restricted to a set of blogs, and that restriction was implemented as a search, so it made sense to cache both the search and the restriction separately so that they could be reused for other searches which contained either component.
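To illustrate why caching the components separately pays off, here is a small sketch. It is not the original implementation: `ComponentCache`, `run_query` and the idea of representing results as sets of document ids are assumptions, and the per-segment cache files and seven-day expiry are left out.

```python
class ComponentCache:
    """Caches the search terms and the restriction as separate entries."""

    def __init__(self, run_query):
        # run_query is a hypothetical callable: query string -> iterable of doc ids
        self.run_query = run_query
        self.cache = {}
        self.misses = 0  # how many times the underlying index was actually hit

    def _lookup(self, query):
        if query not in self.cache:
            self.misses += 1
            self.cache[query] = frozenset(self.run_query(query))
        return self.cache[query]

    def search(self, terms, restriction=None):
        hits = self._lookup(terms)
        if restriction is None:
            return hits
        # The restriction is itself a search, so its cached result is reusable.
        return hits & self._lookup(restriction)
```

Two different searches restricted to the same set of blogs only evaluate the restriction once; the second search costs a single index lookup for its own terms.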
In addition I implemented automatic index pruning, where the search engine (and the search gateway) would know the order of the index, i.e. reverse chronological, and would search each index segment in turn until the search was satisfied. Users could also control this index pruning from within the search through search modifiers. In fact I implemented a long list of search modifiers, most of which we did not document on the site for one reason or another.
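The pruning idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the real search engine: `pruned_search`, the `matches` predicate and the `limit` parameter are hypothetical names, and "satisfied" is taken here to mean "enough results to fill the requested page".

```python
def pruned_search(segments, matches, limit):
    """Search segments newest-first and stop once `limit` results are found.

    segments: list of segments, newest first; each segment is a list of
              entries already in reverse chronological order.
    matches:  predicate deciding whether an entry matches the search.
    Returns the results and the number of segments actually searched.
    """
    results = []
    searched = 0
    for segment in segments:
        searched += 1
        results.extend(entry for entry in segment if matches(entry))
        if len(results) >= limit:
            break  # search satisfied, older segments are never touched
    return results[:limit], searched
```

Because the default sort order is reverse chronological, the first `limit` matches found this way are also the ones that would be shown first, so the older segments can be skipped without changing the result page.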
At peak times the search engine was handling about 1.5 million searches a day. Since the searches were spread across pairs of machines, each machine was handling about 750,000 searches/day over an index of about 14 million entries.
Mobile search and Google
This particular article about Google and mobile search on Ars Technica caught my eye. The opening paragraph says it all really:
Google managed to spank the rest of the mobile search world during the first quarter of 2008, according to data from Nielsen Mobile. The search giant managed to capture 61 percent of the mobile search market in the first four months of the year, with Yahoo! taking a very distant second at 18 percent. MSN sat at third place with a measly 5 percent.
I mention this because about six months ago I was tapped to join a mobile search startup (which will remain nameless). I did about a day’s worth of research in the mobile search segment and some things jumped out at me pretty quickly.
The carriers have done a very poor job of marketing their own services to their customers. For example, I have been an AT&T customer (and Cingular before that) since 1999 and I never knew that they have a property called MediaMall, and I consider myself somewhat tech savvy. This is in fact a problem for them since I have not built up any loyalty to this service.
Smart phones are the way forward, that is really clear to me. With these smart phones comes a web browser with a connection to the internet, allowing customers to completely bypass anything the carrier may have to offer and go directly to the services they already use from their computer.
Which made it really clear to me that online search would belong to Google once they started moving in on that space, first via browsers on smart phones (like the iPhone), and then via their own cell phone platform, namely Android.
Looking out I could see two trends, one going up and the other going down, no prizes for guessing which is which.
Scaling MySQL at Facebook
By way of Greg Linden, some interesting notes and figures from various high traffic web sites on scaling MySQL.
As Greg points out, Facebook’s strategy is to partition the data and spread it across a lot of servers, which is pretty much the only way to go if you want to scale MySQL, or any site for that matter.
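As a rough illustration of what partitioning looks like from the application side, here is a minimal sketch. The notes do not say how Facebook assigns data to servers, so the hash-based scheme, the function name `shard_for` and the choice of a user id as the key are all assumptions.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key (e.g. a user id) to a shard number in [0, shard_count).

    A stable hash is used instead of Python's built-in hash() so that the
    mapping stays the same across processes and restarts.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

All reads and writes for a given key go to the server returned by `shard_for`, so each MySQL instance only holds and serves its own slice of the data. The trade-off of plain modulo hashing is that changing the number of shards moves most keys, which is why real deployments often add a lookup table or consistent hashing on top.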
Crawling is indeed harder than it looks
Greg Linden (a must-read blog because he picks up new publications very quickly) has a good post aggregating a number of papers from WWW 2008 on crawling and why crawling is hard.
I wrote the version one crawler for Feedster (version zero was not very good and got ditched very quickly) and it is very difficult to write a good crawler. It is basically a balancing act, currency versus bandwidth usage, etc…
I finished writing a crawler a month or so ago for the current project I am working on and it took me a while to adjust the crawl interval based on how frequently a feed changed. I am not sure I have it quite right yet and the algorithm still needs more adjustment.
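One simple way to adapt the crawl interval to how often a feed changes is sketched below. This is not the algorithm I actually use, just a minimal example of the idea: the function name, the floor and ceiling, and the speed-up and back-off factors are all assumed values chosen for illustration.

```python
MIN_INTERVAL = 15 * 60       # seconds; hypothetical floor between crawls
MAX_INTERVAL = 24 * 60 * 60  # seconds; hypothetical ceiling between crawls


def next_interval(current, changed, speedup=0.5, backoff=1.5):
    """Return the next crawl interval for a feed, in seconds.

    current: the interval used for the crawl that just finished.
    changed: whether that crawl found new or modified content.
    A feed that changed is revisited sooner; one that did not is
    revisited later. The result is clamped to the floor and ceiling.
    """
    factor = speedup if changed else backoff
    return max(MIN_INTERVAL, min(MAX_INTERVAL, current * factor))
```

A feed crawled hourly that has changed would be crawled again in 30 minutes, while one that has not changed would wait 90 minutes. The balancing act is in the factors: shrink too aggressively and bandwidth is wasted, back off too aggressively and the index goes stale.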
Read replication with MySQL – part deux
Following up on my last post on read replication with MySQL, I read this post by Greg Linden on the subject of caching which mirrors my thinking on the matter (except that his is better written):
My opinion on this differs somewhat. I agree that read-only replication is at best a temporary scaling solution, but I disagree that object caches are the solution.
I think caching is way overdone, to the point that, in some designs, the caching layers sometimes contains more machines than the database layer. Caching layers add complexity to the design, latency on a cache miss, and inefficiency to use of cluster resources.
My experience at Feedster confirms this. Once we got powerful enough servers for the DBMS, we found that we did not need to use memcached at all. In fact it was a hindrance more than anything because it added to the number of machines that needed to be administered.
As a small postscriptum, this post by Ronald Bradford does a very good job of listing out the reasons for replication along with the advantages and disadvantages of each.





