Greg Linden (a must-read blog because he picks up new publications very quickly) has a good post aggregating a number of papers from WWW 2008 on crawling and why crawling is hard.
I wrote the version one crawler for Feedster (version zero was not very good and got ditched very quickly), and I can say that it is very difficult to write a good crawler. It is basically a balancing act: currency versus bandwidth usage, and so on.
I finished writing a crawler a month or so ago for the current project I am working on and it took me a while to adjust the crawl interval based on how frequently a feed changed. I am not sure I have it quite right yet and the algorithm still needs more adjustment.
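To make the idea concrete, here is a minimal sketch of one way to adapt a crawl interval to how often a feed changes. It is not the algorithm I actually use; the class name, the halving and 1.5x growth factors, and the minimum and maximum bounds are all assumptions picked for illustration.

```python
from dataclasses import dataclass

# Assumed bounds, chosen only for illustration.
MIN_INTERVAL = 15 * 60        # never crawl more often than every 15 minutes
MAX_INTERVAL = 24 * 60 * 60   # never wait longer than one day


@dataclass
class FeedSchedule:
    """Hypothetical per-feed state: seconds to wait before the next crawl."""
    interval: float = 60 * 60  # assumed starting point: one hour

    def record_crawl(self, changed: bool) -> float:
        """Shrink the interval when the feed changed, grow it when it did not."""
        if changed:
            # The feed is active: come back sooner (favours currency).
            self.interval = max(MIN_INTERVAL, self.interval / 2)
        else:
            # Nothing new: back off (favours bandwidth).
            self.interval = min(MAX_INTERVAL, self.interval * 1.5)
        return self.interval
```

Backing off more slowly than speeding up means a feed that goes quiet for a while is still picked up reasonably quickly when it wakes up; the right factors depend entirely on the feeds being crawled.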
I think Narayan Newton does a very good job of summarizing the pros and cons of MyISAM and Innodb in this post “MySQL engines, MyISAM vs. Innodb”.
I have seen a lot written about this before, but I think his post neatly summarizes the arguments on both sides and is worth reading if you have to make a decision about this.
My reflex is to always use Innodb unless there is a compelling reason for using MyISAM, and it has to be really, really compelling.
I did take issue with one point he makes which he illustrates with an experience:
On the other hand, InnoDB is a largely ACID (Atomicity, Consistency, Isolation, Durability) engine, built to guarantee consistency and durability. It does this through a transaction log (with the option of a two-phase commit if you have the binary log enabled), a double-write buffer and automatic checksumming and checksum validation of database pages. These safety measures not only prevent corruption on “hard” shutdowns, but can even detect hardware failure (such as memory failure/corruption) and prevent damage to your data.
Drupal.org has made use of this feature of InnoDB as well. The database in question contains a large amount of user contributed content, cvs messages, cvs history, forum messages, comments and, more critically, the issue queues for the entire Drupal project. This is not data where corruption is an option. In 2007, the master database server for the project went down. After examining the logs, it became clear that it hadn’t crashed as such, but InnoDB had read a checksum from disk that didn’t match the checksum it had in memory. In this case, the checksum miss-match was a clear sign of memory corruption. Not only did it detect this, but it killed the MySQL daemon to prevent data corruption. In fact, it wouldn’t let the MySQL daemon run for more than a half hour on that server without killing it after finding a checksum miss-match. When your data is of the utmost importance, this is very comforting behavior.
I have certainly had this happen to me, once or twice, and it is very satisfying to see Innodb recover and carry on on its merry way. However, I did experience a very nasty hardware failure where a RAID controller went nuts and sprayed bad data out to storage. Innodb won’t prevent this: the database had turned into a small pile of bit goo, and Innodb was not able to recover it regardless of how high the ACP(*) innodb_force_recovery was set. We had to switch to backup and wipe the original system clean. It is probable that MyISAM would have been able to recover the database because of its simpler structure.
(*) ass-covering parameter.
Following up on my last post on read replication with MySQL, I read this post by Greg Linden on the subject of caching which mirrors my thinking on the matter (except that his is better written):
My opinion on this differs somewhat. I agree that read-only replication is at best a temporary scaling solution, but I disagree that object caches are the solution.
I think caching is way overdone, to the point that, in some designs, the caching layers sometimes contains more machines than the database layer. Caching layers add complexity to the design, latency on a cache miss, and inefficiency to use of cluster resources.
My experience at Feedster confirms this. Once we got powerful enough servers for the DBMS, we found that we did not need to use memcached at all; in fact it was a hindrance more than anything, because it added to the number of machines that needed to be administered.
As a small postscriptum, this post by Ronald Bradford does a very good job of listing out the reasons for replication along with the advantages and disadvantages of each.
I have been following the thread about the death of read replication over on the Planet MySQL weblog with interest. Caching has been thrown into the discussion as well, as a possible substitute for read replication. (See this, this and this.)
Personally I think the two issues are separate and should be treated as such, and I will be basing this on my experiences at Feedster scaling a MySQL database from about 1GB to around 1.5TB.
Initially we relied on read-replication to shift the read load from the master server to alternative read servers. For a while this worked, but as our hardware got better (post-funding) we found that the read servers were not keeping up with replication. After some amount of digging and consultation, what became very clear to me was that the read servers were never going to catch up for a very simple reason.
While the master server and the read servers were roughly the same in terms of capacity, the issue was that each read server was having to support the same write load as the master server and, in addition, a much higher read load. Combine that with the fact that replication takes place in a single thread (whereas the master uses multiple threads to write data), and you have a situation where the read servers cannot catch up with the master server.
There are a couple of tricks you can employ to make the slave servers faster. One is to do the replication across multiple threads using a script (which I have done), but you lose referential integrity. The other is to write a utility which pre-reads the replication log and accesses the relevant rows ahead of the replication thread, so that replication is not slowed down waiting for data to be read off storage (this was the solution implemented by YouTube for a while).
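As a rough illustration of the second trick, the sketch below turns an upcoming write statement into a read that touches the same rows, so the pages are already in memory when the replication thread gets to them. This is a toy under stated assumptions, not the utility YouTube ran: the function name and the regular expressions are mine, it assumes statement-based replication with simple single-table UPDATE and DELETE statements, and it would be fooled by a WHERE that appears inside a string literal.

```python
import re

# Hypothetical patterns covering only simple, single-table statements.
_UPDATE = re.compile(
    r"^\s*UPDATE\s+(\S+)\s+SET\s+.+?\s+WHERE\s+(.+?);?\s*$", re.I | re.S
)
_DELETE = re.compile(
    r"^\s*DELETE\s+FROM\s+(\S+)\s+WHERE\s+(.+?);?\s*$", re.I | re.S
)


def prefetch_query(statement):
    """Return a SELECT touching the rows the write will touch, or None.

    Running the SELECT ahead of the replication thread pulls the rows
    off storage early; statements we do not understand are skipped.
    """
    for pattern in (_UPDATE, _DELETE):
        match = pattern.match(statement)
        if match:
            table, where = match.groups()
            return f"SELECT 1 FROM {table} WHERE {where}"
    return None
```

A real utility would also have to read the log far enough ahead to be useful but not so far that the rows are evicted again before replication reaches them.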
Looping back to read replication. I agree that read replication is dead, and it should be. Replication should be used for backup purposes only, which is what we eventually did at Feedster. And your replication server should be ready to take over if the master server fails.
Onto the second issue of caching. The caching that memcached does is actually pretty simplistic. You can cache a ‘chunk of data’ somewhere and access it later if it has not been flushed to make room for other ‘chunks of data’. I say ‘chunk of data’ because that is how memcached sees it: you are responsible for serializing the data (flattening it into a contiguous area of memory) and decoding it when you get it back. Caching makes sense if it takes you more time to get data out of your database than it does to get it from cache. Ideally you want to be in a situation where you don’t need to use caching because you can get to your data fast enough. Getting to that point means having an optimized schema and a sharded database so you can take advantage of the bandwidth that multiple machines afford you. The point is to take the memory you would use for caching and give it to your database servers.
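For readers who have not used this kind of cache, here is a small sketch of the ‘chunk of data’ model. It uses an in-process stand-in rather than the memcached client API; ChunkCache, fetch and load_from_db are hypothetical names, and pickle is just one possible way to flatten the data.

```python
import pickle
from collections import OrderedDict


class ChunkCache:
    """Toy stand-in for a memcached-style cache.

    It stores opaque bytes under a key and silently drops the least
    recently used chunk when it runs out of room.
    """

    def __init__(self, max_items):
        self.max_items = max_items
        self._chunks = OrderedDict()

    def get(self, key):
        if key not in self._chunks:
            return None                      # never cached, or flushed
        self._chunks.move_to_end(key)        # mark as recently used
        return self._chunks[key]

    def set(self, key, chunk):
        self._chunks[key] = chunk
        self._chunks.move_to_end(key)
        while len(self._chunks) > self.max_items:
            self._chunks.popitem(last=False)  # flush the oldest chunk


def fetch(key, cache, load_from_db):
    """Try the cache first; on a miss, hit the database and cache the result."""
    chunk = cache.get(key)
    if chunk is not None:
        return pickle.loads(chunk)           # decoding is the caller's job
    value = load_from_db(key)                # the slow path being avoided
    cache.set(key, pickle.dumps(value))      # so is serializing
    return value
```

Note that a miss costs a cache lookup plus the database read, which is why the whole arrangement only pays off when the database is the slow part.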
I think this post by Greg Linden (pointing to two other posts, one by Jason Calacanis and the other by Xeni Jardin) highlights the fundamental problem with “Web 2.0”, which is that any system that gains traction will be swamped in spam.
We had this issue at Feedster going back to the beginning. It is very easy to poison the ‘pings’ and ‘feedmesh’ streams, and the Feedster ‘pings’ API was getting mostly spam. I did a quick web search and turned up two tools which allow a user to create an RSS feed and submit it to all the major search engines with a simple click. We also saw much more complex schemes to get spam into the search engine, as well as search bots ripping content wholesale and bots trying to DoS our system.
It looks like Technorati has just stumbled.
I am sorry to hear that things are not going well for them, but they are in a tough market, especially since Google got into it with their blogsearch.
Disclaimer - I am a stockholder in Feedster, and still consult with them from time to time.
By way of Greg Linden, I read this very interesting paper from Yahoo Research about caching called “The Impact of Caching on Search Engines”.
I liked the discussion on term versus search caching. My experience is that term caching does not really buy you much if all you are doing is caching a posting list, since that is what is stored in the index. Caching terms would make more sense if there is a field restriction on the term, but most terms don’t have field restrictions. Caching a search makes a lot more sense, and caching portions of searches also makes a lot of sense. In the search engine I developed for Feedster, I implemented both. The searches were cached, and the filters in searches were also cached. By filters I mean that we had a number of searches which were restricted to a reduced set of weblogs, and these restrictions were implemented using a filter expression which was separate from the actual user search. This is pretty standard stuff, and I found that caching the filter results improved performance.
I am not sure where I stand on dynamic versus static caching though. I am not sure I make much of a distinction. I implemented a dynamic cache, i.e. I would cache the results if they were not already cached, but I did not set a limit on the cache, and I did not ‘warm’ the cache from search logs.
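The shape of that arrangement can be sketched as follows. This is not the Feedster code: CachingSearcher, run_query and run_filter are hypothetical names, results are modelled as plain sets of document ids, and, like the cache described above, it is filled on demand with no size limit and no warming.

```python
class CachingSearcher:
    """Sketch of a search cache plus a separate cache for filter results."""

    def __init__(self, run_query, run_filter):
        self._run_query = run_query      # hypothetical: query -> iterable of doc ids
        self._run_filter = run_filter    # hypothetical: filter expr -> iterable of doc ids
        self._search_cache = {}          # (query, filter expr) -> frozenset of doc ids
        self._filter_cache = {}          # filter expr -> frozenset of doc ids

    def _filtered_docs(self, filter_expr):
        # A filter is shared by many different user searches, so its
        # result is cached once and reused across all of them.
        if filter_expr not in self._filter_cache:
            self._filter_cache[filter_expr] = frozenset(self._run_filter(filter_expr))
        return self._filter_cache[filter_expr]

    def search(self, query, filter_expr=None):
        key = (query, filter_expr)
        if key in self._search_cache:
            return self._search_cache[key]          # whole search already cached
        hits = set(self._run_query(query))
        if filter_expr is not None:
            hits &= self._filtered_docs(filter_expr)  # restrict to the filtered set
        result = frozenset(hits)
        self._search_cache[key] = result
        return result
```

The filter cache earns its keep because the same restriction is applied to many distinct user searches, whereas a cached search only helps when the exact same search comes in again.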
Chad Walters also has some interesting thoughts on this.
Google has an interesting way of dealing with people who launch too many searches from their computers.
While working at Feedster I routinely saw the same search coming in many times a second, usually from the same subnet or the same computer. In some cases these were due to crawlers gone wild, but in other cases the pattern looked malicious.
What we did to handle that was to put in place a way of measuring the incoming search rate and looking for high numbers of searches from the same subnet or the same computer. When we could contact an admin to check into it we did so, but if there was no contact, we initially throttled the searches, effectively putting in a delay. If that did not help, we would just cut off the IP address.
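A minimal sketch of that kind of monitor is below. It is not what we ran at Feedster: the class name, the sliding window and both thresholds are assumptions, it groups IPv4 clients by /24 subnet, and it collapses the "throttle first, cut off if that does not help" escalation into two fixed thresholds.

```python
import ipaddress
from collections import defaultdict, deque


def subnet_of(ip):
    """Group an IPv4 address into its /24 subnet (an assumed granularity)."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


class SearchRateMonitor:
    """Counts recent searches per client key over a sliding time window."""

    def __init__(self, window=60.0, throttle_at=30, block_at=120):
        self.window = window            # seconds of history to keep
        self.throttle_at = throttle_at  # searches per window before delaying
        self.block_at = block_at        # searches per window before cutting off
        self._hits = defaultdict(deque)

    def check(self, client, now):
        """Record one search at time `now` and return the action to take."""
        hits = self._hits[client]
        hits.append(now)
        # Forget searches that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.block_at:
            return "block"
        if len(hits) >= self.throttle_at:
            return "throttle"
        return "allow"
```

Calling check() once with the client's IP address and once with subnet_of() of that address covers both cases mentioned above: one noisy machine, and many machines on the same subnet.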
I remember two patterns to the searches: one was that a lot of them came from China, and the other was that a lot of them were looking for kiddie porn.
I think Google handles this problem better than we did at Feedster, but then again they have a lot more engineers.
Feedster has a new look and new features, channels and widgets, allowing you to build what they are calling feedwidgets for your blog and/or web site.
I personally know most of the people who worked on this new design and features, props to them for putting this together.