Detecting Spam just from HTTP headers

By way of Geeking with Greg, a paper on detecting spam just from HTTP headers, “Predicting Web Spam with HTTP Session Information” (PDF).

This is very interesting and I will be taking a close look at this. At Feedster dealing with spam (technically ‘splogs’) was tough challenge because it is very easy to make them look like bona-fide blogs, making it difficult to use normal spam detection methods to identify them with getting lots of false positives in the process.

We also got searches from ‘sploggers’ trawling for content to add to their ‘splogs’, while it was difficult to identify ‘regular’ searches from the mass of searches we got, I did make the decision at some point to reject searches for porn (and specifically child porn) using a simple keyword match (no censorship there, the searches I rejected were blatant). Aside from the nature of the search, I did not want them to soak up more resources then they deserved.

We also got searches which contained what looked like MD5 signatures, which I assume were ‘sploggers’ checking to see if their content got into the index.

And finally, we would occasionally get what looked like Denial Of Service attacks from various sites, usually the same search 50,000 times a day or more. What was surprising was that some of these would come from other search engines. I killed those searches too.


MySQL 5.1 Goes GA, Finally…

I was happy to see that MySQL 5.1 went GA on November 27th, it has taken a very long time to get there.

I use it for my current project, in fact I selected it over 5.0 in the expectation that it would go GA while my work was in progress.

We tried it out while I was at Feedster, in the fall of 2006 and rejected it pretty quickly because it was way too unstable and opted for 5.0 instead.

MySQL 5.1.29 Released

MySQL 5.1.29 was just released, the final release candidate on the way to general availability. I have been running 5.1.28 for a while with no issues.

The main change I see here (for me at least) is the ability to change logging without having to restart the server.

Log on demand is one of the main features of MySQL 5.1. It means that you can enable and disable general logs and slow query logs without restarting the server. You can also change the log file names, again without a restart.

What was missing was the ability of setting a log file name in the options file, without actually starting the logging. The old option log, could set the general log file name, but it will also start the logging immediately. If you want to set the log file name without activating the logging, now you can use general_log_file=filename or slow_query_log_file=file_name in the options file. These features were already available as system variables, but not as startup options.

From the “software takes time department”, I first started playing around with the 5.1 release while at Feedster at the end of 2006, but we chose not to use it because it was just too unstable.

Caching and system optimization

Greg Linden mentions an interesting paper out of Yahoo Research and presented at SIGIR 2008 “ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines”.

The excerpt that Greg republishes in his post talks about how using a pruned index worked against having a cache on the main index since only singleton searches would reach the main index.

Which got me thinking about the index we had at Feedster. I actually implemented the search engine there, and managed its day to day running, so I had quite a bit to say on how it was organized.

The index was broken down into segments (shard if you like) of 1 million entries. The entries were added to the index segments in the order they were crawled, a new index segments being generated every 10 minutes, in other words we would be adding new entries to an index segments until we reached 1 million entries and then we would start a new index segments. We had only a limited number of machines on which to run the search engines, so we organized the overall index into a pipeline, new segments were added to the top and older segments were removed from the bottom and placed into an archive. While the overall index was quite big (about 750GB if you must know), we usually only had about 45 days’ worth of posts seachable, which was fine because our default sort order was reverse chronological and we could satisfy most search terms with that. The 45 days of indices (about 66 index segments) were spread across five machines and this whole setup was replicated for redundancy and load balancing. In front of those there were five search gateways which would handle searches coming from the front end. The search gateway would autoconfigure themselves, searching the local net for search engines, figuring out what was there and putting together a map of the index segments, presenting that as a single index to the front end. So search machines could go down and come back up and stuff would just work, and I did not need to tell the search gateways where the indices were located.

The search engines themselves would create local caches of search results for each index segment, any cache file older than seven days would be deleted to make space for new cache files. One feature I implemented in the cache was a switch which would allow me to cache either the whole search or components of the search. For example, searches could be restricted to a set of blogs, and that restriction was implemented as a search, so it made sense to cache both the search and the restriction separately so that they could be reused for other searches which contained either components.

In addition I implements automatic index pruning, where the search engine (and the search gateway) would know the order of the index, ie reverse chronological, and would search each index segment in turn until the search was satisfied. Users could also control this index pruning from within the search through search modifiers. In fact I implemented a long list of search modifiers most of which we did not document on the site for one reason or another.

At peak times the search engine was handling about 1.5 million searches a day, since the searches were spread across pairs of machines, each machine was handling 750,000 searches/day over an index of 14 million entries.

Scaling MySQL at Facebook

By way of Greg Linden, some interesting notes and figures from various high traffic web sites on scaling MySQL.

As Greg points out, Facebook’s strategy is to partition the data and spread it across a lot of servers, which is pretty much the only way to go if you want to scale MySQL, or any site for that matter.

Crawling is indeed harder than it looks

Greg Linden (a must-read blog because he picks up new publications very quickly) has a good post aggregating a number of papers from WWW 2008 on crawling and why crawling is hard.

I wrote the version one crawler for Feedster (version zero was not very good and got ditched very quickly) and it is very difficult to write a good crawler. It is basically a balancing act, currency versus bandwidth usage, etc…

I finished writing a crawler a month or so ago for the current project I am working on and it took me a while to adjust the crawl interval based on how frequently a feed changed. I am not sure I have it quite right yet and the algorithm still needs more adjustment.

MySQL engines, MyISAM vs. Innodb

I think Narayan Newton does a very good job of summarizing the pros and cons of MyISAM and Innodb in this post “MySQL engines, MyISAM vs. Innodb”.

I have seen a lot written about this before but I think his post neatly summarizes the arguments on both sides and as worth reading if you are having to make a decision about this.

My reflex is to always use Innodb unless there is a compelling reason for using MyISAM, and it has to be really, really compelling.

I did take issue with one point he makes which he illustrates with an experience:

On the other hand, InnoDB is a largely ACID (Atomicity, Consistency, Isolation, Durability) engine, built to guarantee consistency and durability. It does this through a transaction log (with the option of a two-phase commit if you have the binary log enabled), a double-write buffer and automatic checksumming and checksum validation of database pages. These safety measures not only prevent corruption on “hard” shutdowns, but can even detect hardware failure (such as memory failure/corruption) and prevent damage to your data. has made use of this feature of InnoDB as well. The database in question contains a large amount of user contributed content, cvs messages, cvs history, forum messages, comments and, more critically, the issue queues for the entire Drupal project. This is not data where corruption is an option. In 2007, the master database server for the project went down. After examining the logs, it became clear that it hadn’t crashed as such, but InnoDB had read a checksum from disk that didn’t match the checksum it had in memory. In this case, the checksum miss-match was a clear sign of memory corruption. Not only did it detect this, but it killed the MySQL daemon to prevent data corruption. In fact, it wouldn’t let the MySQL daemon run for more than a half hour on that server without killing it after finding a checksum miss-match. When your data is of the utmost importance, this is very comforting behavior.

I have certainly had this happen to me, once or twice, and it is very satisfying to see Innodb recover and carry on on its merry way. However I did experience very nasty hardware failure where a RAID controller went nuts and sprayed bad data out to storage, Innodb won’t prevent this, the database had turned into a small pile of bit goo and Innodb was not able to recover it regardless of how high the ACP(*) innodb_force_recovery was set. We had to switch to backup and wipe the original system clean. It is probable that MyISAM would have been able to recover the database because of its simpler structure.

(*) ass-covering parameter.