Detecting Spam just from HTTP headers

By way of Geeking with Greg, a paper on detecting spam just from HTTP headers, “Predicting Web Spam with HTTP Session Information” (PDF).

This is very interesting and I will be taking a close look at it. At Feedster, dealing with spam (technically ‘splogs’) was a tough challenge because it is very easy to make a splog look like a bona-fide blog, which makes it difficult to use normal spam detection methods to identify them without getting lots of false positives in the process.

We also got searches from ‘sploggers’ trawling for content to add to their ‘splogs’. While it was difficult to pick out ‘regular’ searches from the mass of searches we got, I did make the decision at some point to reject searches for porn (and specifically child porn) using a simple keyword match (no censorship there, the searches I rejected were blatant). Aside from the nature of the searches, I did not want them to soak up more resources than they deserved.

We also got searches which contained what looked like MD5 signatures, which I assume were from ‘sploggers’ checking to see whether their content had made it into the index.

And finally, we would occasionally get what looked like Denial Of Service attacks from various sites, usually the same search 50,000 times a day or more. What was surprising was that some of these would come from other search engines. I killed those searches too.
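Just to make that concrete, here is roughly what that kind of query filtering looks like. This is a minimal sketch, not the code we ran at Feedster: the blocked-term list, the MD5 heuristic and the repeat threshold are all illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative only: the real blocklist and thresholds are not from the post.
BLOCKED_TERMS = {"blocked-term-1", "blocked-term-2"}
MD5_PATTERN = re.compile(r"\b[0-9a-f]{32}\b", re.IGNORECASE)
REPEAT_THRESHOLD = 50_000          # roughly the daily volume mentioned above

daily_query_counts = Counter()     # would be reset once a day in a real system

def should_reject(query: str) -> bool:
    """Return True if a search query looks like splogger or abuse traffic."""
    # 1. Blatant keyword match (the simple approach described above).
    if set(query.lower().split()) & BLOCKED_TERMS:
        return True

    # 2. Queries containing what looks like an MD5 signature.
    if MD5_PATTERN.search(query):
        return True

    # 3. The same query repeated an absurd number of times in one day.
    daily_query_counts[query] += 1
    if daily_query_counts[query] > REPEAT_THRESHOLD:
        return True

    return False
```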

MySQL 5.1 Goes GA, Finally…

I was happy to see that MySQL 5.1 went GA on November 27th; it has taken a very long time to get there.

I use it for my current project; in fact, I selected it over 5.0 in the expectation that it would go GA while my work was in progress.

We tried it out while I was at Feedster in the fall of 2006, but rejected it pretty quickly because it was way too unstable, and opted for 5.0 instead.

MySQL 5.1.29 Released

MySQL 5.1.29 was just released, the final release candidate on the way to general availability. I have been running 5.1.28 for a while with no issues.

The main change I see here (for me at least) is the ability to change logging without having to restart the server.

Logging
Log on demand is one of the main features of MySQL 5.1. It means that you can enable and disable general logs and slow query logs without restarting the server. You can also change the log file names, again without a restart.

What was missing was the ability to set a log file name in the options file without actually starting the logging. The old option, log, could set the general log file name, but it would also start logging immediately. If you want to set the log file name without activating the logging, you can now use general_log_file=filename or slow_query_log_file=file_name in the options file. These features were already available as system variables, but not as startup options.
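For reference, this is what log-on-demand looks like from a client; a minimal sketch, assuming the mysql-connector-python driver, local credentials and made-up file paths.

```python
import mysql.connector  # assumption: using the mysql-connector-python driver

# Connect as a user with the SUPER privilege (needed to set global variables).
cnx = mysql.connector.connect(host="127.0.0.1", user="root", password="secret")
cur = cnx.cursor()

# Point the logs at new files and enable them -- no server restart required.
cur.execute("SET GLOBAL general_log_file = '/var/log/mysql/general.log'")
cur.execute("SET GLOBAL general_log = 'ON'")
cur.execute("SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log'")
cur.execute("SET GLOBAL slow_query_log = 'ON'")

# ...and turn them back off just as easily.
cur.execute("SET GLOBAL general_log = 'OFF'")
cur.execute("SET GLOBAL slow_query_log = 'OFF'")

cur.close()
cnx.close()
```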

From the “software takes time” department: I first started playing around with the 5.1 release while at Feedster at the end of 2006, but we chose not to use it because it was just too unstable.

Caching and system optimization

Greg Linden mentions an interesting paper out of Yahoo Research, presented at SIGIR 2008, “ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines”.

The excerpt that Greg republishes in his post talks about how using a pruned index worked against having a cache on the main index since only singleton searches would reach the main index.

That got me thinking about the index we had at Feedster. I actually implemented the search engine there and managed its day-to-day running, so I had quite a bit of say in how it was organized.

The index was broken down into segments (shards if you like) of 1 million entries each. Entries were added to the index segments in the order they were crawled, with a new index segment being generated every 10 minutes; in other words, we would add new entries to an index segment until we reached 1 million entries and then start a new one.

We had only a limited number of machines on which to run the search engines, so we organized the overall index into a pipeline: new segments were added at the top and older segments were removed from the bottom and placed into an archive. While the overall index was quite big (about 750GB if you must know), we usually only had about 45 days’ worth of posts searchable, which was fine because our default sort order was reverse chronological and we could satisfy most search terms with that.

The 45 days of indices (about 66 index segments) were spread across five machines, and this whole setup was replicated for redundancy and load balancing. In front of those there were five search gateways which would handle searches coming from the front end. The search gateways would autoconfigure themselves, searching the local net for search engines, figuring out what was there, and putting together a map of the index segments, presenting that as a single index to the front end. So search machines could go down and come back up and stuff would just work, and I did not need to tell the search gateways where the indices were located.
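A rough sketch of that segment pipeline, with in-memory lists standing in for real on-disk segments; the rollover at 1 million entries and the drop-oldest-into-an-archive behaviour are as described above, everything else is simplified.

```python
from collections import deque

SEGMENT_CAPACITY = 1_000_000   # entries per index segment
MAX_LIVE_SEGMENTS = 66         # roughly 45 days of posts, as above

class IndexPipeline:
    """Newest segments at the front, oldest fall off the back into an archive."""

    def __init__(self):
        self.segments = deque()   # deque of segments, newest first
        self.archive = []         # stand-in for the real archive step

    def add_entry(self, entry):
        # Start a new segment when there is none, or the current one is full.
        if not self.segments or len(self.segments[0]) >= SEGMENT_CAPACITY:
            self.segments.appendleft([])
        self.segments[0].append(entry)

        # Retire the oldest segment once we exceed the live window.
        if len(self.segments) > MAX_LIVE_SEGMENTS:
            self.archive.append(self.segments.pop())
```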

The search engines themselves would create local caches of search results for each index segment; any cache file older than seven days would be deleted to make space for new cache files. One feature I implemented in the cache was a switch which would allow me to cache either the whole search or components of the search. For example, searches could be restricted to a set of blogs, and that restriction was itself implemented as a search, so it made sense to cache the search and the restriction separately so that they could be reused by other searches which contained either component.
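Very roughly, the shape of that per-segment cache; the seven-day expiry and the idea of caching the search and the restriction under separate keys are from above, while the in-memory dict and the key format are simplifications.

```python
import time

CACHE_TTL = 7 * 24 * 3600  # seven days, as above

class SegmentResultCache:
    def __init__(self):
        self._cache = {}  # (segment_id, component) -> (timestamp, results)

    def get(self, segment_id, component):
        entry = self._cache.get((segment_id, component))
        if entry is None:
            return None
        timestamp, results = entry
        if time.time() - timestamp > CACHE_TTL:
            del self._cache[(segment_id, component)]  # expired, make room
            return None
        return results

    def put(self, segment_id, component, results):
        self._cache[(segment_id, component)] = (time.time(), results)

# A search restricted to a set of blogs is cached as two components, so either
# part can be reused by later searches that share it:
#   cache.put(segment_id, "query: python AND mysql", query_results)
#   cache.put(segment_id, "restriction: blogs 17,42,99", restriction_results)
```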

In addition, I implemented automatic index pruning, where the search engine (and the search gateway) would know the order of the index, i.e. reverse chronological, and would search each index segment in turn until the search was satisfied. Users could also control this index pruning from within the search through search modifiers. In fact, I implemented a long list of search modifiers, most of which we did not document on the site for one reason or another.
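The pruning itself is simple to sketch: walk the segments newest-first and stop as soon as the search is satisfied. The `wanted` cut-off and the `search_segment` callable here are placeholders, not the actual interfaces.

```python
def pruned_search(segments_newest_first, search_segment, query, wanted=100):
    """Search segments in reverse-chronological order, stopping early.

    `segments_newest_first` is an iterable of segments, newest first;
    `search_segment(segment, query)` returns a list of hits for one segment.
    """
    results = []
    for segment in segments_newest_first:
        results.extend(search_segment(segment, query))
        if len(results) >= wanted:   # search satisfied, skip older segments
            break
    return results[:wanted]
```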

At peak times the search engine was handling about 1.5 million searches a day; since the searches were spread across pairs of machines, each machine was handling about 750,000 searches a day over an index of some 14 million entries.

Scaling MySQL at Facebook

By way of Greg Linden, some interesting notes and figures from various high traffic web sites on scaling MySQL.

As Greg points out, Facebook’s strategy is to partition the data and spread it across a lot of servers, which is pretty much the only way to go if you want to scale MySQL, or any site for that matter.
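In practice that partitioning usually comes down to something as plain as hashing a key to pick a server. A minimal sketch with made-up shard names, assuming hash-based sharding on a user id; the notes do not describe Facebook’s actual scheme.

```python
import hashlib

# Hypothetical shard connection strings; a real topology is much larger.
SHARDS = [
    "mysql://db01.example.com/app",
    "mysql://db02.example.com/app",
    "mysql://db03.example.com/app",
    "mysql://db04.example.com/app",
]

def shard_for_user(user_id: int) -> str:
    """Map a user id onto one of the shard servers deterministically."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All reads and writes for this user go to the same shard:
# conn = connect(shard_for_user(1234))
```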

Crawling is indeed harder than it looks

Greg Linden (a must-read blog because he picks up new publications very quickly) has a good post aggregating a number of papers from WWW 2008 on crawling and why crawling is hard.

I wrote the version one crawler for Feedster (version zero was not very good and got ditched very quickly), and it is very difficult to write a good crawler. It is basically a balancing act: currency (how fresh your copy of each feed is) versus bandwidth usage, and so on.

I finished writing a crawler a month or so ago for the current project I am working on and it took me a while to adjust the crawl interval based on how frequently a feed changed. I am not sure I have it quite right yet and the algorithm still needs more adjustment.
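For what it’s worth, a common back-off scheme for this looks something like the sketch below; the constants are illustrative, not values from my crawler.

```python
MIN_INTERVAL = 15 * 60        # 15 minutes
MAX_INTERVAL = 24 * 3600      # one day

def next_crawl_interval(current_interval: float, feed_changed: bool) -> float:
    """Shorten the interval when a feed changed, stretch it when it did not."""
    if feed_changed:
        interval = current_interval / 2       # poll changing feeds more often
    else:
        interval = current_interval * 1.5     # back off on quiet feeds
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
```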

MySQL engines, MyISAM vs. Innodb

I think Narayan Newton does a very good job of summarizing the pros and cons of MyISAM and Innodb in this post “MySQL engines, MyISAM vs. Innodb”.

I have seen a lot written about this before, but I think his post neatly summarizes the arguments on both sides and is worth reading if you are having to make a decision about this.

My reflex is to always use Innodb unless there is a compelling reason for using MyISAM, and it has to be really, really compelling.

I did take issue with one point he makes which he illustrates with an experience:

On the other hand, InnoDB is a largely ACID (Atomicity, Consistency, Isolation, Durability) engine, built to guarantee consistency and durability. It does this through a transaction log (with the option of a two-phase commit if you have the binary log enabled), a double-write buffer and automatic checksumming and checksum validation of database pages. These safety measures not only prevent corruption on “hard” shutdowns, but can even detect hardware failure (such as memory failure/corruption) and prevent damage to your data.

Drupal.org has made use of this feature of InnoDB as well. The database in question contains a large amount of user contributed content, cvs messages, cvs history, forum messages, comments and, more critically, the issue queues for the entire Drupal project. This is not data where corruption is an option. In 2007, the master database server for the project went down. After examining the logs, it became clear that it hadn’t crashed as such, but InnoDB had read a checksum from disk that didn’t match the checksum it had in memory. In this case, the checksum miss-match was a clear sign of memory corruption. Not only did it detect this, but it killed the MySQL daemon to prevent data corruption. In fact, it wouldn’t let the MySQL daemon run for more than a half hour on that server without killing it after finding a checksum miss-match. When your data is of the utmost importance, this is very comforting behavior.

I have certainly had this happen to me once or twice, and it is very satisfying to see Innodb recover and carry on its merry way. However, I did experience a very nasty hardware failure where a RAID controller went nuts and sprayed bad data out to storage; Innodb won’t prevent that. The database had turned into a small pile of bit goo and Innodb was not able to recover it, regardless of how high the ACP(*) innodb_force_recovery was set. We had to switch to a backup and wipe the original system clean. It is probable that MyISAM would have been able to recover the database because of its simpler structure.

(*) ass-covering parameter.

Read replication with MySQL – part deux

Following up on my last post on read replication with MySQL, I read this post by Greg Linden on the subject of caching which mirrors my thinking on the matter (except that his is better written):

My opinion on this differs somewhat. I agree that read-only replication is at best a temporary scaling solution, but I disagree that object caches are the solution.

I think caching is way overdone, to the point that, in some designs, the caching layers sometimes contains more machines than the database layer. Caching layers add complexity to the design, latency on a cache miss, and inefficiency to use of cluster resources.

My experience at Feedster confirms this: once we got powerful enough servers for the DBMS, we found that we did not need to use memcached at all; in fact, it was a hindrance more than anything because it added to the number of machines that needed to be administered.

As a small postscriptum, this post by Ronald Bradford does a very good job of listing out the reasons for replication along with the advantages and disadvantages of each.

Read replication with MySQL

I have been following the thread about the death of read replication over on the Planet MySQL weblog with interest. Mixed in with this issue is the notion of caching, thrown in to illustrate that it can be used as a substitute for read replication. (See this, this and this.)

Personally I think the two issues are separate and should be treated as such, and I will be basing this on my experiences at Feedster scaling a MySQL database from about 1GB to around 1.5TB.

Initially we relied on read-replication to shift the read load from the master server to alternative read servers. For a while this worked, but as our hardware got better (post-funding) we found that the read servers were not keeping up with replication. After some amount of digging and consultation, what became very clear to me was that the read servers were never going to catch up for a very simple reason.

While the master server and the read servers were roughly the same in terms of capacity, the issue was that the read servers were having to support the same write load as the master server and, in addition, a much higher read load. Combine that with the fact that replication takes place in a single thread (whereas the master uses multiple threads to write data), and you have a situation where the read servers cannot catch up with the master server.

There are a couple of tricks you can employ to make the slave servers faster. One is to do the replication across multiple threads using a script (which I have done), but you lose referential integrity. The other is to write a utility which pre-reads the replication log and touches the relevant rows before the replication thread gets to them, so that replication is not slowed down waiting for data to be read off storage (this was the solution implemented by YouTube for a while).
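The second trick is the easier one to sketch: convert the replicated UPDATE/DELETE statements into throw-away SELECTs and run them ahead of the slave’s single replication thread, so the rows are already in memory when it gets to them. This is a toy illustration; the regex parsing is crude and `run_on_slave` is a placeholder, and a real tool would parse the binary log properly.

```python
import re

# Crude: pull the table and WHERE clause out of a replicated statement.
STMT_PATTERN = re.compile(
    r"^\s*(?:UPDATE|DELETE\s+FROM)\s+(\S+).*?\bWHERE\b(.+)$",
    re.IGNORECASE | re.DOTALL,
)

def prefetch_statement(statement: str, run_on_slave) -> None:
    """Issue a SELECT touching the same rows an UPDATE/DELETE is about to hit.

    `run_on_slave(sql)` is a placeholder for executing a query on the slave;
    the SELECT result is thrown away -- the point is just to pull the rows
    into memory so the replication thread does not wait on storage.
    """
    match = STMT_PATTERN.match(statement)
    if match:
        table, where_clause = match.groups()
        run_on_slave(f"SELECT 1 FROM {table} WHERE {where_clause}")
```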

Looping back to read replication: I agree that read replication is dead, and it should be. Replication should be used for backup purposes only, which is how we eventually used it at Feedster, and your replication server should be ready to take over if the master server fails.

On to the second issue of caching. The caching that memcached does is actually pretty simplistic: you can cache a ‘chunk of data’ somewhere and access it later if it has not been flushed to make room for other ‘chunks of data’. I say ‘chunk of data’ because that is how memcached sees it; you are responsible for serializing the data (flattening it into a contiguous area of memory) and decoding it when you get it back. Caching makes sense if it takes you more time to get data out of your database than it does to get it from the cache. Ideally you want to be in a situation where you don’t need to use caching because you can get to your data fast enough. Getting to that point means having an optimized schema and a sharded database so you can take advantage of the bandwidth that multiple machines afford you. The point is to take the memory you would use for caching and give it to your database servers.
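To make the ‘chunk of data’ point concrete, a minimal sketch assuming the pymemcache client (any client works the same way): the application pickles the object into a flat byte string on the way in and unpickles it on the way out, and memcached itself never knows or cares what is inside.

```python
import pickle
from pymemcache.client.base import Client  # assumption: pymemcache client

cache = Client(("127.0.0.1", 11211))

def cache_set(key: str, obj, ttl: int = 300) -> None:
    # Serialize ("flatten") the object into an opaque chunk of bytes.
    cache.set(key, pickle.dumps(obj), expire=ttl)

def cache_get(key: str):
    chunk = cache.get(key)           # None if it was flushed or never set
    return pickle.loads(chunk) if chunk is not None else None

# cache_set("user:1234:profile", {"name": "alice", "posts": 42})
# profile = cache_get("user:1234:profile")  # may be None -- fall back to the DB
```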

The failure of “Web 2.0”

I think this post by Greg Linden (pointing to two other posts, one by Jason Calacanis and the other by Xeni Jardin) highlights the fundamental problem with “Web 2.0”, which is that any system that gains traction will be swamped in spam.

We had this issue at Feedster going back to the beginning. It is very easy to poison the ‘pings’ and ‘feedmesh’ streams, and the Feedster ‘pings’ API was getting mostly spam. I did a quick web search and turned up two tools which allow a user to create an RSS feed and submit it to all the major search engines with a simple click. We also saw much more complex schemes to get spam into the search engine, as well as search bots ripping content wholesale and bots trying to DOS our system.
