Detecting Spam just from HTTP headers

By way of Geeking with Greg, a paper on detecting spam just from HTTP headers, “Predicting Web Spam with HTTP Session Information” (PDF).

This is very interesting and I will be taking a close look at it. At Feedster, dealing with spam (technically ‘splogs’) was a tough challenge because it is very easy to make them look like bona fide blogs. That made it difficult to use normal spam detection methods to identify them without getting lots of false positives in the process.

We also got searches from ‘sploggers’ trawling for content to add to their ‘splogs’. While it was difficult to tell these apart from ‘regular’ searches in the mass of searches we got, I did make the decision at some point to reject searches for porn (and specifically child porn) using a simple keyword match (no censorship there, the searches I rejected were blatant). Aside from the nature of the searches, I did not want them to soak up more resources than they deserved.

We also got searches which contained what looked like MD5 signatures, which I assume were ‘sploggers’ checking to see if their content got into the index.
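As an illustration only, here is a minimal sketch of how queries like these could be flagged. It assumes the signatures were written as 32 hexadecimal characters, the usual text form of an MD5 digest. The name `looks_like_md5_probe` is hypothetical, and this is not the code Feedster ran.

```python
import re

# An MD5 digest is 128 bits, normally written as 32 hexadecimal characters.
# The word boundaries keep longer hex strings (e.g. SHA-1) from matching.
MD5_LIKE = re.compile(r"\b[0-9a-fA-F]{32}\b")


def looks_like_md5_probe(query: str) -> bool:
    """Return True if the query contains a token shaped like an MD5 hex digest."""
    return MD5_LIKE.search(query) is not None
```

A match only says the token has the shape of an MD5 digest. It cannot tell whether the searcher is a ‘splogger’, so it is better treated as a signal to log than as proof.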

And finally, we would occasionally get what looked like denial-of-service attacks from various sites, usually the same search 50,000 times a day or more. What was surprising was that some of these came from other search engines. I killed those searches too.
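One way to catch that kind of repetition is to count identical searches per source per day and reject them past a threshold. The sketch below is a guess at the general approach, not what Feedster did. The class name `RepeatQueryLimiter`, the `max_per_day` default and the idea of keying on a source identifier such as an IP address are all assumptions.

```python
from collections import Counter


class RepeatQueryLimiter:
    """Reject a search once the same source has repeated it too often in a day."""

    def __init__(self, max_per_day: int = 1000):
        # Hypothetical threshold; a real value would be tuned against traffic.
        self.max_per_day = max_per_day
        self.counts = Counter()

    def allow(self, source: str, query: str, day: str) -> bool:
        # Normalise the query so trivial case/whitespace changes count as repeats.
        key = (day, source, query.strip().lower())
        self.counts[key] += 1
        return self.counts[key] <= self.max_per_day
```

With `max_per_day=3`, the fourth identical search from the same source on the same day is rejected, while the same search from another source or on another day is still allowed. The counter grows without bound here, so a long-running service would need to drop old days.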

