Detecting Spam just from HTTP headers
December 7, 2008
This is very interesting, and I will be taking a close look at it. At Feedster, dealing with spam (technically ‘splogs’) was a tough challenge because it is very easy to make them look like bona fide blogs, which made it difficult to use normal spam detection methods to identify them without getting lots of false positives in the process.
We also got searches from ‘sploggers’ trawling for content to add to their ‘splogs’. While it was difficult to distinguish ‘regular’ searches from theirs in the mass of searches we got, I did make the decision at some point to reject searches for porn (and specifically child porn) using a simple keyword match (no censorship there; the searches I rejected were blatant). Aside from the nature of the searches, I did not want them to soak up more resources than they deserved.
We also got searches which contained what looked like MD5 signatures, which I assume were ‘sploggers’ checking to see if their content got into the index.
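As a rough illustration of how such queries could be flagged, here is a minimal sketch. It assumes that "looks like an MD5 signature" means a standalone run of exactly 32 hexadecimal characters; the function name `looks_like_md5_probe` is hypothetical, and this is not what Feedster actually ran.

```python
import re

# Assumption: an MD5-looking token is exactly 32 hex characters,
# bounded by non-word characters or the ends of the query.
MD5_LIKE = re.compile(r"\b[0-9a-fA-F]{32}\b")


def looks_like_md5_probe(query: str) -> bool:
    """Return True if the query contains a standalone 32-character hex token."""
    return MD5_LIKE.search(query) is not None
```

A check like this would presumably only be one signal among several, since a legitimate user could also search for a hash.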
And finally, we would occasionally get what looked like denial-of-service attacks from various sites, usually the same search repeated 50,000 times a day or more. What was surprising was that some of these came from other search engines. I killed those searches too.
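Spotting that kind of repetition can be sketched as a simple count over one day of search log entries. The log format (pairs of source and query), the function name `find_abusive_searches`, and the use of 50,000 as a cutoff are all assumptions made for illustration, not a description of the actual Feedster code.

```python
from collections import Counter


def find_abusive_searches(log, threshold=50_000):
    """Return the (source, query) pairs seen at least `threshold` times.

    `log` is assumed to be an iterable of (source, query) tuples
    covering a single day.
    """
    counts = Counter(
        (source, query.strip().lower()) for source, query in log
    )
    return {key for key, n in counts.items() if n >= threshold}
```

Queries are normalised by trimming whitespace and lowercasing, so trivial variations of the same search are counted together.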