Taz Crawlers From China

Continuing with experiences on the net, another thing I have noticed is crawlers operating from China (at least that is where their originating IP addresses are from). None are well behaved and all ignore the robots.txt file. Some are well written insofar as they are efficient, but most are not and will download anything and everything, usually multiple times. I call those Taz crawlers (think Tasmanian Devil) for what should be obvious reasons. We block all these crawlers.

DDOS & Dumb Choices

Recently one of the sites I manage was subjected to a DDOS attack. It was not a DDOS attack per se; rather, someone wanted some very specific data from the site and thought it would be a good idea to contract the job out to a bot farm. The reason I say that they wanted some data is that the URLs were very specific. The net effect was a DDOS because lots of bots from all around the world were hammering the site for this data, over and over again. We were lucky in that the attack started slowly, so we were able to check the HTTP request used, see how we could screen for it, and turn away requests before they got too far down the stack. The attack lasted about 5 days.

A few things to note about this. The HTTP request was easily recognizable, so it could be screened out. The data was spread over 160 pages, with one page summarizing all of it, so a single request would have gotten the data. Because we were able to screen out the requests, the bots failed to get the data. There is a contact form on the site and they could have just asked.
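For illustration only, screening on a recognizable request could look something like the httpd.conf fragment below. This is a minimal sketch and not the rule we actually used: it assumes Apache with mod_rewrite, and both the User-Agent string (ExampleScraper) and the URL prefix (/example-data/) are made-up placeholders for whatever signature shows up in the logs.

```apache
# Sketch only: turn away requests that match a known signature.
# "ExampleScraper" and "/example-data/" are hypothetical placeholders.
RewriteEngine On
# Both conditions must match (RewriteCond lines are ANDed by default)
RewriteCond %{HTTP_USER_AGENT} ExampleScraper [NC]
RewriteCond %{REQUEST_URI} ^/example-data/
# [F] answers 403 Forbidden without passing the request any further
RewriteRule ^ - [F]
```

Rejecting the request in the web server means it never reaches the application or the database, which is the point of turning requests away early.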

Baylor-Hopkins Center for Mendelian Genomics

And the other project I have been working on is a website for the Baylor-Hopkins Center for Mendelian Genomics. This website is designed to capture patient feature and DNA sample information for sequencing single-gene mendelian phenotypes. The site is not really for public consumption though.

OMIM – Online Mendelian Inheritance in Man

For the curious, for the past 18 months I have been working on the OMIM website on-and-off. From the website:

OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources.

Worth a look if you are interested in genetics.

 

Google Please Don’t Crawl This Server

For some reason the Googlebot has found it necessary to crawl a development server of mine. I suspect that one of the users uses Google Chrome, which probably snarfs the URLs browsed.

Google tells us that one way to stop this is to return the HTTP 410 status code, and the way to enforce this in httpd.conf is:


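# Tell crawlers to go away
RewriteEngine On
# Match any of these crawler names in the User-Agent header
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot|Validator|MJ12bot|Baiduspider)
# [G] makes Apache answer 410 Gone for every matching request
RewriteRule ^.* - [G]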

 

I have included other crawlers in the list just to make sure.

 

Whale Tail

This is the tail end of a baby Humpback Whale who cruised past us. When I say cruised I mean mobbed us; three people got hit (there was no damage). I caught a picture of its tail, which was about 5 feet wide.

You can see the front part of the whale here.

Eye Contact

I had the opportunity to get on a boat going to the Silver Banks, which are located north of the Dominican Republic. There was a group cancellation and I jumped on (there is usually a two-year waiting list for this trip).

Humpback whales migrate from the North Atlantic down to the Silver Banks to raise their young for three months at the start of the year. So there is an opportunity to get in the water and swim with them.

This is a young calf (still 12 feet long) who came right up to me to check me out.