François Schiettecatte’s Blog

May 10, 2008

Crawling is indeed harder than it looks

Filed under: Feedster, Scaling, Search — François Schiettecatte @ 10:10 am

Greg Linden (a must-read blog because he picks up new publications very quickly) has a good post aggregating a number of papers from WWW 2008 on crawling and why crawling is hard.

I wrote the version one crawler for Feedster (version zero was not very good and got ditched very quickly) and it is very difficult to write a good crawler. It is basically a balancing act, currency versus bandwidth usage, etc…

I finished writing a crawler a month or so ago for the current project I am working on and it took me a while to adjust the crawl interval based on how frequently a feed changed. I am not sure I have it quite right yet and the algorithm still needs more adjustment.

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a comment

You must be logged in to post a comment.

Blog at WordPress.com.