Crawling is indeed harder than it looks
Greg Linden (whose blog is a must-read because he picks up new publications very quickly) has a good post aggregating a number of papers from WWW 2008 on crawling and why crawling is hard.
I wrote the version one crawler for Feedster (version zero was not very good and got ditched very quickly), and I can confirm that it is very difficult to write a good crawler. It is basically a balancing act: how current your copy of the content is versus how much bandwidth you spend fetching it, among other trade-offs.
I finished writing a crawler a month or so ago for the current project I am working on, and it took me a while to get it to adjust the crawl interval based on how frequently a feed changes. I am not sure I have it quite right yet, and the algorithm still needs more adjustment.
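
To make the idea concrete, here is a minimal sketch of one common way to adapt a crawl interval: shorten it when a fetch finds the feed has changed, and lengthen it when it has not. This is not the algorithm from my crawler. The function name, the multipliers, and the minimum and maximum bounds are all assumptions picked for illustration.

```python
# Hypothetical bounds, chosen only for this example.
MIN_INTERVAL = 15 * 60        # never poll more often than every 15 minutes (seconds)
MAX_INTERVAL = 24 * 60 * 60   # never wait longer than one day (seconds)


def next_interval(current, changed, speedup=0.5, backoff=1.5):
    """Return the next crawl interval, in seconds, for a single feed.

    current: the interval used before the fetch that just completed.
    changed: True if that fetch found new content, False otherwise.
    speedup and backoff are assumed multipliers, not tuned values.
    """
    # Poll sooner after a change (better currency), later after a miss
    # (less wasted bandwidth).
    factor = speedup if changed else backoff
    # Clamp so a very busy or very quiet feed cannot push the interval
    # to an extreme.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, current * factor))


if __name__ == "__main__":
    interval = 60 * 60  # start by polling once an hour
    for changed in [False, False, True, True, True]:
        interval = next_interval(interval, changed)
        print(f"changed={changed!s:5} -> next fetch in {interval / 60:.0f} min")
```

The two multipliers are where the balancing act lives: a more aggressive speedup keeps the index fresher at the cost of more requests, while a larger backoff saves bandwidth but risks missing updates on feeds that post in bursts.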





