I have built a robots.txt crawler that extracts the URLs from robots.txt and then loads each page, running some post-processing once the page is done. This all happens quite fast: I can extract information from about 5 pages per second.
In the event a website doesn't have a robots.txt, I use Abot Web Crawler instead. The problem is that Abot is far slower than the direct robots.txt crawler. When Abot hits a page with lots of links, it seems to schedule each link very slowly, with some pages taking 20+ seconds to queue everything and run the post-processing mentioned above.
I use the PoliteWebCrawler, which is configured not to crawl external pages. Should I instead be crawling multiple websites at once, or is there another, faster alternative to Abot?
Thanks!
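For context, my setup looks roughly like the sketch below. This is a simplified version (assuming the Abot 1.x API; the `CrawlConfiguration` property names, the nine-argument `PoliteWebCrawler` constructor, and `example.com` are assumptions/placeholders, not taken from my actual code):

```csharp
using System;
using Abot.Crawler;
using Abot.Poco;

class CrawlerSketch
{
    static void Main()
    {
        // Assumption: Abot 1.x configuration properties.
        var config = new CrawlConfiguration
        {
            MaxConcurrentThreads = 10,               // crawl pages of the site in parallel
            IsExternalPageCrawlingEnabled = false,   // stay on the one site, as described above
            MinCrawlDelayPerDomainMilliSeconds = 0   // no artificial per-domain delay
        };

        // Passing null uses Abot's default implementation for each component.
        var crawler = new PoliteWebCrawler(
            config, null, null, null, null, null, null, null, null);

        // The same post-processing hook the robots.txt path runs.
        crawler.PageCrawlCompletedAsync += (sender, e) =>
        {
            Console.WriteLine("Crawled: " + e.CrawledPage.Uri);
        };

        crawler.Crawl(new Uri("http://example.com/"));
    }
}
```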
Added a patch to Abot to fix issues like this one. It should be available in NuGet version 1.5.1.42. See issue #134 for more details. Can you verify this fixed your issue?