
Abot Web Crawler Performance

I have built a robots.txt crawler which extracts the URLs from robots.txt and then loads each page, running some post-processing once the page has loaded. This all happens quite fast, and I can extract information from 5 pages per second.

In the event a website doesn't have a robots.txt, I use the Abot web crawler instead. The problem is that Abot is far slower than the direct robots.txt crawler. When Abot hits a page with lots of links, it seems to schedule each link very slowly; some pages take 20+ seconds to queue all the links and run the post-processing mentioned above.

I use the PoliteWebCrawler, which is configured not to crawl external pages. Should I instead be crawling multiple websites at once, or is there another, faster way to use Abot?
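For reference, a minimal sketch of the kind of setup described above. This assumes the Abot 1.x-era API (`CrawlConfiguration`, `PoliteWebCrawler`, the `PageCrawlCompletedAsync` event); the exact property and event names may differ between Abot versions, and the throttling values shown are illustrative, not the asker's actual settings.

```csharp
using System;
using Abot.Crawler; // Abot NuGet package assumed installed
using Abot.Poco;

class CrawlerExample
{
    static void Main()
    {
        var config = new CrawlConfiguration
        {
            // Stay on the start site, as described in the question.
            IsExternalPageCrawlingEnabled = false,
            // More threads lets Abot schedule discovered links faster.
            MaxConcurrentThreads = 10,
            // Politeness delay between requests to one domain; higher = slower.
            MinCrawlDelayPerDomainMilliSeconds = 0
        };

        var crawler = new PoliteWebCrawler(config);

        // Hook post-processing into the completion event rather than
        // blocking the scheduler while each page is processed.
        crawler.PageCrawlCompletedAsync += (sender, e) =>
        {
            Console.WriteLine(e.CrawledPage.Uri);
        };

        // Blocks until the crawl of this site finishes.
        crawler.Crawl(new Uri("https://example.com"));
    }
}
```

Crawling several sites at once would then mean running one such crawler per site on separate tasks, since a single `Crawl` call works through one site's frontier sequentially per its politeness settings.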

Thanks!

asked Mar 18 '26 by Maitland Marshall


1 Answer

I added a patch to Abot to fix issues like this one. It should be available in NuGet version 1.5.1.42. See issue #134 for more details. Can you verify this fixed your issue?
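To pick up the patched build, the package can be updated to the version mentioned above, e.g. from the NuGet Package Manager Console in Visual Studio (the version number is taken from this answer; the command form assumes the standard NuGet console):

```shell
PM> Install-Package Abot -Version 1.5.1.42
```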

answered Mar 20 '26 by sjdirect