I have built a robots.txt crawler that extracts the URLs from robots.txt and then loads each page, running some post-processing once the page is done. This all happens quite fast: I can extract information from about 5 pages per second.
In the event a website doesn't have a robots.txt, I use Abot Web Crawler instead. The problem is that Abot is far slower than the direct robots.txt crawler. When Abot hits a page with lots of links, it seems to schedule each link very slowly, with some pages taking 20+ seconds to queue everything and run the post-processing mentioned above.
I use the PoliteWebCrawler, which is configured not to crawl external pages. Should I instead be crawling multiple websites at once, or is there another, faster alternative to Abot?
Thanks!
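For context, my setup looks roughly like the sketch below. This is a simplified version (assuming the Abot 1.x API; the `CrawlConfiguration` property names, the nine-argument `PoliteWebCrawler` constructor, and `example.com` are assumptions/placeholders, not taken from my actual code):

```csharp
using System;
using Abot.Crawler;
using Abot.Poco;

class CrawlerSketch
{
    static void Main()
    {
        // Assumption: Abot 1.x configuration properties.
        var config = new CrawlConfiguration
        {
            MaxConcurrentThreads = 10,               // crawl pages of the site in parallel
            IsExternalPageCrawlingEnabled = false,   // stay on the one site, as described above
            MinCrawlDelayPerDomainMilliSeconds = 0   // no artificial per-domain delay
        };

        // Passing null uses Abot's default implementation for each component.
        var crawler = new PoliteWebCrawler(
            config, null, null, null, null, null, null, null, null);

        // The same post-processing hook the robots.txt path runs.
        crawler.PageCrawlCompletedAsync += (sender, e) =>
        {
            Console.WriteLine("Crawled: " + e.CrawledPage.Uri);
        };

        crawler.Crawl(new Uri("http://example.com/"));
    }
}
```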
Added a patch to Abot to fix issues like this one. It should be available in NuGet version 1.5.1.42. See issue #134 for more details. Can you verify this fixed your issue?