Writing my first web crawler

Tags:

c#

web-crawler

I've been trying to find a good how-to or a beginner-friendly example for writing a first web crawler. I would like to write it in C#. Does anybody have good example code to share, or tips on sites where I can find information on C# and basic web crawling?

Thanks

Fore asked Jan 25 '26 06:01


2 Answers

HtmlAgilityPack is your friend.
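
As a rough sketch of what that can look like, the snippet below uses HtmlAgilityPack (installed from NuGet) to pull the links out of an HTML string. The names `LinkExtractor` and `ExtractLinks` are made up for illustration, and keeping only http/https links is an assumption, not something the library requires.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the absolute http/https URLs of all
    // <a href="..."> elements found in the given HTML.
    public static List<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<Uri>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (var anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", "");

            // Resolve relative links against the page URL and skip anything
            // that is malformed or not a web link (mailto:, javascript:, ...).
            if (Uri.TryCreate(baseUri, href, out Uri absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(absolute);
            }
        }
        return links;
    }
}
```

Calling `LinkExtractor.ExtractLinks(html, new Uri("http://example.com/"))` on a page containing `<a href="/about">` would yield `http://example.com/about`.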

kenny answered Jan 26 '26 19:01


Yes, HtmlAgilityPack is a good tool for parsing HTML, but that alone is definitely not enough.

There are 3 elements to crawling:

1) Crawling itself, i.e. looping through web sites: This can be done by sending requests to random IP addresses, but that does not work well. Many websites share an IP address and are distinguished only by the HTTP Host header, so a request to the bare IP address will not reach them. On top of that, far too many IP addresses are unused or do not host a web server, so this approach does not get you anywhere.

I suggest you send requests to Google (searching for words from a dictionary) and crawl the results that come back.

2) Rendering the content: Many websites generate their HTML content with JavaScript when the page loads, so a simple request will not capture the content a user would actually see. You need to render the page as a browser does. That can be done using Webkit.net, which is an open source tool, although it is still in beta.

3) Comprehending and parsing the HTML: use HtmlAgilityPack; there are tons of examples online. It can be used to crawl the site as well.
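
To show how these pieces could fit together, here is a minimal sketch of a crawl loop: a queue of pages still to fetch plus a set of URLs already seen. It is only one possible design. The names `SimpleCrawler` and `CrawlAsync`, the `maxPages` cap, and the two delegates (`fetch` for downloading or rendering a page, `extractLinks` for parsing it) are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SimpleCrawler
{
    // Hypothetical breadth-first crawl starting from a seed URL.
    //   fetch:        downloads a page and returns its HTML, or null on failure
    //   extractLinks: returns the absolute links found in that HTML
    // Returns the pages that were fetched successfully, in visiting order.
    public static async Task<List<Uri>> CrawlAsync(
        Uri seed,
        int maxPages,
        Func<Uri, Task<string>> fetch,
        Func<string, Uri, IEnumerable<Uri>> extractLinks)
    {
        var seen = new HashSet<Uri>();     // URLs already queued, to avoid loops
        var frontier = new Queue<Uri>();   // URLs waiting to be fetched
        var crawled = new List<Uri>();     // URLs fetched successfully

        frontier.Enqueue(seed);
        seen.Add(seed);

        // Stop when there is nothing left to visit or the page cap is reached.
        while (frontier.Count > 0 && crawled.Count < maxPages)
        {
            Uri current = frontier.Dequeue();

            string html = await fetch(current);
            if (html == null) continue;    // skip pages that could not be downloaded

            crawled.Add(current);

            foreach (Uri link in extractLinks(html, current))
            {
                // HashSet.Add returns false for URLs that were already queued.
                if (seen.Add(link))
                    frontier.Enqueue(link);
            }
        }
        return crawled;
    }
}
```

In a real program, `fetch` could wrap `HttpClient.GetStringAsync` in a try/catch that returns null on failure, and `extractLinks` could be an HtmlAgilityPack-based parser. Passing both in as delegates keeps the loop itself free of network and parsing details, so it can be tried out against a few in-memory pages first.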

Aliostad answered Jan 26 '26 19:01