
Which solution is faster for extracting content in a web crawler?

Tags:

c#

asp.net

I have built a web crawler using ASP.NET, and it works well. The problem comes when I want to extract content from the pages it fetches, because some of the content is wrapped in HTML tags. I have a few candidate solutions for extracting it, but I don't know which one is better. It should perform well and be easy to implement.

  1. Using Regex with many patterns to extract content.

  2. Using LINQ to XML to extract content.

  3. Using XPath to extract content.

Could somebody please help me choose the best solution? I think I will go with XPath, but I am not sure whether its performance is better than Regex or LINQ to XML.

Many thanks for any ideas.

Tim Phan asked Nov 20 '25 17:11


2 Answers

None of your solutions is particularly good.

  1. HTML is not a regular language and as such is not a good fit for regular expressions. See also the standard response to parsing HTML with regex.
  2. HTML is not necessarily valid XML, so an XML parser such as LINQ to XML may reject real-world pages.

Instead, you should use an HTML parsing library like the Html Agility Pack.
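As a rough illustration of the two points above, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is installed; the `html` string and the `ExtractDemo` class name are made up for the example and are not taken from the question. It first shows LINQ to XML rejecting markup that browsers accept, then extracts links from the same markup with the Html Agility Pack, which also happens to support the XPath queries the question was leaning towards.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is referenced

class ExtractDemo
{
    static void Main()
    {
        // Hypothetical crawler output: acceptable HTML, but not well-formed XML
        // (unquoted attribute value, unclosed <br> and <p>).
        string html =
            "<html><body><p class=intro>Hello<br>world<p>" +
            "<a href=\"/next\">Next page</a></body></html>";

        // LINQ to XML requires well-formed XML and throws on this input.
        try
        {
            XDocument.Parse(html);
            Console.WriteLine("Parsed as XML");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not well-formed XML: " + ex.Message);
        }

        // The Html Agility Pack tolerates malformed markup and exposes XPath.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", "");
                Console.WriteLine(href + " -> " + link.InnerText.Trim());
            }
        }
    }
}
```

Whether this is faster than a hand-tuned regex depends on the pages being crawled, so it is worth measuring on real input before deciding.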

Daniel Hilgarth answered Nov 23 '25 06:11


Neither. Use a proper HTML parser such as the Html Agility Pack.

Darko Kenda answered Nov 23 '25 07:11


