Which solution is faster for extracting content in a web crawler?

Question

I have built a web crawler using ASP.NET, and it works well. The problem comes when I want to extract content from the pages it fetches, because the content is wrapped in HTML tags. I have a few candidate solutions, but I don't know which one is better. It should perform well and be easy to implement.

Using Regex with many patterns to extract content.
Using LINQ to XML to extract content.
Using XPath to extract content.

Could somebody please help me choose the best solution? I think I will go with XPath, but I am not sure whether its performance is better than Regex or LINQ to XML.

Many thanks for any ideas.

Daniel Hilgarth · Accepted Answer

None of your solutions is particularly good.

HTML is not a regular language, so regular expressions are a poor fit for it. See also the standard response to parsing HTML with regex.
HTML is not necessarily valid XML, so LINQ to XML and XPath over an XML parser can fail on real-world pages.
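
To illustrate the second point, here is a minimal sketch that feeds ordinary HTML to LINQ to XML. The class name `XmlCheck`, the helper `IsWellFormedXml`, and the sample markup are made up for this example; only `XDocument.Parse` and `XmlException` come from the .NET framework.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlCheck
{
    // Hypothetical helper: returns true only if the markup is well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // XDocument.Parse throws on anything that is not well-formed XML.
            return false;
        }
    }

    public static void Main()
    {
        // Common in real pages: <br> and <img> are never closed,
        // and the src attribute value is unquoted.
        string html = "<div><p>First line<br>Second line</p><img src=logo.png></div>";

        // The same content written as strict XHTML.
        string xhtml = "<div><p>First line<br/>Second line</p><img src=\"logo.png\"/></div>";

        Console.WriteLine(IsWellFormedXml(html));   // False
        Console.WriteLine(IsWellFormedXml(xhtml));  // True
    }
}
```

A crawler has no control over which of the two forms a site serves, so an XML-only approach is likely to break on a large share of pages.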

Instead, you should use an HTML parsing library like the Html Agility Pack.
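
As a rough sketch of what that could look like, the example below loads a small HTML fragment with the Html Agility Pack and pulls out paragraph text with an XPath query. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class name `Extractor`, the method `ExtractParagraphs`, the sample markup, and the `content` class name are invented for illustration and are not from the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class Extractor
{
    // Hypothetical helper: returns the text of every <p> inside <div class="content">.
    public static List<string> ExtractParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: does not require well-formed XML

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//div[@class='content']//p");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops the tags; DeEntitize turns &amp; back into &.
            result.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return result;
    }

    public static void Main()
    {
        // Unquoted attribute and unclosed <br>, as found on real pages.
        string html =
            "<div class=content><p>First line<br>Second line</p>" +
            "<p>Tom &amp; Jerry</p></div>";

        foreach (string text in ExtractParagraphs(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

This keeps the XPath approach the question was leaning toward, but runs it on a parser built for HTML rather than for XML.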

Darko Kenda · Answer

None of them. Use a proper HTML parser such as the Html Agility Pack.

Tags:

c#

asp.net

Tim Phan
