Using HtmlAgilityPack to parse web page information in C#

Question

I'm trying to use HtmlAgilityPack to parse information from a web page. This is my code:

using System;
using HtmlAgilityPack;

namespace htmparsing
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            string url = "https://bugs.eclipse.org";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(url);
            foreach(HtmlNode node in doc){
                //do something here with "node"
            }               
        }
    }
}

But when I try to access doc.DocumentElement.SelectNodes, DocumentElement does not appear in the member list. I added HtmlAgilityPack.dll to the references, but I don't know what the problem is.

Md Ashaduzzaman · Accepted Answer

I have an article that demonstrates scraping DOM elements with HAP (HTML Agility Pack) in ASP.NET. It walks through the whole process step by step, so you can have a look and try it:

Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET

As for your code, it works fine for me. I tried it the same way you did, with a single change:

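string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
// DocumentNode is the root of the parsed document.
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
if (links != null)
{
    foreach (HtmlNode node in links)
    {
        outputLabel.Text += node.InnerHtml;
    }
}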

I got the output as expected. The problem is that you are asking for DocumentElement on an HtmlDocument object, when the property you need is DocumentNode. Here's a response from a developer of HtmlAgilityPack about the problem you are facing:

HTMLDocument.DocumentElement not in object browser
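
The same fix can be tried without a network connection or an ASP.NET page. The following is only a minimal sketch, assuming the HtmlAgilityPack package is referenced; the class name DocumentNodeSketch, the helper ExtractLinks, and the sample HTML are made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class DocumentNodeSketch
{
    // Hypothetical helper: returns "href -> text" for every <a> element in the HTML.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of downloading a page

        var results = new List<string>();

        // The root is exposed as DocumentNode; HtmlDocument has no DocumentElement.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (links == null)
        {
            return results;
        }

        foreach (HtmlNode link in links)
        {
            results.Add(link.GetAttributeValue("href", "") + " -> " + link.InnerText);
        }
        return results;
    }

    static void Main()
    {
        // Made-up HTML so the sketch runs offline.
        string html = "<html><body><a href='/one'>One</a><a href='/two'>Two</a></body></html>";
        foreach (string line in ExtractLinks(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

Returning an empty list when SelectNodes yields null keeps the calling foreach loop safe on pages that contain no links.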

smartcaveman · Answer

The behavior you are seeing is correct.

Look at what you're actually doing: http://htmlagilitypack.codeplex.com/SourceControl/latest#Release/1_4_0/HtmlAgilityPack/HtmlNode.cs

You're asking the top element to select nodes matching some XPath expression. Unless your XPath expression starts with //, you're asking for nodes relative to that element, that is, nodes beneath it. A document element is not a descendant of itself, because no element is a descendant of itself.
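
The difference between a relative and a //-prefixed expression can be seen on a small document. This is a sketch under assumptions, not the library's own test: it assumes the HtmlAgilityPack package is referenced, and XPathContextSketch, Count, LoadRootElement, and the sample HTML are hypothetical names and data introduced only for this illustration.

```csharp
using System;
using HtmlAgilityPack;

static class XPathContextSketch
{
    // Hypothetical helper: counts matches, treating a null result as zero.
    public static int Count(HtmlNode context, string xpath)
    {
        HtmlNodeCollection found = context.SelectNodes(xpath);
        return found == null ? 0 : found.Count;
    }

    // Hypothetical helper: parses made-up HTML and returns its <html> element.
    public static HtmlNode LoadRootElement()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div><a href='/in'>in</a></div><a href='/out'>out</a></body></html>");
        return doc.DocumentNode.SelectSingleNode("//html");
    }

    static void Main()
    {
        HtmlNode html = LoadRootElement();

        // Relative: looks for <html> children of <html>, so nothing matches.
        Console.WriteLine(Count(html, "html"));

        // Starts with //: searches from the document root, so <html> itself is found.
        Console.WriteLine(Count(html, "//html"));

        // Relative with .//: finds every <a> beneath the context element.
        Console.WriteLine(Count(html, ".//a"));
    }
}
```

The first query comes back empty for the reason given above: the expression is evaluated relative to the <html> element, and that element is not beneath itself.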

Tags:

html

c#

html-agility-pack

star2014
