Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using HtmlAgilityPack for parsing a web page information in C#

I'm trying to use HtmlAgilityPack for parsing a web page information. This is my code:

using System;
using HtmlAgilityPack;

namespace htmparsing
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            string url = "https://bugs.eclipse.org";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(url);
            foreach(HtmlNode node in doc){
                //do something here with "node"
            }               
        }
    }
}

But when I tried to access to doc.DocumentElement.SelectNodes I can not see DocumentElement in the list. I added the HtmlAgilityPack.dll in the references, but I don't know what's the problem.

like image 916
star2014 Avatar asked Dec 28 '25 16:12

star2014


2 Answers

I've an article that demonstrates scraping DOM elements with HAP (HTML Agility Pack) using ASP.NET. It simply lets you go through the whole process step by step. You can have a look and try it.

Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET

and about your process it's working fine for me. I've tried this way as you did with a single change.

string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a")) 
{
    outputLabel.Text += node.InnerHtml;
}

Got the output as expected. The problem is you are asking for DocumentElement from HtmlDocument object which actually should be DocumentNode. Here's a response from a developer of HTMLAgilityPack about the problem you are facing.

HTMLDocument.DocumentElement not in object browser

like image 144
Md Ashaduzzaman Avatar answered Dec 31 '25 04:12

Md Ashaduzzaman


The behavior you are seeing is correct.

Look at what you're actually doing: http://htmlagilitypack.codeplex.com/SourceControl/latest#Release/1_4_0/HtmlAgilityPack/HtmlNode.cs .

You're asking the top element to select nodes matching some xpath. Unless your xpath expression starts with a //, you're asking it for relative nodes, which are descendant nodes. A document element is a not a descendant of itself, because no element is a descendant of itself.

like image 20
smartcaveman Avatar answered Dec 31 '25 04:12

smartcaveman