How to remove blank lines from HTML with HTMLAgilityPack?

I have an HTML document that contains many needless blank lines, which I'd like to remove. Here's a sample of the HTML:

<html>

<head>


</head>

<body>

<h1>Heading</h1>

<p>Testing

I've tried the following code, but it removes every newline; I only want to remove the blank lines.

static string RemoveLineReturns(string html)
{
    html = html.Replace(Environment.NewLine, "");
    return html;
}

Any idea how to do this with HTMLAgilityPack? Thanks, J.

bearaman asked Nov 29 '25 01:11



1 Answer

One possible way using Html Agility Pack:

// Requires: using System; using HtmlAgilityPack;
var doc = new HtmlDocument();
//TODO: load your HtmlDocument here

// Select all empty text nodes (those containing only white space).
var xpath = "//text()[not(normalize-space())]";
var emptyNodes = doc.DocumentNode.SelectNodes(xpath);

// SelectNodes returns null, not an empty collection, when nothing matches.
if (emptyNodes != null)
{
    // Replace each empty text node with a single new-line text node.
    foreach (HtmlNode emptyNode in emptyNodes)
    {
        emptyNode.ParentNode.ReplaceChild(
            HtmlTextNode.CreateNode(Environment.NewLine), emptyNode);
    }
}
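
Note that the XPath above only matches text nodes made up entirely of white space. Blank lines inside a text node that also contains text (such as the lines after "<p>Testing" in the sample) are left as they are. As a complement, and not part of the original answer, here is a minimal sketch that drops blank lines from the raw string without parsing it. The names BlankLineStripper and RemoveBlankLines are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Linq;

static class BlankLineStripper
{
    // Removes every line that is empty or contains only white space.
    // Assumption: the input uses "\r\n" or "\n" line endings; the output
    // is rejoined with Environment.NewLine.
    public static string RemoveBlankLines(string html)
    {
        var lines = html
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            .Where(line => !string.IsNullOrWhiteSpace(line));

        return string.Join(Environment.NewLine, lines);
    }
}
```

Because this works on plain text, it will also remove blank lines inside elements where white space is significant, such as pre, textarea, or script, so it is probably only suitable when the document has no such content.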
har07 answered Nov 30 '25 14:11
