I'm using DOMDocument to parse some HTML, and want to replace breaks with \n. However, I'm having trouble identifying where a break actually occurs within the document.
Given the following snippet of HTML - from a much larger file that I'm reading using $dom->loadHTMLFile($pFilename):
<p>Multiple-line paragraph<br />that has a close tag</p>
and my code:
foreach ($dom->getElementsByTagName('*') as $domElement) {
    switch (strtolower($domElement->nodeName)) {
        case 'p':
            $str = (string) $domElement->nodeValue;
            echo 'PARAGRAPH: ', $str, PHP_EOL;
            break;
        case 'br':
            echo 'BREAK: ', PHP_EOL;
            break;
    }
}
I get:
PARAGRAPH: Multiple-line paragraphthat has a close tag
BREAK:
How can I identify the position of that break within the paragraph, and replace it with a \n?
Or is there a better alternative to DOMDocument for parsing HTML that may or may not be well-formed?
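You can't get the position of an element using getElementsByTagName: it returns the matching elements, but not where they sit relative to the surrounding text. The nodeValue of the paragraph is just the concatenated text of its descendants, which is why the break disappears from it. Instead, go through the childNodes of each element and process text nodes and elements separately.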
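For the simple case in the question, where the paragraph contains only text and br elements, a flat loop over the paragraph's direct children may be enough. This is a minimal sketch, assuming no nested markup such as b or span inside the paragraph (text inside nested elements would be skipped here):

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');

foreach ($dom->getElementsByTagName('p') as $p) {
    $str = '';
    // Visit the paragraph's direct children in document order.
    foreach ($p->childNodes as $child) {
        if ($child instanceof DOMText) {
            // A run of plain text.
            $str .= $child->nodeValue;
        } elseif ($child instanceof DOMElement
            && strtolower($child->nodeName) === 'br') {
            // The break sits exactly here, between the two text nodes.
            $str .= "\n";
        }
    }
    echo 'PARAGRAPH: ', $str, PHP_EOL;
}
```

Because the children are visited in order, the "\n" lands between the two text runs instead of being lost.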
In the general case you'll need recursion, like this:
function processElement(DOMNode $element) {
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMText) {
            echo $child->nodeValue, PHP_EOL;
        } elseif ($child instanceof DOMElement) {
            switch ($child->nodeName) {
                case 'br':
                    echo 'BREAK: ', PHP_EOL;
                    break;
                case 'p':
                    echo 'PARAGRAPH: ', PHP_EOL;
                    processElement($child);
                    echo 'END OF PARAGRAPH;', PHP_EOL;
                    break;
                // etc.
                // other cases:
                default:
                    processElement($child);
            }
        }
    }
}

$D = new DOMDocument;
$D->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
processElement($D);
This will output:
PARAGRAPH:
Multiple-line paragraph
BREAK:
that has a close tag
END OF PARAGRAPH;
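To get the string the question actually asks for, the same recursion can build and return the text instead of echoing it. The following is a sketch rather than a definitive implementation: textWithBreaks is a hypothetical helper name, not part of the DOM extension, and the return type declaration assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: returns the text content of $node,
// with every <br> element replaced by "\n".
function textWithBreaks(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            // A <br> becomes a newline; any other element is descended into,
            // so text inside nested tags is kept.
            $text .= strtolower($child->nodeName) === 'br'
                ? "\n"
                : textWithBreaks($child);
        }
    }
    return $text;
}

$dom = new DOMDocument;
$dom->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');

foreach ($dom->getElementsByTagName('p') as $p) {
    echo 'PARAGRAPH: ', textWithBreaks($p), PHP_EOL;
}
```

Other node types, such as comments, are neither DOMText nor DOMElement and are simply ignored by this helper.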