 

Scraping malformed HTML with PHP DomDocument

I'm using PHP DOMDocument + XPath to scrape various web pages. I've found that in some cases DOMDocument is unable even to load the HTML and just returns an empty result, for example when the page contains two body tags or has a wrong DOCTYPE declaration. I've tried preprocessing the malformed HTML with PHP Tidy, and it really helps, but PHP Tidy is very slow!

I don't want to use any third-party libraries like Simple Html Dom Parser.

Please advise how to deal with malformed HTML using PHP DOMDocument. Should I write a custom regexp to fix the broken HTML before passing it to DOMDocument? Or have I missed some setting in PHP DOMDocument?

UPD

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, 'http://example.com');
$result = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($result);
libxml_clear_errors();
var_dump($dom);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query(".//*[@id='content']/ul/li/div[2]/h3/a");

var_dump($nodes); // Nothing

Result of var_dump($dom);

object(DOMDocument)#25 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
  ["implementation"]=>
  string(22) "(object value omitted)"
  ["documentElement"]=>
  NULL
  ["actualEncoding"]=>
  string(5) "UTF-8"
  ["encoding"]=>
  string(5) "UTF-8"
  ["xmlEncoding"]=>
  string(5) "UTF-8"
  ["standalone"]=>
  bool(true)
  ["xmlStandalone"]=>
  bool(true)
  ["version"]=>
  NULL
  ["xmlVersion"]=>
  NULL
  ["strictErrorChecking"]=>
  bool(true)
  ["documentURI"]=>
  NULL
  ["config"]=>
  NULL
  ["formatOutput"]=>
  bool(false)
  ["validateOnParse"]=>
  bool(false)
  ["resolveExternals"]=>
  bool(false)
  ["preserveWhiteSpace"]=>
  bool(true)
  ["recover"]=>
  bool(false)
  ["substituteEntities"]=>
  bool(false)
  ["nodeName"]=>
  string(9) "#document"
  ["nodeValue"]=>
  NULL
  ["nodeType"]=>
  int(13)
  ["parentNode"]=>
  NULL
  ["childNodes"]=>
  string(22) "(object value omitted)"
  ["firstChild"]=>
  string(22) "(object value omitted)"
  ["lastChild"]=>
  string(22) "(object value omitted)"
  ["previousSibling"]=>
  NULL
  ["attributes"]=>
  NULL
  ["ownerDocument"]=>
  NULL
  ["namespaceURI"]=>
  NULL
  ["prefix"]=>
  string(0) ""
  ["localName"]=>
  NULL
  ["baseURI"]=>
  NULL
  ["textContent"]=>
  string(0) ""
}

UPD2. A repeated <body> tag turns out to be fine for DOMDocument. The real problem was leading whitespace in the HTML; it was solved by adding trim(): $dom->loadHTML(trim($result));
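
For clarity, a minimal sketch of the fixed flow (same variables as in the snippet above; the trim() call is the only change):

$dom = new DOMDocument();
libxml_use_internal_errors(true);
// trim() strips the leading whitespace that made loadHTML() produce an empty document
$dom->loadHTML(trim($result));
libxml_clear_errors();

var_dump($dom->documentElement === null); // false now, so XPath queries have something to work on

$xpath = new DOMXPath($dom);
$nodes = $xpath->query(".//*[@id='content']/ul/li/div[2]/h3/a");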

ymakux asked Oct 17 '25 18:10


1 Answer

DOMDocument's loadHTML() method copes fairly well with malformed HTML; however, it is going to generate a lot of errors. You will want to suppress these errors from bubbling up into your default error handler, like this:

<?php
// some process of fetching the HTML page
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($scrapedPage);

It might be worthwhile using cURL to grab the file to be scraped, if you are not doing that already, before passing it to DOMDocument, so that you are not suffering from timeout issues while dealing with very bad HTML. This would also let you cache the file locally and inspect the errors that are being encountered. It would also mean that you have a malformed HTML example to show for your next question.
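
As an illustration, here is a sketch of that approach (the URL and the /tmp/page.html path are just placeholders); libxml_get_errors() exposes every parser complaint, so you can see exactly what the markup does wrong:

<?php
$ch = curl_init('http://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // don't hang on a slow or broken host
$html = curl_exec($ch);
curl_close($ch);

file_put_contents('/tmp/page.html', $html);  // keep a local copy for later inspection

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);

// Each LibXMLError carries the message plus the line/column in the fetched HTML.
foreach (libxml_get_errors() as $error) {
    printf("line %d, col %d: %s", $error->line, $error->column, $error->message);
}
libxml_clear_errors();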

Since PHP 5.4.0 and Libxml 2.6.0, you can also use the optional options parameter to pass additional Libxml flags (a short sketch follows the list). Some of these might be of use:

  • LIBXML_HTML_NODEFDTD: prevents a default doctype from being added when one is not found
  • LIBXML_PARSEHUGE: relaxes any hardcoded limits in the parser. This affects limits such as the maximum depth of a document, entity recursion, and the size of text nodes.
  • Read more: http://php.net/manual/en/libxml.constants.php
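
A minimal sketch of combining those flags with loadHTML() (here $scrapedPage is the fetched HTML from the snippet above):

$doc = new DOMDocument();
libxml_use_internal_errors(true);
// NODEFDTD: no implied doctype; PARSEHUGE: lift hardcoded parser limits
$doc->loadHTML($scrapedPage, LIBXML_HTML_NODEFDTD | LIBXML_PARSEHUGE);
libxml_clear_errors();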
Matthew Brown aka Lord Matt answered Oct 19 '25 08:10