 

Scraping malformed HTML with PHP DomDocument

I'm using PHP DOMDocument + XPath to scrape various web pages. I've found that in some cases DOMDocument is unable even to load the HTML and just returns an empty result, for example when the page contains two body tags or has a wrong DOCTYPE declaration. I've tried preprocessing the malformed HTML with PHP Tidy, and it really helps, but PHP Tidy is very slow!

I don't want to use any third-party libraries like Simple Html Dom Parser.

Please advise how to deal with malformed HTML using PHP DOMDocument. Should I write a custom regexp to fix the broken HTML before passing it to DOMDocument? Or have I missed some setting in PHP DOMDocument?

UPD

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, 'http://example.com');
$result = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($result);
libxml_clear_errors();
var_dump($dom);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query(".//*[@id='content']/ul/li/div[2]/h3/a");

var_dump($nodes); // Nothing

Result of var_dump($dom);

object(DOMDocument)#25 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
  ["implementation"]=>
  string(22) "(object value omitted)"
  ["documentElement"]=>
  NULL
  ["actualEncoding"]=>
  string(5) "UTF-8"
  ["encoding"]=>
  string(5) "UTF-8"
  ["xmlEncoding"]=>
  string(5) "UTF-8"
  ["standalone"]=>
  bool(true)
  ["xmlStandalone"]=>
  bool(true)
  ["version"]=>
  NULL
  ["xmlVersion"]=>
  NULL
  ["strictErrorChecking"]=>
  bool(true)
  ["documentURI"]=>
  NULL
  ["config"]=>
  NULL
  ["formatOutput"]=>
  bool(false)
  ["validateOnParse"]=>
  bool(false)
  ["resolveExternals"]=>
  bool(false)
  ["preserveWhiteSpace"]=>
  bool(true)
  ["recover"]=>
  bool(false)
  ["substituteEntities"]=>
  bool(false)
  ["nodeName"]=>
  string(9) "#document"
  ["nodeValue"]=>
  NULL
  ["nodeType"]=>
  int(13)
  ["parentNode"]=>
  NULL
  ["childNodes"]=>
  string(22) "(object value omitted)"
  ["firstChild"]=>
  string(22) "(object value omitted)"
  ["lastChild"]=>
  string(22) "(object value omitted)"
  ["previousSibling"]=>
  NULL
  ["attributes"]=>
  NULL
  ["ownerDocument"]=>
  NULL
  ["namespaceURI"]=>
  NULL
  ["prefix"]=>
  string(0) ""
  ["localName"]=>
  NULL
  ["baseURI"]=>
  NULL
  ["textContent"]=>
  string(0) ""
}

UPD2. A repeated <body> tag turns out to be fine for DOMDocument. The real problem was leading whitespace in the HTML; it was solved by adding trim(): $dom->loadHTML(trim($result));
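
For clarity, a minimal sketch of the fixed flow (same variables as in the snippet above; the trim() call is the only change):

$dom = new DOMDocument();
libxml_use_internal_errors(true);
// trim() strips the leading whitespace that made loadHTML() produce an empty document
$dom->loadHTML(trim($result));
libxml_clear_errors();

var_dump($dom->documentElement === null); // false now, so XPath queries have something to work on

$xpath = new DOMXPath($dom);
$nodes = $xpath->query(".//*[@id='content']/ul/li/div[2]/h3/a");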

ymakux asked Oct 17 '25 18:10


1 Answer

DOMDocument's loadHTML() method copes fairly well with malformed HTML; however, it is going to generate a lot of errors. You will want to suppress these errors from bubbling up into your default error handler, like this:

<?php
// some process of fetching the HTML page
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($scrapedPage);

It might be worthwhile using cURL to grab the file to be scraped, if you are not doing that already, before passing it to DOMDocument, so that you are not suffering from timeout issues while dealing with very bad HTML. This would also let you cache the file locally and inspect the errors that are being encountered. It would also mean that you have a malformed HTML example to show for your next question.
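
As an illustration, here is a sketch of that approach (the URL and the /tmp/page.html path are just placeholders); libxml_get_errors() exposes every parser complaint, so you can see exactly what the markup does wrong:

<?php
$ch = curl_init('http://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // don't hang on a slow or broken host
$html = curl_exec($ch);
curl_close($ch);

file_put_contents('/tmp/page.html', $html);  // keep a local copy for later inspection

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);

// Each LibXMLError carries the message plus the line/column in the fetched HTML.
foreach (libxml_get_errors() as $error) {
    printf("line %d, col %d: %s", $error->line, $error->column, $error->message);
}
libxml_clear_errors();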

Since PHP 5.4.0 and Libxml 2.6.0, you can also use the optional options parameter to pass additional Libxml flags (a short sketch follows the list). Some of these might be of use:

  • LIBXML_HTML_NODEFDTD: prevents a default doctype from being added when one is not found
  • LIBXML_PARSEHUGE: relaxes any hardcoded limits in the parser. This affects limits such as the maximum depth of a document, entity recursion, and the size of text nodes.
  • Read more: http://php.net/manual/en/libxml.constants.php
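
A minimal sketch of combining those flags with loadHTML() (here $scrapedPage is the fetched HTML from the snippet above):

$doc = new DOMDocument();
libxml_use_internal_errors(true);
// NODEFDTD: no implied doctype; PARSEHUGE: lift hardcoded parser limits
$doc->loadHTML($scrapedPage, LIBXML_HTML_NODEFDTD | LIBXML_PARSEHUGE);
libxml_clear_errors();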
Matthew Brown aka Lord Matt answered Oct 19 '25 08:10