DOMDocument removing closing tag within script tags

I have the following test.php file, and when I run it, the closing </h1> tag gets removed.

<?php

$doc = new DOMDocument();

$doc->loadHTML('<html>
    <head>
        <script>
            console.log("<h1>hello</h1>");
        </script>
    </head>
    <body>

    </body>
</html>');

echo $doc->saveHTML();

Here is the result when I execute the file:

PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : h1 in Entity, line: 4 in /home/ryan/NetBeansProjects/blog/test.php on line 14

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <head>
        <script>
            console.log("<h1>hello");
        </script>
    </head>
    <body>
    </body>
</html>

So why is the tag being removed? It's inside a JavaScript string, so shouldn't the parser ignore it?

asked Oct 28 '25 14:10 by Get Off My Lawn


1 Answer

The only solution that comes to mind is to match the script blocks with preg_match_all(), replace each one with a temporary placeholder such as <script id="myuniqueid"></script>, and, once the DOM work is done, swap the original scripts back in, like this:

//  The dom doc
$doc = new DOMDocument();

//  The html
$html = '<html>
    <head>
        <script>
            console.log("<h1>hello</h1>");
        </script>
    </head>
    <body>

    </body>
</html>';

//  Pattern for scripts: opening tag, any content (including quotes and newlines), closing tag
$pattern = '/<script\b[^>]*>.*?<\/script>/is';
//  Get all scripts
preg_match_all($pattern, $html, $matches);

//  Only unique scripts
$matches = array_unique( $matches[0] );

//  Construct the arrays for replacement
//  (initialised up front so str_replace() still works when no scripts are found)
$simple = [];
$complete = [];
foreach ( $matches as $match ) {
  //  The placeholder script
  $id = uniqid('script_');
  $uniqueScript = "<script id=\"$id\"></script>";
  $simple[] = $uniqueScript;
  //  The complete script
  $complete[] = $match;
}

//  Replace the scripts with the simple scripts
$html = str_replace($complete, $simple, $html);
//  load the html into the dom
$doc->loadHTML( $html);

//  Do the dom management here
//  TODO: Whatever you do with the dom

//  When finished
//  Get the html back
$html = $doc->saveHTML();
//  Replace the scripts back
$html = str_replace($simple, $complete, $html);
//  Print the result
echo $html;

This solution prints the document with the script contents intact and without DOM parser warnings.
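
If the same trick is needed in more than one place, it could be wrapped in a pair of small helpers. The following is only a sketch: the function names protectScripts() and restoreScripts() and the placeholder id format are made up for illustration, and the regex is a simple one that assumes no script body contains a literal </script> inside a string.

```php
<?php
// Hypothetical helpers; names and placeholder format are assumptions for illustration.

/**
 * Swap every <script>...</script> block for an empty placeholder script.
 * Returns [rewritten HTML, map of placeholder => original block].
 */
function protectScripts(string $html): array
{
    $map = [];
    $protected = preg_replace_callback(
        '/<script\b[^>]*>.*?<\/script>/is',
        function (array $m) use (&$map): string {
            // A running counter keeps each placeholder id distinct
            $placeholder = '<script id="protected_script_' . count($map) . '"></script>';
            $map[$placeholder] = $m[0];
            return $placeholder;
        },
        $html
    );
    return [$protected, $map];
}

/** Put the original script blocks back in place of their placeholders. */
function restoreScripts(string $html, array $map): string
{
    return strtr($html, $map);
}

$html = '<html><head><script>console.log("<h1>hello</h1>");</script></head><body></body></html>';

[$protected, $map] = protectScripts($html);

$doc = new DOMDocument();
$doc->loadHTML($protected);
//  ... DOM manipulation would go here ...
$result = restoreScripts($doc->saveHTML(), $map);

echo $result;
```

Keeping the placeholder-to-script map next to the rewritten HTML means the restore step does not depend on the order of the matches. Note that a script node removed or given a different id during the DOM step will not be restored, since its placeholder no longer appears in the output.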

answered Oct 31 '25 05:10 by espino316