
Some HTML DOM parsing gotchas in PHP’s DOMDocument

Although I had used Simple HTML DOM parser for WP DoNotTrack, I’ve been looking into native PHP HTML DOM parsing as a possible replacement for the regular expressions in Autoptimize, as proposed by Arturo. I won’t go into the performance comparison results just yet, but here are some of the things I learned while experimenting with DOMDocument, which in turn might help innocent passers-by of this blogpost.

  • loadHTML doesn’t always play nice with different character encodings (it tends to assume ISO-8859-1 unless the markup declares a charset); you might need something like mb_convert_encoding to work around that.
  • loadHTML will try to “repair” your HTML to make sure an XML-parser can work with it. So what goes in will not come out the same way.
  • loadHTML will spit out tons of warnings about the HTML not being well-formed; you might want to suppress them by prepending the call with an @ (e.g. @$dom->loadHTML($htmlstring);) or by calling libxml_use_internal_errors(true) before loading.
  • If you use e.g. getElementsByTagName to extract nodes into a separate DOMNodeList and then iterate over that list to change the DOMDocument, you can get … unexpected behavior, as the DOMNodeList is live and gets updated whenever changes are made to the DOMDocument. Copy the DOMNodes from the DOMNodeList into a new array (which will not get altered) and iterate over that to update the DOMDocument, as seen in the example below.
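  • removeChild is inherited from DOMNode and only removes a direct child of the node it is called on. This means $dom->removeChild($node) will not work (it typically fails with a “Not Found” error) unless $node happens to be a direct child of the document itself. Instead, invoke removeChild on the parent of the node you want to remove, as seen in the example below.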
    // loadHTML from string, suppressing errors
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // get all script-nodes
    $_scripts = $dom->getElementsByTagName("script");

    // move the result from a DOMNodeList to an array
    $scripts = array();
    foreach ($_scripts as $script) {
        $scripts[] = $script;
    }

    // iterate over array and remove script-tags from DOM
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    // write DOM back to the HTML-string
    $html = $dom->saveHTML();
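
For what it’s worth, the same idea can be wrapped in a small function. This is only a minimal sketch: the name removeScriptTags is made up for illustration, it uses libxml_use_internal_errors instead of the @ operator to keep the parser quiet, and it uses iterator_to_array as a shortcut for copying the DOMNodeList into an array. It does not deal with character encodings at all.

```php
<?php
// Hypothetical helper (name and signature are illustrative, not from the post).
function removeScriptTags(string $html): string
{
    // Collect libxml parse errors internally instead of emitting PHP warnings,
    // remembering the previous setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected errors (libxml_get_errors() could inspect them first).
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName returns a live DOMNodeList; copy it into a plain
    // array so that removing nodes does not change what is being iterated.
    $scripts = iterator_to_array($dom->getElementsByTagName('script'));

    foreach ($scripts as $script) {
        // removeChild has to be called on the direct parent of the node.
        $script->parentNode->removeChild($script);
    }

    // Note: loadHTML "repairs" the markup, so the result may differ from the input.
    return $dom->saveHTML();
}

// Usage: the script element is dropped, the paragraph survives.
echo removeScriptTags('<html><body><p>Hi</p><script>alert(1);</script></body></html>');
```

Restoring the previous libxml error setting keeps the function from silently changing error handling for whatever code runs after it.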

Now chop chop, back to my code to finish that performance comparison. Who knows what else we’ll learn ;-)