Some HTML DOM parsing gotchas in PHP’s DOMDocument

Although I had used Simple HTML DOM parser for WP DoNotTrack, I’ve been looking into native PHP HTML DOM parsing as a possible replacement for regular expressions for Autoptimize as proposed by Arturo. I won’t go into the performance comparison results just yet, but here’s some of the things I learned while experimenting with DOMDocument which in turn might help innocent passers-by of this blogpost.

  • loadHTML doesn’t always play nice with different character encodings, you might need something like mb_convert_encoding to work around that.
  • loadHTML will try to “repair” your HTML to make sure an XML-parser can work with it. So what goes in will not come out the same way.
  • loadHTML will spit out tons of warnings or notices about the HTML not being XML; you might want to suppress error-reporting by prepending the command with an @ (e.g. @$dom->loadHTML($htmlstring);)
  • If you use e.g. getELementsByTagName to extract nodes into a seperate DomNodeList and you want to use that to change the DomDocument can result in … unexpected behavior as the DomNodeList gets updated when changes are made to the DomDocument. Copy the DomNodes from the DomNodeList into a new array (which will not get altered) and iterate over that to update the DomDocument as seen in the example below.
  • removeChild is a method of DomNode, not of DomDocument. This means $dom->removeChild(DomNode) will not work. Instead invoke removeChild on the parent of the node you want to remove as seen in the example below
// loadHTML from string, suppressing errors
$dom = new DOMDocument();
@$dom->loadHTML($html);
// get all script-nodes
$_scripts=$dom->getElementsByTagName("script");
// move the result form a DomNodeList to an array
$scripts = array();
foreach ($_scripts as $script) {
   $scripts[]=$script;
}
// iterate over array and remove script-tags from DOM
foreach ($scripts as $script) {
   $script->parentNode->removeChild($script);
}
// write DOM back to the HTML-string
$html = $dom->saveHTML();

Now chop chop, back to my code to finish that performance comparison. Who know what else we’ll learn 😉

Simple HTML DOM Parser not that simple

Notwithstanding the name, using PHP Simple HTML DOM Parser isn’t always simple. While working on some issues with WP DoNotTrack‘s SuperClean mode, I encountered these two quirks:

  1. By default Simple HTML DOM removes linebreaks. That means that when you write the modified DOM back to a string for outputting, some (sloppy) JavaScript is bound to break. The solution: pass extra arguments to the DOM-creating functions, as “documented” in the Simple HMTL DOM’s source code. For str_get_html it reads:
    function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
    

    Set the 5th argument to false to tell the parser not to remove “\r\n”‘s.

  2. Simple HTML DOM is very liberal. It is so liberal, in fact, that it will try to make a DOM out of whatever you throw at it, without even blinking. Until you try to find elements using “find” on the DOM Object, that is, because at that point you might get a “Fatal error: Call to a member function find() on a non-object“-error thrown back at you. You can avoid that nastiness by checking the object for the existence of the find-method and, while you’re at it, also check if there is a HTML-element in the DOM:
    $html = file_get_html('http://url.to/filename.html');
    // first check if $html->find exists
    if (method_exists($html,"find")) {
         // then check if the html element exists to avoid trying to parse non-html
         if ($html->find('html')) {
              // and only then start searching (and manipulating) the dom
         }
    }

So that’s how to put the simple back into PHP Simple HTML DOM Parser. Until the next quirk comes up, because that’s what parsing HTML is all about after all, no?

WP DoNotTrack 0.6.0 and beyond

I finally found some time to continue to work my other WordPress plugin. WP DoNotTrack checks for elements being added to the DOM by JavaScript to stop 3rd party tracking by some of the major plugins or themes.

Version 0.6.0, which I released last week, features a new “forced” option. This mode aims to provide better compatibility with JavaScript-optimizing plugins such as Autoptimize and W3 Total Cache by adding the relevant code only after those optimizers have done their job, using the output buffer. There will probably be a 0.6.1 today or tomorrow, to solve a small problem with mixed HTTP/HTTPS requests on the admin-page while in HTTPS. The output buffer sure is a powerful thing and for version 0.7.0, I’ll build on that to optionally filter the full HTML (with PHP Simple HTML DOM Parser) to stop unwanted requests for images, scripts or iFrames in there.

Do contact me if you found a bug, if you have questions or if you’d like specific feature to be added, I tend to rely heavily on user feedback to improve my plugins! And if you’re happy with how it works, drop by on the WP DoNotTrack-page on wordpress.org to rate it and/ or to confirm it works with your version of WordPress!