Tag Archives: simple html dom parser

Simple HTML DOM Parser not that simple

Notwithstanding the name, using PHP Simple HTML DOM Parser isn’t always simple. While working on some issues with WP DoNotTrack‘s SuperClean mode, I encountered these two quirks:

  1. By default Simple HTML DOM removes linebreaks. That means that when you write the modified DOM back to a string for outputting, some (sloppy) JavaScript is bound to break. The solution: pass extra arguments to the DOM-creating functions, as “documented” in the Simple HMTL DOM’s source code. For str_get_html it reads:
    function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
    

    Set the 5th argument to false to tell the parser not to remove “\r\n”‘s.

  2. Simple HTML DOM is very liberal. It is so liberal, in fact, that it will try to make a DOM out of whatever you throw at it, without even blinking. Until you try to find elements using “find” on the DOM Object, that is, because at that point you might get a “Fatal error: Call to a member function find() on a non-object“-error thrown back at you. You can avoid that nastiness by checking the object for the existence of the find-method and, while you’re at it, also check if there is a HTML-element in the DOM:
    $html = file_get_html('http://url.to/filename.html');
    // first check if $html->find exists
    if (method_exists($html,"find")) {
         // then check if the html element exists to avoid trying to parse non-html
         if ($html->find('html')) {
              // and only then start searching (and manipulating) the dom
         }
    }

So that’s how to put the simple back into PHP Simple HTML DOM Parser. Until the next quirk comes up, because that’s what parsing HTML is all about after all, no?

WP DoNotTrack 0.6.0 and beyond

I finally found some time to continue to work my other WordPress plugin. WP DoNotTrack checks for elements being added to the DOM by JavaScript to stop 3rd party tracking by some of the major plugins or themes.

Version 0.6.0, which I released last week, features a new “forced” option. This mode aims to provide better compatibility with JavaScript-optimizing plugins such as Autoptimize and W3 Total Cache by adding the relevant code only after those optimizers have done their job, using the output buffer. There will probably be a 0.6.1 today or tomorrow, to solve a small problem with mixed HTTP/HTTPS requests on the admin-page while in HTTPS. The output buffer sure is a powerful thing and for version 0.7.0, I’ll build on that to optionally filter the full HTML (with PHP Simple HTML DOM Parser) to stop unwanted requests for images, scripts or iFrames in there.

Do contact me if you found a bug, if you have questions or if you’d like specific feature to be added, I tend to rely heavily on user feedback to improve my plugins! And if you’re happy with how it works, drop by on the WP DoNotTrack-page on wordpress.org to rate it and/ or to confirm it works with your version of WordPress!