Simple HTML DOM Parser not that simple

Notwithstanding the name, using PHP Simple HTML DOM Parser isn’t always simple. While working on some issues with WP DoNotTrack’s SuperClean mode, I encountered these two quirks:

  1. By default, Simple HTML DOM removes line breaks. That means that when you write the modified DOM back to a string for output, some (sloppy) JavaScript that depends on those line breaks, for example code with missing semicolons or // comments, is bound to break. The solution: pass extra arguments to the DOM-creating functions, as “documented” in Simple HTML DOM’s source code. For str_get_html, the signature reads:
    function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
    

    Set the 5th argument ($stripRN) to false to tell the parser not to remove “\r\n” line breaks.

  2. Simple HTML DOM is very liberal. It is so liberal, in fact, that it will try to make a DOM out of whatever you throw at it, without even blinking. Until you try to find elements using “find” on the DOM object, that is, because at that point you might get a “Fatal error: Call to a member function find() on a non-object” thrown back at you, meaning that what you got back was not a DOM object at all. You can avoid that nastiness by checking the object for the existence of the find method and, while you’re at it, also checking whether there is an HTML element in the DOM:
    $html = file_get_html('http://url.to/filename.html');
    // first check if $html->find exists
    if (method_exists($html,"find")) {
         // then check if the html element exists to avoid trying to parse non-html
         if ($html->find('html')) {
              // and only then start searching (and manipulating) the dom
         }
    }
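
Both workarounds can be combined into one defensive snippet. This is only a minimal sketch: it assumes simple_html_dom.php has been included, and the helper names parse_keeping_linebreaks and is_usable_dom are hypothetical, made up for illustration rather than taken from the library.

```php
<?php
// Hypothetical helpers for illustration; neither is part of Simple HTML DOM.

// Quirk 1: parse a string while keeping line breaks ($stripRN = false).
// Returns false when the library has not been loaded.
function parse_keeping_linebreaks($markup)
{
    if (!function_exists('str_get_html')) {
        return false; // assumption: simple_html_dom.php was not included
    }
    // Arguments 2-4 repeat the defaults from the signature quoted above.
    return str_get_html($markup, true, true, DEFAULT_TARGET_CHARSET, false);
}

// Quirk 2: only trust the result if it is an object with a find() method
// and it actually contains an <html> element.
function is_usable_dom($dom)
{
    if (!is_object($dom) || !method_exists($dom, 'find')) {
        return false;
    }
    return (bool) $dom->find('html');
}

$html = parse_keeping_linebreaks("<html><body>\n<p>hi</p>\n</body></html>");
if (is_usable_dom($html)) {
    // safe to search (and manipulate) the DOM here
    echo count($html->find('p')), " paragraph(s) found\n";
} else {
    echo "Not a usable HTML document\n";
}
```

The is_object() check comes first so that a failed load (anything that is not an object) is rejected before find() is ever called, which is exactly the situation that triggers the fatal error.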

So that’s how to put the simple back into PHP Simple HTML DOM Parser. Until the next quirk comes up, because that’s what parsing HTML is all about after all, no?

7 thoughts on “Simple HTML DOM Parser not that simple”

  1. Johnny Lai

    Hi Frank,

    Thank you for explaining the ‘PHP Simple HTML DOM Parser’ and the ‘Simple (Difficult) HTML DOM’. It’s not that easy, as you point out.

    I have tried to implement your code in mine, but I still get the same error ‘Fatal error: Call to a member function find() on a non-object’

    Do you have the time to take a look at my code to see what I have done wrong? I’m willing to pay you for your time if you want that.

    Thanks!

    Regards,
    Johnny Lai

    1. frank Post author

      I don’t take money from strangers, but sure, I’d be willing to have a look at your code (no guarantees though, my experience with PHP & Simple HTML DOM Parser is limited to simple trial-and-error development). You can mail me at futtta at gmail dot com.

  2. Paul

    Thank you for this very informative post! I spent several hours trying to solve the infamous Fatal Error problem, and your method_exists($html,"find") check is what fixed it for me. Thanks again!

  3. cityalex

    $url = 'http://www.example.com';
    $context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
    $response = file_get_contents($url, false, $context);
    $html = str_get_html($response);

    This solves your problem when you have to access the site over SSL.

    Simple is simple.

    1. frank Post author

      errr … it’s been almost 4 years since I wrote this post, but which of the above problems does your code solve, cityalex?
