PHP HTML parsing performance shootout; regex vs DOM

As I wrote earlier, an Autoptimize user proposed switching from regular expression-based script & style extraction to native PHP DOM functions (optionally with XPath). I created a small test-script to compare performance, and the DOM methods are on average 500% slower than the preg_match-based solution. Here are some details:

  • There are 3 tests: regular expression-based (preg_match), DOM + getElementsByTagName and DOM + XPath. You can see the source here and see it in action here.
  • The code in all 3 test cases does what Autoptimize does to start with when optimizing JavaScript:
    1. extract all JavaScript (the code if inline, the URL if external) and add it to an array
    2. remove the JavaScript from the HTML
  • With each load of the test-script, each of the 3 tests is executed 100 times and the total time per method is displayed.
  • The test-script was run 5 times on 3 different HTML files: one small mobile page with some JavaScript and two bigger desktop ones with lots of JS.
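
The two steps and the 100-run loop described above could look roughly like the sketch below. This is not the source of the actual test-script: the function names, the regex pattern, the sample HTML and the use of microtime() for timing are all assumptions made for illustration.

```php
<?php
// Sketch only: names, pattern and sample HTML are assumptions,
// not the code of the original test-script or of Autoptimize.

/** Regex variant: collect <script> blocks, then strip them from the HTML. */
function extract_with_regex(string $html): array
{
    $scripts = [];
    if (preg_match_all('#<script\b([^>]*)>(.*?)</script>#is', $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // External script: keep the URL. Inline script: keep the code.
            if (preg_match('#\ssrc\s*=\s*(["\'])(.*?)\1#i', $m[1], $src)) {
                $scripts[] = $src[2];
            } else {
                $scripts[] = $m[2];
            }
            $html = str_replace($m[0], '', $html);
        }
    }
    return [$scripts, $html];
}

/** DOM variant: getElementsByTagName, or XPath when $useXpath is true. */
function extract_with_dom(string $html, bool $useXpath = false): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $nodes = $useXpath
        ? (new DOMXPath($dom))->query('//script')
        : $dom->getElementsByTagName('script');

    $scripts = [];
    // Copy to an array first: removing nodes from a live node list
    // while iterating over it would skip elements.
    foreach (iterator_to_array($nodes, false) as $node) {
        $scripts[] = $node->hasAttribute('src')
            ? $node->getAttribute('src')
            : $node->textContent;
        $node->parentNode->removeChild($node);
    }
    return [$scripts, $dom->saveHTML()];
}

/** Run one method $runs times and return the total wall-clock time. */
function time_method(callable $fn, string $html, int $runs = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($html);
    }
    return microtime(true) - $start;
}

$html = '<html><head><script src="/js/a.js"></script></head>'
      . '<body><p>Hi</p><script>var x = 1;</script></body></html>';

$methods = [
    'regex'     => 'extract_with_regex',
    'dom'       => fn($h) => extract_with_dom($h),
    'dom+xpath' => fn($h) => extract_with_dom($h, true),
];

foreach ($methods as $name => $fn) {
    printf("%-10s %.4f\n", $name, time_method($fn, $html));
}
```

On a tiny sample like this the absolute numbers mean little; the comparison only becomes interesting when $html holds a full page.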

The detailed results:

                   total time regex   total time dom   total time dom+xpath
arturo’s HP        0.611              4.8366           4.977
deredactie HP      2.3322             5.615            5.879
m deredactie HP    0.0696             0.4604           0.4558
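
To relate these totals to the "500% slower" average, the slowdown per page can be worked out as below. The sketch reads the totals as 0.611 / 4.8366 / 4.977, 2.3322 / 5.615 / 5.879 and 0.0696 / 0.4604 / 0.4558 (regex / dom / dom+xpath); that reading of the figures, and the helper name slowdown(), are assumptions.

```php
<?php
// Totals per page: [regex, dom, dom+xpath], as read from the results above.
$results = [
    "arturo's HP"     => [0.611,  4.8366, 4.977],
    'deredactie HP'   => [2.3322, 5.615,  5.879],
    'm deredactie HP' => [0.0696, 0.4604, 0.4558],
];

/** How many times longer $other took compared to the regex total. */
function slowdown(float $regex, float $other): float
{
    return $other / $regex;
}

$domFactors = [];
foreach ($results as $page => [$regex, $dom, $xpath]) {
    $domFactors[] = slowdown($regex, $dom);
    printf(
        "%-16s dom x%.1f  dom+xpath x%.1f\n",
        $page,
        slowdown($regex, $dom),
        slowdown($regex, $xpath)
    );
}

// Unweighted mean of the dom factors across the three pages.
printf("mean dom factor  x%.1f\n", array_sum($domFactors) / count($domFactors));
```

With these figures the DOM runs take between roughly 2.4 and 8 times as long as the regex runs, depending on the page.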

So while parsing HTML with regular expressions might be frowned upon in developer communities (and rightly so, as a lot can go wrong with PCRE in PHP), it is vastly superior with regard to performance. In the very limited scope of Autoptimize, where the regex-based approach is tried & tested on thousands of blogs, using DOM would simply create too much overhead.
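
As an illustration of what can go wrong with PCRE, here are two common pitfalls: a greedy quantifier swallowing everything between the first and the last script tag, and preg_match_all() returning false on an internal error (for example when the backtrack limit is hit on large input) rather than a normal "no match". The helper name safe_match_all() and the patterns are hypothetical and are not taken from Autoptimize.

```php
<?php
/**
 * Hypothetical helper: run preg_match_all() and refuse to continue when
 * PCRE reports a failure instead of an ordinary "no match" (0).
 */
function safe_match_all(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    if ($count === false) {
        // preg_last_error() returns a PREG_*_ERROR constant,
        // e.g. PREG_BACKTRACK_LIMIT_ERROR.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $matches;
}

$html = '<p>a</p><script>one();</script><p>b</p><script>two();</script>';

// Lazy quantifier (.*?): stops at the first closing tag, one match per block.
$lazy = safe_match_all('#<script\b[^>]*>(.*?)</script>#is', $html);

// Greedy quantifier (.*): runs on to the last closing tag, merging both
// blocks and the markup between them into a single match.
$greedy = safe_match_all('#<script\b[^>]*>(.*)</script>#is', $html);

printf("lazy: %d match(es), greedy: %d match(es)\n", count($lazy), count($greedy));
```

Checking for false matters because a silent failure would otherwise look like a page without any scripts.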

4 thoughts on “PHP HTML parsing performance shootout; regex vs DOM”

  1. Implementing a regex that handles all HTML use-cases without running into edge-cases is difficult (it might “greedily” match more than intended).

    • Maybe, but Autoptimize has a huge amount of experience (10 years, running on over 1 million sites by now) in that context, so it *can* be done with a reasonable amount of success 😉

  2. Hi Frank,

    The best explanation why you shouldn’t parse HTML with regex, can be found here. 😜

    On a more serious note, I’m surprised at the DOM+XPath results. I’ve seen much smaller differences in other tests (~0.5 second).
