As I wrote earlier, an Autoptimize user proposed switching from regular expression-based script & style extraction to native PHP DOM functions (optionally with XPath). I created a small test-script to compare performance, and the DOM methods are on average roughly 5 to 6 times slower than the preg_match-based solution (between about 2.4 and 8 times, depending on the page). Here are some details:
- There are 3 tests: regular expression-based (preg_match), DOM + getElementsByTagName and DOM + XPath. You can see the source here and see it in action here.
- The code in all 3 test cases does what Autoptimize does to start with when optimizing JavaScript:
  - extract all JavaScript (code if inline, URL if external) and add it to an array
  - remove the JavaScript from the HTML
- With each load of the test-script, each of the 3 tests is executed 100 times and the total time per method is displayed.
- The test-script was run 5 times on 3 different HTML files: one small mobile page with some JavaScript and two bigger desktop ones with lots of JS.
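To make the setup above concrete, here is a minimal sketch of what the three test cases could look like. It is not the actual test-script or Autoptimize's code: the function names, the regular expressions and the tiny sample page are all assumptions made for illustration, and the regex is deliberately simple (it would, for example, also pick up a data-src attribute).

```php
<?php
// Hypothetical sketch: names and patterns are assumptions, not Autoptimize code.

// Test 1: regular expressions. Returns [scripts, html without scripts].
function extract_scripts_regex(string $html): array
{
    $scripts = [];
    if (preg_match_all('#<script\b([^>]*)>(.*?)</script>#is', $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // External script: keep the URL. Inline script: keep the code.
            if (preg_match('#\bsrc\s*=\s*(["\'])(.*?)\1#is', $m[1], $src)) {
                $scripts[] = $src[2];
            } else {
                $scripts[] = $m[2];
            }
            // Remove the whole tag from the HTML.
            $html = str_replace($m[0], '', $html);
        }
    }
    return [$scripts, $html];
}

// Tests 2 and 3: DOM, with getElementsByTagName() or with XPath.
function extract_scripts_dom(string $html, bool $useXpath = false): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $useXpath
        ? (new DOMXPath($doc))->query('//script')
        : $doc->getElementsByTagName('script');

    $scripts = [];
    // Copy to an array first: getElementsByTagName() returns a live list
    // that shrinks while nodes are being removed.
    foreach (iterator_to_array($nodes) as $node) {
        $scripts[] = $node->hasAttribute('src')
            ? $node->getAttribute('src')
            : $node->textContent;
        $node->parentNode->removeChild($node);
    }
    return [$scripts, $doc->saveHTML()];
}

// Run one method $runs times and return the total wall-clock time in seconds.
function time_runs(callable $fn, string $html, int $runs = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($html);
    }
    return microtime(true) - $start;
}

// Tiny made-up page with one external and one inline script.
$html = '<html><head><script src="a.js"></script></head>'
      . '<body><p>Hi</p><script>var x = 1;</script></body></html>';

printf("regex:     %.4f s\n", time_runs('extract_scripts_regex', $html));
printf("dom:       %.4f s\n", time_runs('extract_scripts_dom', $html));
printf("dom+xpath: %.4f s\n", time_runs(fn ($h) => extract_scripts_dom($h, true), $html));
```

On a page this small the absolute numbers mean little; the point is only to show where the DOM variants do extra work (building a full document tree and serializing it again) compared to a couple of pattern matches.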
The detailed results:
HTML file | total time regex | total time dom | total time dom+xpath |
arturo’s HP | 0.611 | 4.8366 | 4.977 |
deredactie HP | 2.3322 | 5.615 | 5.879 |
m deredactie HP | 0.0696 | 0.4604 | 0.4558 |
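For those who want to check the slowdown per page, the ratios can be derived from the table with a few lines of PHP. This is just arithmetic on the numbers above; the array layout and variable names are my own choice for the example.

```php
<?php
// Timings copied from the results table: [regex, dom, dom+xpath].
$results = [
    "arturo's HP"     => [0.611,  4.8366, 4.977],
    'deredactie HP'   => [2.3322, 5.615,  5.879],
    'm deredactie HP' => [0.0696, 0.4604, 0.4558],
];

// Print how many times slower each DOM variant is than the regex version.
foreach ($results as $page => [$regex, $dom, $xpath]) {
    printf("%-16s dom: %.1fx  dom+xpath: %.1fx\n", $page, $dom / $regex, $xpath / $regex);
}
```

This gives roughly 7.9x and 8.1x for arturo's HP, 2.4x and 2.5x for deredactie HP, and 6.6x and 6.5x for m deredactie HP.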
So while parsing HTML with regular expressions might be frowned upon in developer communities (and rightly so, as a lot can go wrong with PCRE in PHP), it is vastly superior with regard to performance. In the very limited scope of Autoptimize, where the regex-based approach is tried & tested on thousands of blogs, using DOM would simply create too much overhead.