Category Archives: Internet

All blogposts on blog.futtta.be about Internet (browsers, web development and mobile web).

Some HTML DOM parsing gotchas in PHP’s DOMDocument

Although I had used Simple HTML DOM parser for WP DoNotTrack, I’ve been looking into native PHP HTML DOM parsing as a possible replacement for regular expressions for Autoptimize, as proposed by Arturo. I won’t go into the performance comparison results just yet, but here are some of the things I learned while experimenting with DOMDocument, which in turn might help innocent passers-by of this blogpost.

  • loadHTML doesn’t always play nice with different character encodings; you might need something like mb_convert_encoding to work around that (see the sketch after the example below).
  • loadHTML will try to “repair” your HTML to make sure an XML-parser can work with it. So what goes in will not come out the same way.
  • loadHTML will spit out tons of warnings or notices about the HTML not being XML; you might want to suppress error-reporting by prepending the command with an @ (e.g. @$dom->loadHTML($htmlstring);)
  • Using e.g. getElementsByTagName to extract nodes into a separate DOMNodeList and then using that list to change the DOMDocument can result in … unexpected behavior, as the DOMNodeList gets updated when changes are made to the DOMDocument. Copy the DOMNodes from the DOMNodeList into a new array (which will not get altered) and iterate over that to update the DOMDocument, as seen in the example below.
  • removeChild is a method of DOMNode, not of DOMDocument. This means $dom->removeChild(DOMNode) will not work. Instead, invoke removeChild on the parent of the node you want to remove, as seen in the example below.
// loadHTML from string, suppressing errors
$dom = new DOMDocument();
@$dom->loadHTML($html);

// get all script-nodes
$_scripts = $dom->getElementsByTagName("script");

// copy the result from the DOMNodeList into an array
$scripts = array();
foreach ($_scripts as $script) {
   $scripts[] = $script;
}

// iterate over array and remove script-tags from DOM
foreach ($scripts as $script) {
   $script->parentNode->removeChild($script);
}

// write DOM back to the HTML-string
$html = $dom->saveHTML();
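
As for the encoding gotcha in the first bullet, a commonly used workaround is to convert the input to HTML entities before parsing. A minimal sketch, assuming the mbstring extension is available and the input really is UTF-8:

// convert the UTF-8 input to HTML entities so loadHTML interprets it correctly
$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($htmlstring, 'HTML-ENTITIES', 'UTF-8'));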

Now chop chop, back to my code to finish that performance comparison. Who knows what else we’ll learn ;-)

How to keep Autoptimize’s cache size under control (and improve visitor experience)

Confession time: Autoptimize does not have a proper cache purging mechanism. There are some good reasons for that (see below) but in most cases this is not something to worry about.

Except when it is something to worry about, of course. Because in some cases the amount of cache files generated by Autoptimize can grow to several gigabytes. Why, you might wonder? Well, for each page being loaded Autoptimize aggregates all JS (and CSS), calculates the hash of that string and checks if an optimized version is in cache using that hash. If there is a difference (even if just a comma), the hash is not the same and the aggregated CSS/JS is cached separately. This behavior is typically caused by plugins that generate JavaScript variables (or CSS selectors) that are specific to each page (or even worse, to each page request). That not only leads to a huge amount of files in the cache, but also impacts visitors, as their browsers will have to request a different optimized CSS or JS file for each page instead of reusing the same file for several pages.
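
To make that concrete, here is a simplified sketch of the mechanism (not Autoptimize’s actual code; all names, including minify_js, are hypothetical):

// all JS on the page, aggregated into one string
$aggregated = implode('', $all_script_contents);

// a single changed character (even a comma) yields a different hash ...
$hash = md5($aggregated);

// ... and thus a different cache file per unique aggregate
$cachefile = $cache_dir . '/autoptimize_' . $hash . '.js';
if (!file_exists($cachefile)) {
    file_put_contents($cachefile, minify_js($aggregated)); // hypothetical minifier
}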

This is what you can do if you want a healthier cache, both from a server and a visitor perspective (based on JavaScript, but the same principle applies to CSS):

  1. Open two similar pages (posts).
  2. View source of the optimized JavaScript in those two pages.
  3. Copy the source of each to a separate file and replace all semicolons (“;”) with semicolon+linefeed (“;\n”) in both files (see the sketch after this list).
  4. Execute an automatic comparison between the two using e.g. diff (or “compare” in Notepad++); this should give you one or more lines that will probably be almost the same, but not exactly (e.g. with a different nonce or a post ID in them).
  5. Now disable JS optimization and look for similar strings in the inline and the external JavaScript.
  6. If you find it in the inline JavaScript, try to identify a unique string in there (the name of a specific variable, probably) and write that down. If the variable JavaScript is in a file, jot down the filename.
  7. Go to the Autoptimize settings page and make sure the advanced settings are shown.
  8. Now add the strings or filenames from (6) to “Exclude scripts from Autoptimize:” (which is a comma-separated list).
  9. Re-enable JS optimization.
  10. Save settings & clear cache.
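
Steps (3) and (4) can be done with a few lines of PHP plus diff; a minimal sketch (the filenames are placeholders for wherever you saved the two sources):

// split both saved JS sources on ";" so diff can compare them line by line
foreach (array('page1.js', 'page2.js') as $file) {
    $split = str_replace(';', ";\n", file_get_contents($file));
    file_put_contents($file . '.split', $split);
}
// then, on the command line: diff page1.js.split page2.js.split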

This does require some digging, but the advantages are clear: a (much) smaller cache size on disk and better performance for your visitors. Everyone will be so happy, people will want to hug you and there will be much rejoicing, generally.

So why doesn’t Autoptimize have automatic cache pruning? Well, the problem is that a page caching layer (which could be a browser, a caching reverse proxy or a WordPress page caching plugin) contains pages that refer to the aggregated JS/CSS files. If those optimized files were automatically removed while the page remained in the page caching layer, people would get the cached page without any JS or CSS files being available. And as I don’t want Autoptimize to break your pages, I didn’t include an automatic cache purging mechanism. But if you have a bright idea of how this problem could be tackled, I’d be happy to reconsider, of course!

Should you inline or defer blocking CSS?

You care about web performance and so you dutifully aggregate and minify your CSS. But then a couple of months ago Google PageSpeed Insights (for mobile) started identifying CSS as a render-blocking resource, and so you wonder if you should inline your CSS, or even defer loading it. Based on tests executed on a multi-site WordPress installation, deferring CSS is not the best idea just yet, but inlining might be worthwhile! Read on for the hard numbers and other details.

The test-setup

  • 4 test-blogs on a multi-site WordPress instance
  • using the Expound theme (which is interesting because its main stylesheet imports 2 other CSS-files)
  • using Lite Cache for page caching (a new WordPress page caching plugin by the Hyper Cache author)
  • the same content was imported on all 4 blogs
  • all 4 had Autoptimize HTML & JS optimization active
  • the difference was in the Autoptimize CSS settings, where:
    • blog 1 had no CSS optimization at all (baseline)
    • blog 2 had standard CSS aggregation and minification, linked in head
    • blog 3 inlined the optimized CSS
    • blog 4 deferred the optimized CSS
  • each blog was tested on webpagetest.org‘s Amsterdam node on a DSL profile using IE9, doing 9 test runs on one specific blogpost which contained a 16KB image (I excluded favicon.ico as it seemed to pollute results)
  • each blog was analyzed by Google PageSpeed Insights for both mobile & desktop

The results

test | report url | first byte | start render | doc complete | fully loaded | mobile pagespeed | desktop pagespeed
1. no CSS optimization | 140212_Z1_MKN | 0.299s | 2.246s | 2.221s | 2.221s | 79 | 92
2. optimized CSS linked | 140212_H7_MKP | 0.239s | 0.608s | 1.390s | 1.390s | 91 | 97
3. optimized CSS inlined | 140212_A3_NJA | 0.232s | 0.348s | 0.658s | 0.658s | 99 | 99
4. optimized CSS deferred | 140212_8J_P1G | 0.248s | 0.357s | 1.034s | 1.034s | 99 | 95

The conclusions

Based on these tests (your mileage may vary, always test your results):

  • Deferring all CSS is useless: performance is worse, the desktop PageSpeed score is (slightly) lower and there is a “flash of unstyled content” between the rendering of the page and the application of the CSS.
  • Inlining CSS yields the best results from both a page speed and a PageSpeed perspective. Although the base HTML is larger as it carries the CSS payload, this has almost no impact in this specific context and rendering is almost instantaneous. Of course, in a context where multiple other pages from the same site, with the exact same CSS, would be loaded, the impact would be significant. Hence inlining CSS is especially interesting for sites with a low pageviews/visitor ratio.

The future: inline + defer

Deferring CSS may seem pretty useless, but the sweet spot may just be inlining base CSS (everything needed for initial rendering above the fold) and deferring everything else. This is what CSS optimizing tools should focus on in 2014, and you can certainly expect something along these lines in one of the next major Autoptimize releases (although diehards can already test this approach).
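
To illustrate the idea, here is a rough sketch of what such head output could look like (hypothetical code, not what Autoptimize generates today; $critical_css and $full_css_url are assumptions):

// inline the critical, above-the-fold CSS so first render is not blocked
echo '<style>' . $critical_css . '</style>';

// defer the full stylesheet by injecting the link element from JavaScript
echo '<script>var l = document.createElement("link"); l.rel = "stylesheet"; ' .
     'l.href = "' . $full_css_url . '"; document.getElementsByTagName("head")[0].appendChild(l);</script>';

// fallback for visitors without JavaScript
echo '<noscript><link rel="stylesheet" href="' . $full_css_url . '"></noscript>';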

WP YouTube Lyte 1.4.0: support for accessibilityFeature

A new version of WP YouTube Lyte was released over the weekend. Benetech, a U.S. nonprofit that develops and uses technology to create positive social change, offered a patch that adds the accessibilityFeature property to videos that have captions. If you have microdata enabled, WP YouTube Lyte will now automatically check if captions are available and, if so, add the accessibilityFeature property with value “captions” to the HTML-embedded microdata.

As YouTube only offers information on captions in their API v3, which requires authentication, the check is done in a separate, asynchronous call via a proxy-webservice on api.a11ymetadata.org. You can see an example of what the response looks like here and look at the source code on Github. This webservice-request for captions can be disabled by simply switching off microdata. Alternatively, if you want microdata but not the accessibilityFeature-property, you can use the “lyte_docaptions”-filter to set captions to false (you can find an example in lyte_helper.php_example).
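
Disabling the captions lookup that way could look like this (a minimal sketch, assuming the filter simply passes the captions flag through; lyte_helper.php_example contains the canonical version):

// in your theme's functions.php or a small plugin:
// force the captions flag to false via WordPress' filter API
add_filter('lyte_docaptions', '__return_false');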

This was the first time someone actively submitted code changes to add functionality to a project of mine, actually. Working with Benetech was a breeze and GitHub is a great platform to share, review and comment on code. I for one am looking forward to more high-quality contributions like this one!

Irregular Expressions have your stack for lunch

I love me some regular expressions (problems), but have you ever seen one crash Apache? Well I have! This regex is part of YUI-CSS-compressor-PHP-port, the external CSS minification component in Autoptimize, my WordPress JS/CSS optimization plugin:

/(?:^|\})(?:(?:[^\{\:])+\:)+(?:[^\{]*\{)/

Executing that on a large chunk of CSS (lots of selectors for one declaration block, which cannot be ripped apart) triggers a stack overflow in PCRE, which crashes Apache and shows up as a “connection reset” error in the browser.

Regular-expression-triggered segfaults are no exception in the PHP bug tracker, and each and every one of those tickets gets labeled “Not a bug” while pointing the finger at PCRE, which in its man pages and in its bug tracker indeed confirms that stack overflows can occur. This quote from that PCRE bug report says it all, really:

If you are running into a problem of stack overflow, you have the
following choices:

  (a) Work on your regular expression pattern so that it uses less 
      memory. Sometimes using atomic groups can help with this.
  (b) Increase the size of your process stack.
  (c) Compile PCRE to use the heap instead of the stack.
  (d) Set PCRE's recursion limit small enough so that it gives an error
      before the stack overflows.

Are you scared yet? I know I am. But this might be a consolation: if you test your code on XAMPP (or another Apache-on-Windows version), you’re bound to detect the problem early on, as the default ThreadStackSize there is a mere 1MB instead of the whopping 8MB on Linux.
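
In PHP, option (d) from the quote above can be approximated at runtime: lower pcre.recursion_limit so that preg_match returns false with a recursion-limit error instead of blowing the stack. A minimal sketch:

// cap PCRE recursion so a pathological pattern/subject fails cleanly
ini_set('pcre.recursion_limit', '100000');
if (preg_match($pattern, $css) === false
    && preg_last_error() === PREG_RECURSION_LIMIT_ERROR) {
    // fall back to a non-regex code path instead of segfaulting
}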

As for the problem in YUI-CSS-compressor-PHP-port: I logged it on their GitHub issue list and I think I might just have a working alternative, which will be in Autoptimize 1.8.

Looking at 2013 disappearing fast in the rear view mirror

Another year behind us, another overview in numbers (as done previously for 2011 and 2012).

On a personal note, 2013 has not been the easiest of years, but our lovely daughter’s “lentefeest” (a non-religious rite of passage for 6-year-olds) and our holiday in Italy were great highlights.