Is your string zipped?

While looking into a strange issue on a multisite WordPress installation which optimized the pages of the main but not of the sub-blogs, I needed code to check whether a string was gzipped or not. I found this like-a-boss code-snippet on StackOverflow which worked for gzencoded strings:

$is_gzip = 0 === mb_strpos($mystery_string , "\x1f" . "\x8b" . "\x08");

But this does not work for strings compressed with gzcompress or gzdeflate, which don’t have the GZIP header data, so in the end I came up with this less funky function which somewhat brutally simply tries to gzuncompress and gzinflate the string:

function isGzipped($in) {
  if (mb_strpos($in , "\x1f" . "\x8b" . "\x08")===0) {
    return true;
  } else if (@gzuncompress($in)!==false) {
    return true;
  } else if (@gzinflate($in)!==false) {
    return true;
  } else {
    return false;
  }
}

Klunky, but it works. Now if only this would confirm my “educated guess” that the original problem was due to a compressed string.

PHP HTML parsing performance shootout; regex vs DOM

As I wrote earlier an Autoptimize user proposed to switch from regular expression based script & style extraction to using native PHP DOM functions (optionally with xpath). I created a small test-script to compare performance and the DOM methods are on average 500% slower than the preg_match based solution. Here are some details;

  • There are 3 tests; regular expression-based (preg_match), DOM + getElementsByTagName and DOM + XPath. You can see the source here and see it in action here.
  • The code in all 3 testcases does what Autoptimize does to start with when optimizing JavaScript:
    1. extract all javascript (code if inline, url if external) and add it to an array
    2. remove the javascript from the HTML
  • With each load of the test-script, the 3 tests get executed 100 times and total time per method is displayed.
  • That test-script was run 5 times on 3 different HTML-files; one small mobile page with some JavaScript and two bigger desktop ones with lots of JS.

The detailed results;

total time regextotal time domtotal time dom+xpath
arturo’s HP0.6114.83664.977
deredactie HP2.33225.6155.879
m deredactie HP0.06960.46040.4558

So while parsing HTML with regular expressions might be frowned upon in developer communities (and rightly so, as a lot can go wrong with PCRE in PHP) it is vastly superior with regards to performance. In the very limited scope of Autoptimize, where the regex-based approach is tried & tested on thousands of blogs, using DOM would simply create too much overhead.

Some HTML DOM parsing gotchas in PHP’s DOMDocument

Although I had used Simple HTML DOM parser for WP DoNotTrack, I’ve been looking into native PHP HTML DOM parsing as a possible replacement for regular expressions for Autoptimize as proposed by Arturo. I won’t go into the performance comparison results just yet, but here’s some of the things I learned while experimenting with DOMDocument which in turn might help innocent passers-by of this blogpost.

  • loadHTML doesn’t always play nice with different character encodings, you might need something like mb_convert_encoding to work around that.
  • loadHTML will try to “repair” your HTML to make sure an XML-parser can work with it. So what goes in will not come out the same way.
  • loadHTML will spit out tons of warnings or notices about the HTML not being XML; you might want to suppress error-reporting by prepending the command with an @ (e.g. @$dom->loadHTML($htmlstring);)
  • If you use e.g. getELementsByTagName to extract nodes into a seperate DomNodeList and you want to use that to change the DomDocument can result in … unexpected behavior as the DomNodeList gets updated when changes are made to the DomDocument. Copy the DomNodes from the DomNodeList into a new array (which will not get altered) and iterate over that to update the DomDocument as seen in the example below.
  • removeChild is a method of DomNode, not of DomDocument. This means $dom->removeChild(DomNode) will not work. Instead invoke removeChild on the parent of the node you want to remove as seen in the example below
// loadHTML from string, suppressing errors
$dom = new DOMDocument();
@$dom->loadHTML($html);
// get all script-nodes
$_scripts=$dom->getElementsByTagName("script");
// move the result form a DomNodeList to an array
$scripts = array();
foreach ($_scripts as $script) {
   $scripts[]=$script;
}
// iterate over array and remove script-tags from DOM
foreach ($scripts as $script) {
   $script->parentNode->removeChild($script);
}
// write DOM back to the HTML-string
$html = $dom->saveHTML();

Now chop chop, back to my code to finish that performance comparison. Who know what else we’ll learn 😉

Irregular Expressions have your stack for lunch

I love me some regular expressions (problems), but have you ever seen one crash Apache? Well I have! This regex is part of YUI-CSS-compressor-PHP-port, the external CSS minification component in Autoptimize, my WordPress JS/CSS optimization plugin:
/(?:^|\})(?:(?:[^\{\:])+\:)+(?:[^\{]*\{)/)/
yo regex dawgExecuting that on a large chunk of CSS (lots of selectors for one declaration block, which cannot be ripped apart) triggers a stack overflow in PCRE, which crashes Apache and shows up as a “connection reset”-error in the browser.
Regular expression triggered segfaults are no exception in the PHP bugtracker and each and every of those tickets gets labeled “Not a bug” while pointing the finger at PCRE, which in their man-pages and in their bug tracker indeed confirm that stack overflows can occur. This quote from that PCRE bug report says it all, really;

If you are running into a problem of stack overflow, you have the
following choices:
  (a) Work on your regular expression pattern so that it uses less
      memory. Sometimes using atomic groups can help with this.
  (b) Increase the size of your process stack.
  (c) Compile PCRE to use the heap instead of the stack.
  (d) Set PCRE's recursion limit small enough so that it gives an error
      before the stack overflows.

Are you scared yet? I know I am. But this might be a consolation; if you test your code on xampp (or another Apache on Windows version), you’re bound to detect the problem early on, as the default threadstacksize there is a mere 1MB instead of the whopping 8MB on Linux.
As for the problem in YUI-CSS-compressor-PHP-port; I logged it on their Github issue-list and I think I might just have a working alternative which will be in Autoptimize 1.8.

Simple HTML DOM Parser not that simple

Notwithstanding the name, using PHP Simple HTML DOM Parser isn’t always simple. While working on some issues with WP DoNotTrack‘s SuperClean mode, I encountered these two quirks:

  1. By default Simple HTML DOM removes linebreaks. That means that when you write the modified DOM back to a string for outputting, some (sloppy) JavaScript is bound to break. The solution: pass extra arguments to the DOM-creating functions, as “documented” in the Simple HMTL DOM’s source code. For str_get_html it reads:
    function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
    

    Set the 5th argument to false to tell the parser not to remove “\r\n”‘s.

  2. Simple HTML DOM is very liberal. It is so liberal, in fact, that it will try to make a DOM out of whatever you throw at it, without even blinking. Until you try to find elements using “find” on the DOM Object, that is, because at that point you might get a “Fatal error: Call to a member function find() on a non-object“-error thrown back at you. You can avoid that nastiness by checking the object for the existence of the find-method and, while you’re at it, also check if there is a HTML-element in the DOM:
    $html = file_get_html('http://url.to/filename.html');
    // first check if $html->find exists
    if (method_exists($html,"find")) {
         // then check if the html element exists to avoid trying to parse non-html
         if ($html->find('html')) {
              // and only then start searching (and manipulating) the dom
         }
    }

So that’s how to put the simple back into PHP Simple HTML DOM Parser. Until the next quirk comes up, because that’s what parsing HTML is all about after all, no?

Some resources about PHP on App Engine

There isn’t a whole lot of documentation about running PHP on Google App Engine to be found, but these links may be helpful (in a “note to self: I should try this when I find the time” kind of way):

If you stumble on these pages because you’re thinking about creating and deploying a PHP application on Google App Engine, you should take into account that although Quercus looks promising, there are bound to be bugs that you’ll sooner or later run into. There are, for example, issues with determining file length and there are problems with variable referencing in some rare/ complex cases. You have been warned! 😉

Google App Engine project template for PHP (with Quercus)

So you’re a wanna-be developer who’d love to deploy in the cloud, but you only  “know” PHP? Well, as you might already have read elsewhere Caucho, the company behind Resin, has a 100% Java GPL’ed implementation of PHP5 called Quercus that can be used to run PHP on GAE. It took me some time to put the pieces of the puzzle together, but in the end it’s pretty straightforward.
From scratch to a deployed webapp in 7 steps:

  1. Download & install the Google App Engine SDK
  2. Download this GAE project template for PHP and unzip it in the root of the SDK directory as  projects/phptemplate/
  3. Put your PHP-files in projects/phptemplate/war/ (you probably want to overwrite index.php and remove phpinfo.php)
  4. Test you application locally with dev_appserver as described here
  5. Login on https://appengine.google.com/ and register a new application
  6. Put the app id from (5) in projects/phptemplate/war/WEB-INF/appengine-web.xml, between the <application>-tage
  7. Upload your application as described here: appcfg –enable_jar_splitting update <path-to-war> (–enable_jar_splitting is needed as the WEB-INF/lib/resin.jar is rather big)

And there you have it, your very own PHP-app on GAE! Check out the Quercus info on  on how you can access Java components from within you PHP-code, it might come in very handy to use GAE’s Java API’s for the datastore, queues and all those other goodies!
(Disclaimer: while this here template seems to work, I can’t make any promises or provide any kind of warranty.  As soon as you download it, you assume all responsibilities for any problems you might cause to the Internet, GAE or the Ozone-layer.)

PHP OAuth extension: trial, error and success

I’ve been experimenting with the PHP OAuth PECL extension over the last few days and initially ran into some small problems getting it to function correctly. So for the sake of making this world wide web an even better place, here are some error-messages you might encounter and what you could do to resolve them:

  • “OAuth::getRequestToken() URL file-access is disabled in the server configuration” → the OAuth extension by default uses fopen (streams) to fetch data from the OAuth provider (server), but you’re probably slightly paranoid (or is it security- conscious?) and have allow_url_fopen = 0 in you php.ini. Just change that value to “1” and reload your webserver and then see …
  • SSL: fatal protocol error → OAuth seems to be working fine, but for reasons unknown OAuth insists something fatal just happened. You could ignore this one, or try to lower the error-reporting value or do as I did and decide to use OAuth::setRequestEngine to switch from those doomed streams to almighty Curl, until …
  • “setRequestEngine expects parameter to be long, string given” → so you installed (compiled) the OAuth-extension, but you didn’t have that libcurl-dev package installed, did you? No matter how polite you request that Curl-engine, it is just not going to happen unless you install libcurl-dev, and then re-installed OAuth.

And after successfully re-installing OAuth, this time with Curl-support, you could try to build a small application, accessing GMail maybe?

Fun with caching in PHP with APC (and others)

After installing APC, I looked through the documentation on php.net and noticed 3 interesting functions with regards to session-independent data caching in PHP;

When talking about caching, apc_delete might not be that important, as apc_store allows you to set the TTL (time to live) of the variable you’re storing. If you try to retrieve a stored variable which exceeded the TTL, APC will return FALSE, which tells you to update your cache.
All this means that adding 5 minutes worth of caching to your application could be as simple as doing;

if (($stringValue=apc_fetch($stringKey)) === FALSE) {
$stringValue = yourNormalDogSlowFunctionToGetValue($stringKey);
apc_store($stringKey,$stringValue,300);
}

From a security point-of-view however (esp. on a shared environment) the APC-functions should be considered extremely dangerous. There are no mechanisms to prevent a denial of service; everyone who “does PHP” on a server can fill the APC-cache entirely. Worse yet, using apc_cache_info you can get a list of all keys which you in turn can use to retrieve all associated values, meaning data theft can be an issue as well. But if you’re on a server of your own (and if you trust all php-scripts you install on there), the APC-functions can be sheer bliss!
And off course other opcode caching components such as XCache and eAccelerator offer similar functionality (although it’s disabled by default in eAccelerator because of the security concerns).

Trading eAccelerator for APC

Yesterday I somewhat reluctantly removed eAccelerator from my server (Debian Etch) and installed APC instead. Not because I wasn’t satisfied with performance of eAccelerator, but because the packaged version of it was not in the Debian repositories (Andrew McMillan provided the debs), and those debs weren’t upgraded at the same pace and thus broke my normal upgrade-routine. Moreover APC will apparently become a default part of PHP6 (making the Alternative PHP Cache the default opcode cache component). Installation was as easy as doing “pecl install apc” and adding apc to php.ini. Everything seems to be running as great as it did with eAccelerator (as most test seem to confirm).