Ranting & Raving at Drupal Summit 2011

I attended Drupal Summit in Genk a couple of days ago and amidst the general “Drupal is the best thing since sliced bread” atmosphere, there were some interesting discussions about the platform’s maturity. The presentations of Peter Van Den Broeck (for VRT) and Wouter Mertens (for competitor VMMA) in particular seemed to be on opposite sides, with VMMA having multiple successful Drupal sites in production and VRT struggling to get their projects finished, telling Captain Buytaert “Dries, we’re not there yet”. But underneath the surface and despite the differences (a dedicated team of sysadmins & developers at VMMA vs fixed-price, project-based external development for VRT), both were talking about the same problems on a technical level: modules & performance.
The Drupal module community is a great bunch of enthusiastic developers, building an impressive number of modules that cover almost any feature one would want to have in a website. But module quality, support & compatibility vary enormously. Some modules seem to be true Rube Goldberg machines, providing tons of functionality that only a few people need but which makes the UI a usability-hell and the code complex, error-prone and possibly a real performance-hog.
And while we’re on the subject: performance doesn’t come out of the box and Drupal as such does not scale very well. Install it with a bunch of modules, generate a reasonable amount of page requests and you have a CPU-intensive system that generates a crapload of database-connections. Adding Memcache between your Drupal & MySQL helps, as most requests will be handled from cache. And putting a caching reverse proxy (Varnish, Squid or even Apache’s mod_proxy+mod_cache if you insist) in front of your Drupal does miracles, serving visitors the same content without the need for Drupal to bootstrap. So sure, you can build a scalable solution that provides great performance, but one could say this is despite Drupal, not because of it. After all, when using Memcache & Varnish almost any CMS will have great performance, won’t it?
So yeah, Drupal can be a nice solution to your problem, but it does require more than just a superficial knowledge of how to install it together with some modules and a theme. Make sure there are smart people on your team or project who have a profound knowledge of modules & module development and who know a lot about MySQL (and clustering, mirroring or master/slave setups), Memcache and Varnish (or Squid, forget about mod_cache if you can). You’re bound to run into some problems, as Peter Van Den Broeck confirmed, but with the right people and architecture your Drupal-project can indeed be the best thing since sliced bread.

3 Apache mod_cache gotchas

If you want to avoid the learning curve of Squid and Varnish or the cost of a dedicated caching & proxying appliance, using Apache with mod_cache may seem like a good, simple and cheap solution. Rest assured, it can be -to some extent- but here are 3 gotchas I learned the hard way:

  1. mod_cache ignores Cache-Control if Expires is in the past (which it shouldn’t according to RFC2616), so you might have to unset the Expires-header.
  2. mod_cache by default caches cookies! Let me repeat: cookies are cached! More precisely, the Set-Cookie response header is stored along with the page. That might be a huge security-disaster waiting to happen, as session IDs (that provide access for logged-on users) are generally stored in cookies. If a logged-on user requests an uncached page, then that user’s cookie will get cached and sent to other users that request the same page. Do disable this by adding “CacheIgnoreHeaders Set-Cookie” to your config.
  3. mod_cache by default treats all browsers like the one that triggered the caching of the object. In the field that approach can cause problems with e.g. CSS-files that are stored gzipped (because the first browser requested them with the header “Accept-Encoding: gzip, deflate”). If a browser that does not support gzipped content requests the same file, the CSS will be unreadable and thus not applied. The solution: make sure the “backend webserver” sends the “Vary: Accept-Encoding” header in the response (esp. for CSS-files). This will tell mod_cache to take different Accept-Encodings into account, storing and sending different versions of the same CSS-file.
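
For what it’s worth, the three fixes could look roughly like this in the config. This is a sketch, not a tested setup: it assumes mod_headers is loaded, and where exactly each directive belongs (proxy or backend, server-wide or per virtual host) depends on your own setup.

```apache
# Gotcha 1: drop Expires so that mod_cache looks at Cache-Control instead.
# Assumption: the response carries a usable Cache-Control max-age.
Header unset Expires

# Gotcha 2: never store Set-Cookie along with a cached response.
CacheIgnoreHeaders Set-Cookie

# Gotcha 3: on the backend webserver, make responses vary on encoding
# so that gzipped and plain versions are cached separately.
Header append Vary Accept-Encoding
```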

Drupal, mod_cache & RFC2616 caching

Suppose you’re setting up a Drupal-based site for which you have to implement a caching reverse proxy and for reasons beyond your comprehension Varnish (or even Squid) is not an option. Oh no, you’re stuck with Apache’s mod_proxy and mod_cache! What should you do?
First of all, Drupal 6 doesn’t like reverse proxies. If you don’t want to wait for version 7, which should do better in this respect, you might want to look at Pressflow. This Drupal 6 “distro” has everything on board to work with reverse proxies. So install Pressflow (or try to apply this out-of-date diff to stock Drupal) and in the Performance-screen set “Caching Mode” to “External” and “Page Cache Maximum Age” to the number of minutes you consider a cached page valid. Voila, you’re done in Drupal (edit: almost, as you might also want to change the $base_url in sites/default/settings.php to the reverse proxy URL after you have configured Apache).
Next up: Apache! A simple configuration like this one should do the trick:

ProxyRequests Off
ProxyPass /rp_drupal http://localhost/pressflow
ProxyPassReverse /rp_drupal http://localhost/pressflow
CacheEnable disk /rp_drupal/
CacheRoot c:/TEMP/apacache
CacheDefaultExpire 3600

OK, this must surely work, no? Well it should, but it doesn’t! When setting your Apache-loglevel to debug you’ll see “not cached” entries in your error-log, with the following reason:

Expires header already expired, not cacheable

Expires in the past, what does Pressflow think it’s doing deep down in includes/bootstrap.inc?

// HTTP/1.0 proxies do not support the Vary header, so prevent any caching
// by sending an Expires date in the past. HTTP/1.1 clients ignores the
// Expires header if a Cache-Control: max-age= directive is specified (see RFC
// 2616, section 14.9.3).
drupal_set_header('Expires', 'Sun, 11 Mar 1984 12:00:00 GMT');
// [...]
$max_age = variable_get('cache', CACHE_DISABLED) == CACHE_AGGRESSIVE && (!isset($_COOKIE[session_name()]) || isset($hook_boot_headers['vary'])) ? variable_get('page_cache_max_age', 0) : 0;
$default_headers['Cache-Control'] = 'public, max-age=' . $max_age;

Darn, those Pressflow-guys seem to have read up on their RFCs! And indeed, 2616 confirms that Cache-Control’s max-age overrules Expires:

If a response includes both an Expires header and a max-age directive, the max-age directive overrides the Expires header, even if the Expires header is more restrictive. This rule allows an origin server to provide, for a given response, a longer expiration time to an HTTP/1.1 (or later) cache than to an HTTP/1.0 cache.

Mod_cache’s code seems to take a much simpler approach; at line 503 it decides not to cache based on an Expires-header in the past, totally dismissing the potential presence of cache-control’s max-age.

else if (exp != APR_DATE_BAD && exp < r->request_time)
    {
        /* if a Expires header is in the past, don't cache it */
        reason = "Expires header already expired, not cacheable";
    }

But you’re not interested in code which does or does not adhere to whatever RFC some spec-buffs came up with, you just want to cache your friggin’ Drupal-site! Well, fear not little hacker-boy, here’s some Apache-magic to cure your ailments, to be copy/pasted in the config before ProxyPass and ProxyPassReverse:

<Location /rp_drupal>
     SetEnvIf Request_Protocol "HTTP/1.1" expires_overrule
     # homework: add a SetEnvIf to see if cache-control max-age is present
     Header unset Expires env=expires_overrule
</Location>
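
As a possible take on that homework: on Apache 2.4 or later, mod_headers accepts an expr= condition, which can look at the response’s Cache-Control header directly instead of relying on an environment variable. The following is an untested sketch that assumes a 2.4 server with mod_headers loaded; on older versions you’re stuck with the SetEnvIf approach.

```apache
<Location /rp_drupal>
     # Assumption: Apache 2.4+, where Header takes an expr= condition.
     # Only drop Expires when the response also carries a max-age.
     Header unset Expires "expr=%{resp:Cache-Control} =~ /max-age=/"
</Location>
```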

So there you have it, a rudimentary caching setup for Drupal (in the guise of Pressflow) using nothing but Apache’s mod_proxy and mod_cache. Now go do your homework and test and do some fine-tuning and test some more. Happy caching!