Attentio tracking buzz, but language is a bitch

I am important! Or rather: some bloggers are important. Or better still: some advertisers, marketers and PR officers consider blogs an important channel to communicate with and through. High-profile blogs (which this one is not, by any measure) can indeed be instrumental in launching geeky products, kick-starting viral campaigns and, in some cases, even influencing the public debate. But what you can't measure doesn't exist, and that's where buzz-tracking tools such as the one from Brussels-based Attentio come into play.
Attentio spiders blogs, forums and news sites and indexes all that content in what must be a super-sized database. In front of that database sits a data-mining application annex website, which allows communication pros to follow up on the positive and negative buzz around their products, product features and competitors on the "Brand Dashboard" in real time.
As straightforward as this may seem, collecting all that content, filtering out the garbage (e.g. splogs and NSFW content) and creating a blazingly fast web-based application to publish these reports on the fly is quite a feat. The demo I got last week during the Emakina/Reference Academy by Amaia Lasa and Kalina Lipinska was impressive enough to make me want to try the application myself in between sessions. Attentio's Linda Margaret patiently "tomtommed" me through the interface (thanks Linda!), giving me a better overview of all the available graphs and screens. All in all an impressive product with a lot of potential, especially for multinationals that have a lot of blog visibility.
A lot of potential? Yes, because there is room for improvement (isn't there always?). Attentio is great for buzz quantification, for showing how many blogs discuss your products, but I had the impression that the reports which try to extract more than these "simple" quantifications were still rough around the edges. This seems largely due to what is the basic building block of a blog: language.
There is, for example, a report which allows you to see buzz per region or country. For this qualification the domain name and/or the geo-location of the IP address are used. But as anyone can choose a TLD of their liking (two Flemish A-list bloggers, to name but a few, do just that) and as hosting abroad is no exception (one of them is hosted in the USA and this blog is on a server in Germany), a considerable number of blogs in the reports I saw were not attributed to a country or region, but were instead classified by their language (Dutch/French) in the same graph. Attentio intends to use information disclosed in the blog content itself to better pinpoint location.
Extracting non-quantitative information from blogs, forums and news sites requires techniques from the fields of computational linguistics and artificial intelligence. One of the most exciting reports in the Brand Dashboard is the "sentiments" report, which tries to categorize buzz as positive, neutral or negative. Up until now this is done using hard-coded rules which only allow content in English to be qualified (hence my writing this post in English, curious if it rings a bell on their own Brand Dashboard). Indeed, Attentio is working on this, as witnessed by the description of the specialties of the smart Attentionistas on their "company info" page. They disclosed they're working with the K.U. Leuven on new AI-based classification software (using Bayesian text classification, one would suspect) which will be released into production later this year. I'm pretty sure this new software could be used for more than just extracting the "sentiment" of a blog post, so I'll certainly be keeping an eye on what these smart boys and girls are doing!
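To give an idea of what such Bayesian text classification boils down to (my own guess at the technique, nothing official from Attentio, and the training sentences below are made up for illustration), here's a minimal naive Bayes sentiment sketch:

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical toy training data: (text, label) pairs
TRAIN = [
    ("great product love it", "positive"),
    ("impressive and very useful", "positive"),
    ("buggy slow and disappointing", "negative"),
    ("terrible interface hate it", "negative"),
]

def train(samples):
    """Count words per label and collect the vocabulary."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of documents
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    """Pick the label with the highest log-probability for the text."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihoods with add-one (Laplace) smoothing
        score = log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            score += log((word_counts[label][word] + 1) /
                         (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(classify("love this impressive product", *model))  # positive
print(classify("slow and buggy", *model))                # negative
```

A real system would of course need far more training data, tokenization that handles punctuation and multiple languages, and probably smarter features than single words, which is presumably where the K.U. Leuven research comes in.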
For those of you who would like to create some buzz-tracker graphs yourselves, Attentio offers basic functionality for free. Happy tracking!

Cache header magic (or how I learned to love http response headers)

Is your site dead slow? Does it use excessive bandwidth? Let me take you on a short journey to the place where you can lessen your worries: the HTTP response headers, where some wee small lines of text can define how your site is loaded and cached.
So you're interested and you are going to read on? In that case let me skip the foolishness (I'm too tired already) and move on to the real stuff. There are 3 types of caching directives that can be put in the HTTP response: permission-, expiry- and validation-related headers:

  1. Permission-related HTTP response headers tell the caching algorithm whether an object may be kept in cache at all. The primary way to do this (in HTTP/1.1-land) is by using Cache-Control: public, Cache-Control: private or Cache-Control: no-cache. No-cache should be obvious, private indicates individual browser caches can keep a copy but shared caches (mainly proxies) cannot, and public allows all caches to keep the object. Pre-HTTP/1.1 this is mainly governed by the Expires header (although that in theory is an expiry-related cache directive): an Expires date in the past, or the Pragma: no-cache header, keeps an object out of cache.
  2. The aim of expiry-related directives is to tell the caching mechanism (in a browser or e.g. a proxy) that upon reload the object can be reused without reconnecting to the originating server. These directives can thus help you avoid network round trips while your page is being reloaded. The following expiry-related directives exist: Expires, Cache-Control: max-age and Cache-Control: s-maxage. Expires sets a date/time (in GMT) at which the object in cache expires and will have to be revalidated. Cache-Control: max-age and Cache-Control: s-maxage (both of which take precedence over Expires if used in conjunction with it) define how old an object may get in cache (the age being taken from the Age HTTP response header, or calculated from the Date header and the current date/time). s-maxage is only used by shared caches (and takes precedence over max-age there), whereas max-age is used by all caches (you could use this to e.g. allow a browser to cache a personalised page but prohibit a proxy from doing so). If none of Expires, Cache-Control: max-age or Cache-Control: s-maxage is defined, the caching mechanism is allowed to make an estimate (this is called "heuristic expiration") of the time an object can remain in cache, based on the Last-Modified header (the true workhorse of caching in HTTP/1.0).
  3. Validation-related directives give the browser (or caching proxy) a means by which an object can be (re)validated, allowing conditional requests to be made to the server and thus limiting bandwidth usage. Response headers used in this respect are principally Last-Modified (the date/timestamp the object was ... indeed last modified) and ETag (which should be a unique string for each object, changing only when the object itself changes).
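To make the expiry rules above a bit more concrete, here's a small Python sketch of how a cache could decide an object's freshness lifetime from those headers. This is my own simplification (real Cache-Control parsing is more involved), but the precedence order follows the description above:

```python
from email.utils import parsedate_to_datetime

def freshness_lifetime(headers, shared_cache=False):
    """Seconds an object may stay fresh, per the precedence rules above.

    headers: dict of lower-cased response header names to values.
    """
    # crude Cache-Control parsing: split on commas, then on '='
    cc = {}
    for part in headers.get("cache-control", "").split(","):
        name, _, value = part.strip().partition("=")
        cc[name] = value
    # s-maxage wins in shared caches, then max-age, then Expires - Date
    if shared_cache and "s-maxage" in cc:
        return int(cc["s-maxage"])
    if "max-age" in cc:
        return int(cc["max-age"])
    if "expires" in headers and "date" in headers:
        expires = parsedate_to_datetime(headers["expires"])
        date = parsedate_to_datetime(headers["date"])
        return int((expires - date).total_seconds())
    if "last-modified" in headers and "date" in headers:
        # heuristic expiration: a common choice is 10% of the time
        # elapsed since the object was last modified
        last_mod = parsedate_to_datetime(headers["last-modified"])
        date = parsedate_to_datetime(headers["date"])
        return int((date - last_mod).total_seconds() * 0.10)
    return 0  # nothing usable: treat as immediately stale

headers = {"cache-control": "public, max-age=3600",
           "date": "Sat, 01 Sep 2007 12:00:00 GMT"}
print(freshness_lifetime(headers))  # 3600
```

The 10% heuristic is not mandated by any standard, it's just one value caches commonly pick when left to guess.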

And there you have it, those are the basics. So what should you do next? Perform a small functional analysis of how you want your site (HTML, images, CSS, JS, ...) to be cached at proxy or browser level. Based on that, tweak the settings of your webserver (for static files served from the filesystem, mostly images, CSS and JS) to allow for caching. The application that spits out HTML should include the correct headers for your pages so these can be cached as well (if you want that to happen, of course). And always keep in mind that no matter how well you analyze your caching needs and how well you set everything up, it all depends on which HTTP standard (be it HTTP/1.0 or 1.1) the caching applications follow (so you probably want to include directives for both HTTP/1.0 and 1.1) and how well they adhere to those standards ... Happy caching!
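As an illustration of what "the application including the correct headers" might look like, here's a hypothetical application-side sketch (standard library only, names of my own invention): it derives an ETag from the content and answers a matching conditional request with a bodyless 304:

```python
from hashlib import md5

def respond(request_headers, body):
    """Return (status, headers, body) with validation-aware caching.

    The ETag is derived from the content, so it changes exactly when
    the content changes; a matching If-None-Match yields a 304.
    """
    etag = '"%s"' % md5(body).hexdigest()
    headers = {
        "ETag": etag,
        "Cache-Control": "public, max-age=300",  # fresh for 5 minutes
    }
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""  # not modified: body is not resent
    return 200, headers, body

status, hdrs, _ = respond({}, b"<html>hello</html>")
status2, _, _ = respond({"If-None-Match": hdrs["ETag"]}, b"<html>hello</html>")
print(status, status2)  # 200 304
```

During the five minutes of freshness a well-behaved cache won't even contact the server; after that it revalidates with If-None-Match and, as long as the content is unchanged, only the 304 status and headers travel over the wire.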

(Disclaimer: there might well be some errors in here, feel free to correct me if I missed something!)

Wall Street Journal: more murders in Belgium than in the US? Wrong!

On the 22nd of August, renowned journalist and neo-conservative columnist Bret Stephens wrote an editorial ("The Many Faces of Belgian Fascism") for the Wall Street Journal in which he stated:

“Belgium’s per capita murder rate, at 9.1 per 100,000 is nearly twice that of the U.S.”.

Some web searching proved this figure utterly wrong. An extract of the mail I sent Mr. Stephens in that respect:

The figure of 9.1 per 100,000 is not correct, as it is based on figures that include attempted -i.e. unsuccessful- murders. Indeed, in 2005 the following statistics applied:

  • Total number of ‘successful’ murders: 174
  • Total attempted (i.e. ‘unsuccessful’) murders: 770
  • Total ‘successful and unsuccessful’ murders: 944

(source: figures in a French-language pdf from the site of the Belgian police. See page 2 under “Infr. contre l’integrite physique”, where ‘Acc.’ stands for successful and ‘Tent.’ for attempted, unsuccessful).
So when calculating the murder rate based on approximately 10,500,000 inhabitants, the figure is not 9 but 1.7 per 100,000. And 1.7 is -not coincidentally- the exact number a UN report mentioned for the homicide rate in Belgium in 1996.
This means that the correct comparison between murder rates in Belgium and the USA (5.5 per 100,000 in 2004) is not "Belgium's per capita murder rate is nearly twice that of the USA" but "... is 3 times lower than that of the USA", which of course places the "pervasive and growing sense of lawlessness" you mention in an entirely different perspective.
I hope this information can be of further use to you and your sources. Do not hesitate to contact me in case you have further remarks or questions about this matter.
Kind regards,
Frank Goossens
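For the record, the arithmetic in the letter checks out. A quick sketch, using the same figures as above (population of roughly 10.5 million):

```python
POPULATION = 10_500_000  # approximate Belgian population, as used above
murders = 174            # 'successful' murders, 2005
attempts = 770           # attempted (unsuccessful) murders, 2005

def per_100k(count, population=POPULATION):
    """Rate per 100,000 inhabitants."""
    return count / population * 100_000

print(round(per_100k(murders), 1))             # 1.7: actual murder rate
print(round(per_100k(murders + attempts), 1))  # 9.0: rate incl. attempts
```

The editorial's 9.1 only appears when the 770 attempts are counted in (with a slightly smaller population estimate); counting actual murders alone gives 1.7 per 100,000.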

I do not agree with most of what Mr. Stephens writes in the rest of his column, but this mainly boils down to a -vast- difference in political beliefs. It is a pity however that, being the high-profile journalist he is, Mr. Stephens did not check the ‘facts’ in his editorial better than he did. He may need to question the reliability of his sources, even if he shares their ideology …