Watch out for APC caching in CakePHP

Recently I moved a popular site from its own host to my personal VPS, as the original host had been compromised by a Korean hacker who used it as part of a botnet.

My own VPS is set up using Apache’s mpm-itk. This allows me to run every VirtualHost under its own username:

<VirtualHost *:80>
    AssignUserId notes notes
    ServerName notes.bartv.be
    DocumentRoot /home/notes/wwwroot
</VirtualHost>
<VirtualHost *:80>
    AssignUserId bart bart
    ServerName bartv.be
    ServerAlias www.bartv.be
    DocumentRoot /home/bartv/wwwroot
</VirtualHost>

This is great in terms of File Permissions, CPU throttling, sandboxing, … I really see the advantage.

That’s why I was really surprised when CakePHP application A showed a database error that belonged to CakePHP application B.
The error basically said it couldn’t find the fields it was looking for in the table it was querying. I looked a bit further and found the following:

  • It was using the right codebase (It didn’t get confused in the DocumentRoot)
  • It was using the right connection (Other queries did function, and the fields it was looking for didn’t exist like it said)
  • The Cake Model it uses existed in both applications A and B under the same name!

 
After looking through the configuration, I found the following in the cache settings of both applications (look in core.php):

$engine = 'File';
if (extension_loaded('apc') && function_exists('apc_dec') && (php_sapi_name() !== 'cli' || ini_get('apc.enable_cli'))) {
	$engine = 'Apc';
}
//...
// Prefix each application on the same server with a different string, to avoid Memcache and APC conflicts.
$prefix = 'myapp_';

What was happening: both applications were writing their cached model data to APC under the same key, myapp_MyModel, so both ended up reading the same cached model (the same goes for controllers, for that matter). On top of that, both were probably overwriting this cache entry far too often, slowing down both applications.

Quickly change your own application’s prefix if you’re using APC, and make a mental note to change it for every new application!
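To make the collision concrete, here is a minimal sketch in JavaScript rather than PHP. A plain Map stands in for the server-wide APC store, and makeCache is a hypothetical helper, not a CakePHP API; the only point is to show what an identical prefix does to the keys.

```javascript
// A plain Map stands in for the server-wide APC store (assumption for illustration).
const sharedStore = new Map();

// Hypothetical helper: a tiny cache client that namespaces its keys with a prefix.
function makeCache(prefix) {
  return {
    write: (key, value) => sharedStore.set(prefix + key, value),
    read: (key) => sharedStore.get(prefix + key),
  };
}

// Both applications keep the default prefix, so their keys are identical.
const appA = makeCache('myapp_');
const appB = makeCache('myapp_');
appA.write('MyModel', { fields: ['id', 'title'] });
appB.write('MyModel', { fields: ['id', 'price'] });
const collided = appA.read('MyModel'); // application A now gets B's schema

// With a prefix per application, the entries stay apart.
const fixedA = makeCache('appa_');
const fixedB = makeCache('appb_');
fixedA.write('MyModel', { fields: ['id', 'title'] });
fixedB.write('MyModel', { fields: ['id', 'price'] });
const separated = fixedA.read('MyModel'); // application A gets its own schema back
```

In the first half, application A reads back the schema that application B wrote, which is exactly the kind of "field not found" error I was seeing.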

PHP Edge-N-Gram or wordsplit using CakePHP

While working on a CakePHP project, I needed to implement search. Since the client had a PHP-only server, I couldn’t use a SOLR server.

I decided to give MySQL Fulltext search a go!

MySQL fulltext is MySQL’s approach to indexed text search. It works by splitting text into words on spaces and special characters, and it is quite fast at that! However, I was missing one feature: word splitting, or as they call it in SOLR terms, Edge N-Gram.
It works by splitting a long word into shorter prefixes. Given the word bedside, it produces subwords like bed, beds, bedsi, bedsid and bedside.
You can immediately tell that the first two are very relevant for incoming queries.

I decided to write my own little version of this. Do note that it uses one CakePHP function, Sanitize::paranoid, which strips all non-alphanumeric characters. In case you’re not using CakePHP, this is its implementation:

public static function paranoid($string, $allowed = array()) {
	$allow = null;
	if (!empty($allowed)) {
		foreach ($allowed as $value) {
			$allow .= "\\$value";
		}
	}
 
	if (is_array($string)) {
		$cleaned = array();
		foreach ($string as $key => $clean) {
			$cleaned[$key] = preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $clean);
		}
	} else {
		$cleaned = preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
	}
	return $cleaned;
}

An implementation of edge-n-gram would be the following:

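class IndexHelper{
 
	// Returns a comma-separated list of prefixes for every word in $inputString.
	public static function edgeNGram($inputString, $allowedChars = array(' '), $minChars = 2, $maxChars = 10){
		// Strip everything except alphanumerics and the allowed characters
		$output = Sanitize::paranoid($inputString, $allowedChars);
		$splits = array();
		$words = explode(" ", $output);
		foreach($words as $word){
			$split = IndexHelper::splitWord($word, $minChars, $maxChars);
			// Skip empty results (consecutive spaces, or words shorter than $minChars)
			if($split !== ''){
				$splits[] = $split;
			}
		}
		return join(",", $splits);
	}
	// Returns the prefixes of $input from $min up to $max characters, comma-separated.
	private static function splitWord($input, $min, $max){
		$max = (strlen($input) < $max ? strlen($input) : $max);
		$splits = array();
		for($i = $min; $i <= $max; $i++){
			$splits[] = substr($input, 0, $i);
		}
		return join(",", $splits);
	}
 
}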
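If you want to play with the idea outside of CakePHP, here is a rough JavaScript port of the same approach. It is only a sketch: the sanitising is reduced to a single regular expression that keeps letters, digits and spaces, and the function names simply mirror the PHP ones.

```javascript
// Returns the prefixes of `word` from `min` up to `max` characters.
function splitWord(word, min, max) {
  const limit = Math.min(word.length, max);
  const splits = [];
  for (let i = min; i <= limit; i++) {
    splits.push(word.slice(0, i));
  }
  return splits;
}

// Returns a comma-separated list of prefixes for every word in `input`.
function edgeNGram(input, min = 2, max = 10) {
  return input
    .replace(/[^a-zA-Z0-9 ]/g, '') // keep alphanumerics and spaces only
    .split(' ')
    .filter((word) => word.length > 0) // ignore consecutive spaces
    .flatMap((word) => splitWord(word, min, max))
    .join(',');
}

const indexed = edgeNGram('bedside table', 3);
// indexed === 'bed,beds,bedsi,bedsid,bedside,tab,tabl,table'
```

The resulting string can be stored in an extra column that is covered by the fulltext index, so a query for bed also finds bedside.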

Tracking Facebook likes, shares and sends in Google Analytics

This is a follow-up to my previous post about tracking external links with Google Analytics. Here’s something you can use to track Facebook likes and shares.

We’ll track clicks using the Facebook events. Note that this only works if you’re using the XFBML version of the buttons (not the iframe version)!
Here are the things we’re going to track:

  • Somebody clicks the ‘Like’ button on your website to ‘Like’ your Facebook page
  • Somebody clicks the ‘Like’ button on your website to share the current page on his ‘Wall’
  • Somebody clicks the ‘Send’ button to share this page with some friends on Facebook

And it’s all in here:

FB.Event.subscribe("edge.create",function(response){
	if(response.indexOf("facebook.com") > 0){
		//if the returned link contains 'facebook.com', it's a 'Like' for your Facebook page
		_gaq.push(['_trackEvent','Facebook','Like',response]);
	}else{
		//else, somebody is sharing the current page on their wall
		_gaq.push(['_trackEvent','Facebook','Share',response]);
	}
});
FB.Event.subscribe("message.send",function(response){
	_gaq.push(['_trackEvent','Facebook','Send',response]);
});

As you may have noticed, the Facebook event passes the liked/shared/sent link as the response.
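One caveat with the indexOf check: a page of your own with "facebook.com" somewhere in its URL would be counted as a 'Like'. A slightly stricter variant compares the hostname instead. This is just a sketch; classifyEdge is a hypothetical helper name, and it assumes the response is always an absolute URL.

```javascript
// Hypothetical helper: decide whether an edge.create response is a 'Like'
// for a Facebook page or a 'Share' of one of your own pages.
function classifyEdge(url) {
  const host = new URL(url).hostname;
  const isFacebook = host === 'facebook.com' || host.endsWith('.facebook.com');
  return isFacebook ? 'Like' : 'Share';
}

const pageLike = classifyEdge('http://www.facebook.com/mypage');
const ownPage = classifyEdge('http://notes.bartv.be/post?ref=facebook.com');
// pageLike === 'Like', ownPage === 'Share'
```

Inside the subscribe callback you would then push classifyEdge(response) as the event action.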

Easily track outgoing links with jQuery and Google Analytics

Outgoing links on your website cause three problems:

  • They take link juice away from your page
  • They drive traffic away from your website
  • They don’t let you measure the number of clicks on that link

The first problem you have to sort out manually, by adding the rel="nofollow" attribute to all external links, as recommended by Google.

The second and third problems can easily be fixed with jQuery. I’ll show you how:

Opening external links in a new tab

By opening those links in a new tab, your page stays open, so the user isn’t driven away from it.
Normally, this is done by manually adding target="_blank" to each link. Let’s let jQuery do that for us.

$(document).ready(function(){
    $("a[href^='http']").attr('target','_blank');
});

(Do note that I chose to define links starting with ‘http’ as external links)

Tracking clicks on external Links

We can track links by using Google Analytics Event tracking:

$(document).ready(function(){
    $("a[href^='http']").click(function(){
        _gaq.push(['_trackEvent', 'External Link','Click', $(this).attr("href")]);
    });
});

Putting it all together

$(document).ready(function(){
    $("a[href^='http']")
        .attr('target','_blank')
        .click(function(){
            _gaq.push(['_trackEvent', 'External Link','Click', $(this).attr("href")]);
        });
});

Help! My internal links also start with ‘http’

No problem, you can do as I did and borrow this trick from Karl Swedberg:

$('a').filter(function() { return this.hostname && this.hostname !== location.hostname; })
.attr('target','_blank')
.click(function(){
    _gaq.push(['_trackEvent', 'External Link','Click', $(this).attr("href")]);
});
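The hostname comparison in that filter can also be tried outside the browser. The sketch below uses the standard URL class; isExternal is a hypothetical helper name, and example.com is only there for illustration.

```javascript
// Hypothetical helper: true when `href`, resolved against the current page,
// points to a different hostname. Links without a hostname (mailto:, for
// example) are not treated as external, just like in the jQuery filter.
function isExternal(href, pageUrl) {
  const link = new URL(href, pageUrl);
  const page = new URL(pageUrl);
  return link.hostname !== '' && link.hostname !== page.hostname;
}

const current = 'http://notes.bartv.be/some-post';
const results = [
  isExternal('http://example.com/elsewhere', current), // other host
  isExternal('http://notes.bartv.be/other-post', current), // absolute, same host
  isExternal('/about', current), // relative link
  isExternal('mailto:someone@example.com', current), // no hostname at all
];
// results is [true, false, false, false]
```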

Google AppEngine translate bot

Google recently released a few translation bots.
Since I do a lot of Dutch -> French translation, I quickly whipped up my own and deployed it to Google AppEngine.

For the translation I used the unofficial google-api-translate-java jar file.
Here’s what you need to do:

Place the jar in war\WEB-INF\lib

war\WEB-INF\web.xml

<?xml version="1.0" encoding="utf-8"?>
<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://java.sun.com/xml/ns/javaee"
xmlns:web="http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" version="2.5">
	<servlet>
		<servlet-name>Translate_Bot</servlet-name>
		<servlet-class>be.bartv.translatebot.Translate_BotServlet</servlet-class>
	</servlet>
	<servlet-mapping>
		<servlet-name>Translate_Bot</servlet-name>
		<url-pattern>/translate_bot</url-pattern>
	</servlet-mapping>
	<servlet>
		<servlet-name>xmppreceiver</servlet-name>
		<servlet-class>be.bartv.translatebot.XMPPReceiverServlet</servlet-class>
	</servlet>
	<servlet-mapping>
		<servlet-name>xmppreceiver</servlet-name>
		<url-pattern>/_ah/xmpp/message/chat/</url-pattern>
	</servlet-mapping>
	<welcome-file-list>
		<welcome-file>index.html</welcome-file>
	</welcome-file-list>
</web-app>

war\WEB-INF\appengine-web.xml

<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
	<application>translate-bot</application>
	<version>1</version>
 
	<!-- Configure java.util.logging -->
	<system-properties>
		<property name="java.util.logging.config.file" value="WEB-INF/logging.properties"/>
	</system-properties>
	<inbound-services>
		<service>xmpp_message</service>
	</inbound-services>
 
</appengine-web-app>

src\be\bartv\translatebot\XMPPReceiverServlet.java

package be.bartv.translatebot;
 
import java.io.IOException;
 
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
 
import com.google.api.GoogleAPI;
import com.google.api.translate.Language;
import com.google.api.translate.Translate;
import com.google.appengine.api.xmpp.JID;
import com.google.appengine.api.xmpp.Message;
import com.google.appengine.api.xmpp.MessageBuilder;
import com.google.appengine.api.xmpp.XMPPService;
import com.google.appengine.api.xmpp.XMPPServiceFactory;
 
public class XMPPReceiverServlet extends HttpServlet{
	private static final long serialVersionUID = 2212159648921332999L;
	public void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException
	{
		XMPPService service = XMPPServiceFactory.getXMPPService();
		Message message = service.parseMessage(req);
		JID jid = message.getFromJid();
		String content 	= message.getBody();
 
		String reply = "Hmmz. I Should return a translation now";
		try {
			GoogleAPI.setHttpReferrer("http://notes.bartv.be/");
			reply = Translate.execute(content, Language.DUTCH, Language.FRENCH);
		} catch (Exception e) {
			reply = "Error occurred: "+e.getMessage();
			e.printStackTrace();
		}
		service.sendMessage(new MessageBuilder().withBody(reply).withRecipientJids(jid).build());
	}
}

Sometimes I do get an error saying I violate Google’s terms and conditions. No idea why, though …

Google Breadcrumbs come from … Breadcrumbs

Google sometimes shows breadcrumbs in its search results.
Instead of showing the URL of the page you’re going to visit (in the green bit), they’ll show a path, like so:
Homepage › Category › Subcategory › Something something.
Many SEO experts have been trying to figure out just where they come from. So have I.
Recently, on one of my websites, promoties.be, I had the chance to figure it out.

All promotions are divided into categories and subcategories. The URLs of these categories are always mapped hierarchically.
e.g. http://www.promoties.be/categorie/autos-motoren-27/aanhangwagens-1419 is a subcategory of http://www.promoties.be/categorie/autos-motoren-27 making the URL logically traversable.
However, a specific offer or product under this category is not part of this structure, mainly because that would make the URL way too long.
An example is http://www.promoties.be/promotie-bw1-aanhangwagen-864809. As you can see, the categories do not appear in the URL.

However, the breadcrumb on the page for this offer is as follows: Home > Promoties > Auto’s & Motoren > Aanhangwagens > Bw1 aanhangwagen, with all the nested categories intact.

Now the Google search result block for this specific offer was the following:

[Screenshot: Google search result for "Bw1 aanhangwagen"]

In this case we can conclude that Google took the breadcrumbs in the search result from the breadcrumbs on the page.

More info on breadcrumbs:

http://www.google.com/support/webmasters/bin/answer.py?answer=185417

lib.js: Good practice and generally a good idea

I recently stumbled upon an article on Six Revisions titled Are Current web design trends pushing us back to 1999?.

I found it to be a very interesting article. It mainly talks about how new trends on the web are bringing back old problems, like the Flash splash page or the shoutbox.

One thing I found very interesting was the part called Modern-Day Bloated, Cut-And-Paste Scripts.

Being involved with jQuery on a day-to-day basis, you start using some plugins, or even write some of your own.
But once you start stacking plugins, the browser has to load all of them, generating more requests, which is generally not a good idea.
[Image: Bloated plugins]

Now whenever I create a new web project, I use one JS file: lib.js. This file contains everything I need; it’s like a Swiss Army knife!
The structure is usually as follows (depending on your project’s needs):

  • jQuery
  • jQuery UI
  • Plugins
  • $(document).ready(function(){ /**magic here **/});

You could argue: but doesn’t the file size increase a lot? Letting the user download a 250k file is quite a lot!
I agree, but if you play your cards right in the server configuration, with a little help from Google’s mod_pagespeed or simply by setting good ETags or Expires headers, the load happens just once (!!), and the rest of the surfing experience stays snappy.

Force AJAX calls no-cache in Java. The clean way

Recently I discovered Internet Explorer caches some AJAX calls.

I was using jQuery to make some AJAX calls in a web admin interface I’m building. I noticed none of the data changed when I tried to refresh it (using an AJAX call). Conclusion: IE caches AJAX calls… very annoying.

You could go around and alter every method in your Struts/Spring/… application to force no-cache. But that would take some time. Instead, I wrote a Filter.

Hold on though, you don’t want every page to get the no-cache headers, that would seriously decrease your site performance (all pages would be force-reloaded instead of browser-cached). So we’ll only filter out AJAX calls.

Luckily, jQuery sends a request header with every AJAX call: X-Requested-With: XMLHttpRequest

[Screenshot: the X-Requested-With request header]

The Code!

import java.io.IOException;
 
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
 
public class AjaxCacheFilter implements Filter{
 
	@Override
	public void destroy() {
	}
 
	@Override
	public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
		// jQuery marks its AJAX calls with this header (header names are matched case-insensitively)
		if (request instanceof HttpServletRequest
				&& "XMLHttpRequest".equals(((HttpServletRequest) request).getHeader("X-Requested-With"))) {
			HttpServletResponse httpResponse = (HttpServletResponse) response;
			// An Expires date in the past, plus no-cache for HTTP/1.1 and HTTP/1.0 clients
			httpResponse.setDateHeader("Expires", 0);
			httpResponse.addHeader("Cache-Control", "no-cache");
			httpResponse.addHeader("Pragma", "no-cache");
		}
		chain.doFilter(request, response);
	}
 
	@Override
	public void init(FilterConfig arg0) throws ServletException {
	}
}

And add it as the last filter in your web.xml

<filter>
	<filter-name>ajaxCache</filter-name>
	<filter-class>
		com.yoursite.web.filters.AjaxCacheFilter
	</filter-class>
</filter>
<filter-mapping>
	<filter-name>ajaxCache</filter-name>
	<url-pattern>/*</url-pattern>
	<dispatcher>FORWARD</dispatcher>
	<dispatcher>REQUEST</dispatcher>
</filter-mapping>
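If you want to reason about the filter without deploying it, the decision it makes fits in a few lines. The sketch below is JavaScript and not part of the Java application; noCacheHeadersFor is a hypothetical helper that takes the request headers and returns the extra response headers the filter would add.

```javascript
// Hypothetical helper mirroring the filter: returns the extra response
// headers for a request, or an empty object for normal page loads.
function noCacheHeadersFor(requestHeaders) {
  // HTTP header names are case-insensitive, so normalise them first.
  const normalised = {};
  for (const [name, value] of Object.entries(requestHeaders)) {
    normalised[name.toLowerCase()] = value;
  }
  if (normalised['x-requested-with'] !== 'XMLHttpRequest') {
    return {};
  }
  return {
    'Expires': new Date(0).toUTCString(), // a date in the past
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
  };
}

const ajaxHeaders = noCacheHeadersFor({ 'X-Requested-With': 'XMLHttpRequest' });
const pageHeaders = noCacheHeadersFor({ 'Accept': 'text/html' });
```

Only the AJAX request gets the three headers; the normal page load is left alone, so the browser can keep caching it.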

Hope it helps somebody! If you used it, let me know in comments. I’d love to know!

Executing Javascript from an external Iframe

What’s the problem?

Maybe you’ve made a deal with a partner website, or you’re just loading a page from another server. The point is: you’ve got an IFrame on your page coming from another domain.
All is well (except for Google, they don’t like iframes) until you want some client-side interaction coming from that page.

“Great!” you say. “I’ll just put JS in the IFrame, and I’ll be fine.” But hold your horses, cowboy, there are two things stopping you, as a great developer, from doing so:

  • You don’t want your JS scattered; you want it nice, organised and centralized
  • You’re visually constrained to the IFrame

So how do we execute JS on the parent frame?

Directly from the IFrame? You can’t! The browser’s same-origin policy says you can’t call functions defined in pages coming from another domain (kind of like loading JSON/Ajax from another domain).

But in that problem also lies the solution: have the page on the other server load a proxy page from your own server!
Let me make myself a bit clearer with some diagrams:

Normally, you’d have two pages, where page A contains an IFrame to page B:
[Diagram: Page A to Page B]
Now we’re introducing a proxy page on the same server as page A. Page B contains an IFrame to the proxy page:
[Diagram: Page A to B to Proxy]
All done! Now you can execute JS, like so:
[Diagram: all done]

The Code!

So how do we do it? Here you go!
index.html (on your server)

<html>
	<head>
		<script type="text/javascript">
			function alertme(str)
			{
				alert("String: " + str);
			}
		</script>
	</head>
	<body>
		<iframe src="http://yourpartner.com/iframe.html"></iframe>
	</body>
</html>

proxy.html (on your server)

<html>
	<head>
		<script type="text/javascript">
			function gup( name )
			{
			  name = name.replace(/[\[]/,"\\\[").replace(/[\]]/,"\\\]");
			  var regexS = "[\\?&]"+name+"=([^&#]*)";
			  var regex = new RegExp( regexS );
			  var results = regex.exec( window.location.href );
			  if( results == null )
			    return "";
			  else
			    return results[1];
			}
			eval("top."+gup("execute"));
		</script>
	</head>
</html>

And finally!
iframe.html (on any other server)

<html>
        <head>
        </head>
        <body>
        Hi There!
        <iframe src="http://yourserver.com/proxy.html?execute=alertme(123);"></iframe>
        </body>
</html>

What you did there, I don’t quite see it

In the page on the other server, I pass a function call as an argument to my proxy.
My proxy then gets this function out of the parameter, and executes it through eval()!

Warning

Handle with care: allowing anyone to execute JS on/from your server through a URL parameter opens up a whole new spectrum of XSS attacks.
That sandbox was created for a reason!
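One way to limit the damage is to not eval() the parameter at all, and only accept calls to functions you listed yourself. The sketch below is a variation on the proxy, not the code above: the fn and arg parameter names and the dispatch helper are assumptions I made up for the example, and the handler just returns a string so it can run anywhere.

```javascript
// Hypothetical whitelist of functions the partner page is allowed to trigger.
// In the real proxy page the handler would call top.alertme(arg) instead.
const handlers = {
  alertme: (arg) => 'String: ' + arg,
};

// Looks up the requested function by name and refuses anything not listed.
function dispatch(query, allowed) {
  const params = new URLSearchParams(query);
  const name = params.get('fn');
  if (!Object.prototype.hasOwnProperty.call(allowed, name)) {
    return null; // unknown or missing function name: do nothing
  }
  return allowed[name](params.get('arg') || '');
}

const ok = dispatch('?fn=alertme&arg=123', handlers);
const refused = dispatch('?fn=eval&arg=alert(document.cookie)', handlers);
// ok === 'String: 123', refused === null
```

The partner page would then point its IFrame at proxy.html?fn=alertme&arg=123, and the proxy would pass window.location.search to dispatch. It is less flexible than eval(), but nobody can run arbitrary code through your proxy anymore.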

Guide to Google’s mod_pagespeed

Google recently released a new Apache module: mod_pagespeed.
Here are my findings:

Introduction: goal

Often, when developing websites or web applications, you find yourself saying: “I’ll quickly write that piece of CSS inline”, or “Fudge it, I’ll leave compressing that piece of JavaScript for when I get out of the development phase”, and maybe even “Never mind resizing that image, I’ll just use ‘width’ and ‘height’ to get the dimensions right.”
And oh, how you promise yourself you’ll fix those issues later. But let’s be honest: deadlines are cruel.

So a lot of websites go live without many speed optimizations. They get a bad score in YSlow or PageSpeed.
And the word on the street is that Google prefers websites that load faster!

Optimizing your CSS/JS/images takes time. Compressing your content and combining CSS/JS also makes it difficult to adjust any of it later: you’ll have to dig through compacted code. Nobody likes that.

The solution

The new mod_pagespeed is an Apache output filter. This means your application renders the page (in PHP, Java, Ruby, …) and, just before Apache serves the HTML to the browser, the filter comes into action.

You can configure it to do a lot of things. I listed the ones I find most important below, but you can get the full feature list here.

  • Compress CSS and JS (less traffic)
  • Move inline CSS/JS to an external file (so they can get cached)
  • Combine external CSS/JS to one file (less requests)
  • Caching (of HTML,CSS,JS,images)
  • Automatically resize images based on the ‘width’ and ‘height’ attributes of an img-tag (less traffic)
  • Add ‘width’ and ‘height’ attributes if you forgot any (usability)
  • Base64 encode images and include them in HTML when small (less requests)

Bottom line: it’s genius.

My Configuration

I played around with mod_pagespeed today. Here’s the configuration I came up with (I added comments).
(Your configuration resides in /etc/apache2/mods-enabled/pagespeed.conf)

<IfModule pagespeed_module>
	ModPagespeed on
	AddOutputFilterByType MOD_PAGESPEED_OUTPUT_FILTER text/html
 
	ModPagespeedFileCachePath            "/var/mod_pagespeed/cache/"
	ModPagespeedGeneratedFilePrefix      "/var/mod_pagespeed/files/"
 
	ModPagespeedRewriteLevel CoreFilters
 
 
	#Add head section to HTML (if not already there)
	ModPagespeedEnableFilters add_head
	#Move CSS and JS to outline (external file)
	ModPagespeedEnableFilters outline_css,outline_javascript
	#If inline CSS is used, move this to head section
	ModPagespeedEnableFilters move_css_to_head
	#Combine external CSS files into 1 file
	ModPagespeedEnableFilters combine_css
	#Compress CSS and JS by removing whitespace/comments
	ModPagespeedEnableFilters rewrite_css,rewrite_javascript
	#Cache/compress images. Could also move the image into the HTML code (base64 encoded) if the image is small enough
	ModPagespeedEnableFilters rewrite_images
	#Add longer expires headers
	ModPagespeedEnableFilters extend_cache
	#Insert "width" and "height" attributes if not used
	ModPagespeedEnableFilters insert_img_dimensions
	#Remove HTML comments
	ModPagespeedEnableFilters remove_comments
	# Removes quotes around HTML attributes that are not lexically required
	ModPagespeedEnableFilters remove_quotes
 
	#pagespeed enabled domain 1
	ModPagespeedDomain tv.bartv.be
	#Pagespeed enabled domain 2
	ModPagespeedDomain cloudcast.bartv.be
	#pagespeed enabled domain 3
	ModPagespeedDomain notes.bartv.be
 
	#Maximum size of the file cache
	ModPagespeedFileCacheSizeKb          10240
	#Interval at which the cache is cleaned
	ModPagespeedFileCacheCleanIntervalMs 3600000
	ModPagespeedLRUCacheKbPerProcess     1024
	ModPagespeedLRUCacheByteLimit        16384
	#Minimum allowed bytes in CSS before exporting to external CSS
	ModPagespeedCssOutlineMinBytes       1000
	#Minimum allowed bytes in JS before exporting to external JS
	ModPagespeedJsOutlineMinBytes        3000
	#Maximum allowed filesize for embedding images inline (base64)
	ModPagespeedImgInlineMaxBytes        2048
 
</IfModule>

Careful with write permissions

On my webserver, I run every website under a different user account (e.g. mycoolwebsite.bartv.be is run by user ‘coolwebsite’ or something). They all belong to the group ‘www-data’.
One consequence of this is that the mod_pagespeed cache gets written by the user running the website. So be careful with the write permissions on /var/mod_pagespeed/.

Footnote

Even though this article might suggest otherwise, I’m in no way promoting quick and dirty development.
You could think: to hell with it, I’ll just write whatever I want, mod_pagespeed will solve it for me. Don’t.

mod_pagespeed is great, but in my eyes its value is that it helps you find issues you might have missed, and it lets you use your development CSS/JS files without compressing them first.
But don’t use it to clean up your mess!