Blog

Page-level caching with Nginx

Posted by on the 6th April 2008 @ 8:51pm

In a further attempt to make my websites withstand the Digg Effect, I have looked into getting Nginx, a lightweight HTTP server, to perform page-level caching.

Nginx can act as a reverse proxy, forwarding any HTTP request it receives to another web server. It can also store the response in a file, which it can then serve for future requests.

So I installed Nginx:

cd /usr/src
wget http://sysoev.ru/nginx/nginx-0.6.29.tar.gz
tar zxvf nginx-0.6.29.tar.gz

cd nginx-0.6.29
./configure --with-http_gzip_static_module --with-http_ssl_module
make
make install

Next, I moved Apache over to another port, e.g. 8080. To do this I changed the Listen and NameVirtualHost directives to use this port, and updated each VirtualHost to match. I also added the following, so that the X-Real-IP header that Nginx sends to Apache is made available to scripts:

SetEnvIf X-Real-IP "^(.*)$" REMOTE_HOST=$1 REMOTE_ADDR=$1

I edited /usr/local/nginx/conf/nginx.conf, commented out the default server, and added:

include vhosts/*.conf;

This lets you keep each Nginx server configuration in its own file.

I also set Nginx's user to apache, so that Apache can write to the cache files as well.

I made the /usr/local/nginx/vhosts directory, and created a new conf:

server {
    listen 80;
    server_name www.example.com;

    # if the request uri was a directory, store the index page name
    if ($request_uri ~ /$) {
        set $store_extra index.html;
    }

    # proxy module defaults
    proxy_store_access   user:rw  group:rw  all:r;
    proxy_set_header  X-Real-IP  $remote_addr;
    proxy_set_header  Host       $host;

    # if a precompiled gzip of the file exists, use it and force HTTP proxies
    # to use separate caches based on User-Agent
    gzip_vary on;
    gzip_static on;

    location / {
        root /var/www/${host}/cache;
        index index.html;

        # set the location the proxy will store the data to. Add the index page
        # name if the uri was a directory (nginx can't normally store these)
        proxy_store $document_root${request_uri}${store_extra};

        # go through the proxy if there is no cache
        if (!-f $document_root${request_uri}${store_extra}) {
            proxy_pass http://localhost:8080;
        }

        # workaround. headers module doesn't take into account proxy response
        # headers. It overwrites the proxy Cache-Control header, causing
        # private/no-cache/no-store to be wiped, so only set if not using proxy
        if (-f $document_root${request_uri}${store_extra}) {
            expires 0;
        }
    }

    # don't cache admin folder, send all requests through the proxy
    location /admin {
        proxy_pass http://localhost:8080;
    }

    # handle static files directly. Set their expiry time to max, so they'll
    # always use the browser cache after first request
    location ~* \.(css|js|png|jpe?g|gif|ico)$ {
        root /var/www/${host}/http;
        expires max;
    }
}

Next I made the cache directory, set its ownership so that Nginx could write to it, and started Nginx:

mkdir /var/www/www.example.com/cache
chown apache /var/www/www.example.com/cache

/usr/local/nginx/sbin/nginx
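
To check which file a given request ends up in, the path logic of the proxy_store line can be mirrored in a small shell function. This is only a sketch: the name cache_path is made up for illustration, and it assumes the /var/www/<host>/cache layout from the config above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nginx): print the file that the
# proxy_store line in the config above would write for a host and URI.
# Assumes the cache root is /var/www/<host>/cache, as in that config.
cache_path() {
    host=$1
    uri=$2
    root="/var/www/${host}/cache"
    case "$uri" in
        # URI ends in a slash: the config appends index.html
        */) printf '%s%sindex.html\n' "$root" "$uri" ;;
        # anything else is stored under its own name
        *)  printf '%s%s\n' "$root" "$uri" ;;
    esac
}

# Example calls:
#   cache_path www.example.com /blog/
#   cache_path www.example.com /about
```

Since $request_uri includes the query string, a request with arguments is stored under a file name that contains them, and the function behaves the same way.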

All was working fine until I tested some pages that I didn't want cached. PHP had set the Cache-Control headers so that proxies wouldn't cache or store them (private, no-cache, no-store), but Nginx was caching them anyway. I modified Nginx's source so that it wouldn't do this. The source can be found at the end of this blog entry.
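
The rule itself is simple enough to sketch in shell. This is not the Nginx patch, only an illustration of the check described above; the function name is_storable is made up, and it only looks at the three directives mentioned (private, no-cache, no-store) in their plain form.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if a Cache-Control header
# value allows a shared cache to store the response, fail otherwise.
# Only the plain directives private, no-cache and no-store are checked.
is_storable() {
    # lowercase the value, strip spaces and tabs, wrap it in commas so
    # that each directive can be matched as ",name,"
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' \t')
    case ",${value}," in
        *,private,*|*,no-cache,*|*,no-store,*) return 1 ;;
    esac
    return 0
}

# Example:
#   if is_storable "public, max-age=60"; then echo "ok to cache"; fi
```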

Next, to work around the Internet Explorer 6 bug whereby pages sent with the HTTP header "Vary: Accept-Encoding" are not cached, I modified the code to send "Vary: User-Agent" instead. I have included a patch for this below as well.

Now I have a fully static website, which runs on minimal resources and keeps working even if Apache or MySQL goes down. All that's left is:

  • Store extra proxy response headers, such as Content-Type, to be played back. My sites don't in fact use file extensions for dynamic content, so Nginx can't match up the file to its associated mime type. At the moment I'm forcing the default type to text/html.
  • Invalidate the appropriate parts of the cache when the dynamic content has changed, otherwise I have to manually delete the cache files to get updates.
  • Store the entire site in the cache. At the moment, Nginx is only using the cache if it has already stored the proxy response.
  • Store a gzipped version of the cache for browsers that support gzipped content.
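
Until those are solved properly, two of the items above (invalidation and gzipped copies) can be approximated from the shell. The following is a rough sketch under stated assumptions: purge_uri, precompress_cache and CACHE_ROOT are made-up names, the cache layout is the one from the config above, and it relies on gzip_static picking up a .gz file that sits next to the cached file.

```shell
#!/bin/sh
# Assumed cache location; override by setting CACHE_ROOT beforehand.
CACHE_ROOT=${CACHE_ROOT:-/var/www/www.example.com/cache}

# Hypothetical helper: remove the cached copy of one URI together with
# its .gz twin, so the next request goes through the proxy again.
# The URI is treated as trusted input (no ".." filtering is done).
purge_uri() {
    uri=$1
    case "$uri" in
        */) file="${CACHE_ROOT}${uri}index.html" ;;
        *)  file="${CACHE_ROOT}${uri}" ;;
    esac
    rm -f -- "$file" "$file.gz"
}

# Hypothetical helper: write a .gz copy next to every cached file that
# has none yet, or whose copy is older than the cached file.
precompress_cache() {
    find "$CACHE_ROOT" -type f ! -name '*.gz' ! -name '*.gz.tmp' |
    while IFS= read -r f; do
        if [ ! -f "$f.gz" ] || [ "$f" -nt "$f.gz" ]; then
            # compress to a temporary name, then rename, so a
            # half-written .gz is never visible under its final name
            gzip -9 -c -- "$f" > "$f.gz.tmp" && mv -- "$f.gz.tmp" "$f.gz"
        fi
    done
}

# Example:
#   purge_uri /blog/        # after editing the blog index
#   precompress_cache       # e.g. from cron
```

Running precompress_cache from cron and calling purge_uri from whatever script changes the content would cover the simple cases; anything that affects many pages at once would still need a smarter invalidation scheme.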

The patch files I have created are for the latest version (0.6.29). To patch your copy, run:

cd /usr/src/nginx-0.6.29
patch -p0 < /path/to/patch.diff

Zend_Cache_Frontend_Page - FAIL

Posted by on the 6th April 2008 @ 3:14pm

In my attempts to create the most Digg-resilient website, I decided there was only one thing for it: full page-based caching on the server.

My first foray into this was trying to implement Zend's Zend_Cache_Frontend_Page.

However, I found a few show-stopping bugs in it, involving browser cache responses (304 Not Modified) and HTTP redirects. The problem is that the frontend caches the page whatever HTTP status is sent. For instance, when telling the browser that its copy is still current:

header('HTTP/1.1 304 Not Modified');
exit;

If this ran while the cache entry was invalid, the cache would store a blank page. Any request after this would be served HTTP/1.1 200 OK with a blank page.

It seems that the cache should ideally only store the page on HTTP status 200 OK; otherwise the example above, and any other statuses (stati?), will be cached.

So I had a look into fixing the bug. As far as I can tell, there is no way to retrieve the HTTP status that has been sent (headers_list doesn't appear to return it). A fix would have to involve an addition to the PHP binary.

Alternatively, there would have to be another way to set the HTTP status that also stores it in a PHP variable for future lookup.

Having said that, Zend_Cache is a very nice and easy-to-use system for caching data through abstracted backends. Page caching is only one of its frontends.