Blog

Zend_Form's flaw in an MVC environment

Posted by on the 19th April 2008 @ 3:51pm

Zend_Form, the latest and greatest addition to the Zend Framework in version 1.5, is a fusion of the best bits of Zend_Filter_Input and the Zend_View_Helper system. At first glance it looks like the ideal system for building anything from the simplest to the most complex forms, and this is how most people will see it.

I recently started using it, and was instantly shocked by a fatal flaw in its design: where am I supposed to put it?

I try to follow best practices as closely as I can, so even when I'm not using the Zend MVC implementation, I still try to keep to MVC-style principles. Essentially, I always try to separate control logic from view layout.

Zend_Form is about as bad as it gets at separating the two. Initialisation involves specifying Zend_Form_Element objects, which map directly to their associated Zend_View_Helper classes.

// controller
$form = new Zend_Form();
$form
    ->setAction('/submit')
    ->setMethod('post')
    ->addElement('text', 'user-name', array(
        'required' => true,
        'value' => 'Fred',
        'validators' => array(
            // validator name, breakChainOnFailure, constructor options
            array('stringLength', false, array(1, 20)),
        )
    ))
    ->addElement('textarea', 'user-comment', array(
        'required' => true,
        'validators' => array(
            array('stringLength', false, array(1, 255)),
        )
    ))
    ->addElement('submit', 'action', array(
        'label' => 'Submit',
    ));

// isValid() needs the submitted data to validate against
if (!empty($_POST['action'])) {
    if ($form->isValid($_POST)) {
        // insert comment
    }
}

// view
echo $form->render();

This snippet of code will create a simple form containing a text box and a textarea. If submitted, the form will be validated, and the comment will be added if it passes.

The assumption is that you put the form initialisation in the controller, as that is where validation and database insertion should be performed. However, that means the controller has to define the HTML tag type of each element: in this example, that user-name is an <input type="text" /> and user-comment is a <textarea></textarea>.

This is view layout information, not control logic. The only place it belongs is in the view script. However, the above is the only way to do it, hence this component is not suited to MVC practices.

Now, considering the Zend Framework tries to keep to MVC principles, why is it that Zend_Form completely disregards this? I would assume that this may have been due to a lack of communication between teams.

My first thought was: how can this be done better, MVC style? The trouble is that a display element may be complex and need control logic associated with it. An element may hold a scalar value, or offer multiple predefined options.

So, this could be represented by declaring elements as primitive types, such as:

  • string
  • integer
  • enum
  • set

There could also be more complex types which define a bit more control behaviour:

  • date

These types have several obvious HTML element mappings, e.g. strings map to <input type="text" /> or <textarea></textarea>, enums map to <select></select> or multiple <input type="radio" />. A developer doesn't always need to care about enforcing a specific type, so the Form control could choose for itself.

An example of this in use would be:

// controller
$form = new My_Form();
$form
    ->setAction('/submit')
    ->setMethod('post')
    ->addElement('string', 'user-name', array(
        'required' => true,
        'validators' => array(
            array('stringLength', false, array(1, 20)),
        )
    ))
    ->addElement('string', 'user-comment', array(
        'required' => true,
        'validators' => array(
            array('stringLength', false, array(1, 255)),
        )
    ));

if ($form->posted()) {
    if ($form->isValid()) {
        // insert comment
    }
}

// view
$form->setElementHelper('string', 'text'); // already set as default
$form->addElement('submit', 'action', array(
        'label' => 'Submit',
    ));
$form->getElement('user-comment')->setHelper('textarea');
echo $form->render();

This would create exactly the same form, but conform to MVC standards.
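To make the idea concrete, here is a minimal, self-contained sketch of how such a type-to-helper mapping might work. It does not use Zend Framework at all: the TypedForm class and its methods are hypothetical names invented for illustration, and the rendering is deliberately naive.

```php
<?php
// Hypothetical sketch, not part of Zend Framework. Elements carry an
// abstract type; the view decides which HTML control renders each one.
class TypedForm
{
    private $elements = array();                   // element name => abstract type
    private $overrides = array();                  // element name => helper for that element
    private $helpers = array('string' => 'text');  // abstract type => default helper

    // Controller side: declare what the data is, not how it looks.
    public function addElement($type, $name)
    {
        $this->elements[$name] = $type;
        return $this;
    }

    // View side: change the default control for a whole type.
    public function setElementHelper($type, $helper)
    {
        $this->helpers[$type] = $helper;
        return $this;
    }

    // View side: change the control for a single element.
    public function setHelper($name, $helper)
    {
        $this->overrides[$name] = $helper;
        return $this;
    }

    public function render()
    {
        $html = '';
        foreach ($this->elements as $name => $type) {
            // Per-element override first, then the type default, then 'text'.
            if (isset($this->overrides[$name])) {
                $helper = $this->overrides[$name];
            } elseif (isset($this->helpers[$type])) {
                $helper = $this->helpers[$type];
            } else {
                $helper = 'text';
            }
            $safeName = htmlspecialchars($name);
            if ($helper === 'textarea') {
                $html .= "<textarea name=\"$safeName\"></textarea>\n";
            } else {
                $html .= "<input type=\"$helper\" name=\"$safeName\" />\n";
            }
        }
        return $html;
    }
}

// controller
$form = new TypedForm();
$form->addElement('string', 'user-name')
     ->addElement('string', 'user-comment');

// view
$form->setHelper('user-comment', 'textarea');
echo $form->render();
```

The controller never mentions an HTML tag here. Only the view-side calls do, which is the separation argued for above.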

Page-level caching with Nginx

Posted by on the 6th April 2008 @ 8:51pm

In a further attempt to modify my websites so that they can withstand the Digg Effect, I have looked into getting Nginx, a lightweight HTTP server, to perform page-level caching.

Nginx can act as a reverse proxy, forwarding any HTTP request it receives to another web server. It can also store the response in a file, which can then be served for future requests.

So I installed Nginx:

cd /usr/src
wget http://sysoev.ru/nginx/nginx-0.6.29.tar.gz
tar zxvf nginx-0.6.29.tar.gz

cd nginx-0.6.29
./configure --with-http_gzip_static_module --with-http_ssl_module
make
make install

Next, I moved Apache over to another port, e.g. 8080. To do this I changed the Listen and NameVirtualHost directives to this port, and updated the VirtualHosts to match. I also added the following, so that the X-Real-IP header Nginx sends to Apache is made available to scripts:

SetEnvIf X-Real-IP "^(.*)$" REMOTE_HOST=$1 REMOTE_ADDR=$1

I edited /usr/local/nginx/conf/nginx.conf, and commented out the default server, and added:

include vhosts/*.conf;

This makes it so you can specify multiple nginx server configurations in different files.

I also set Nginx's user to apache, so that Apache can also write to the cache files.

I made the /usr/local/nginx/vhosts directory, and created a new conf:

server {
    listen 80;
    server_name www.example.com;

    # if the request uri was a directory, store the index page name
    if ($request_uri ~ /$) {
        set $store_extra index.html;
    }

    # proxy module defaults
    proxy_store_access   user:rw  group:rw  all:r;
    proxy_set_header  X-Real-IP  $remote_addr;
    proxy_set_header  Host       $host;

    # if a precompiled gzip of the file exists, use it and force http proxies
    # to use separate caches based on User-Agent
    gzip_vary on;
    gzip_static on;

    location / {
        root /var/www/${host}/cache;
        index index.html;

        # set the location the proxy will store the data to. Add the index page
        # name if the uri was a directory (nginx can't normally store these)
        proxy_store $document_root${request_uri}${store_extra};

        # go through the proxy if there is no cache
        if (!-f $document_root${request_uri}${store_extra}) {
            proxy_pass http://localhost:8080;
        }

        # workaround. headers module doesn't take into account proxy response
        # headers. It overwrites the proxy Cache-Control header, causing
        # private/no-cache/no-store to be wiped, so only set if not using proxy
        if (-f $document_root${request_uri}${store_extra}) {
            expires 0;
        }
    }

    # don't cache admin folder, send all requests through the proxy
    location /admin {
        proxy_pass http://localhost:8080;
    }

    # handle static files directly. Set their expiry time to max, so they'll
    # always use the browser cache after first request
    location ~* \.(css|js|png|jpe?g|gif|ico)$ {
        root /var/www/${host}/http;
        expires max;
    }
}

Next I made the cache directory, and set the ownership so that nginx could write to it, and started up Nginx:

mkdir /var/www/www.example.com/cache
chown apache /var/www/www.example.com/cache

/usr/local/nginx/sbin/nginx

All was working fine until I tested some pages that I didn't want cached. PHP had set the Cache-Control headers so that proxies wouldn't cache or store them (private, no-cache, no-store), but Nginx was caching them anyway. I modified Nginx's source so that it wouldn't do this. The source can be found at the end of this blog entry.

Next, to address the Internet Explorer 6 bug whereby pages served with the HTTP header "Vary: Accept-Encoding" are not cached, I modified the code to use "Vary: User-Agent" instead. I have also included a patch for this below.

Now I have a fully static website, which will work with the minimum of resources, and will work even if Apache or MySQL go down. All that's left is:

  • Store extra proxy response headers, such as Content-Type, to be played back. My sites don't in fact use file extensions for dynamic content, so Nginx can't match up the file to its associated mime type. At the moment I'm forcing the default type to text/html.
  • Invalidate the appropriate parts of the cache when the dynamic content has changed, otherwise I have to manually delete the cache files to get updates.
  • Store the entire site in the cache. At the moment, Nginx is only using the cache if it has already stored the proxy response.
  • Store a gzipped version of the cache for browsers that support gzipped content.
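For the invalidation item in the list above, one possible approach is to have the application delete the stored file whenever the content behind a URI changes. The sketch below assumes the cache layout produced by the configuration above (the request URI appended to the cache root, plus index.html for directory-style URIs). The function names are hypothetical.

```php
<?php
// Hypothetical helper: maps a request URI to the file that proxy_store
// would have written under the configuration above.
function cachePathForUri($cacheRoot, $requestUri)
{
    // Directory-style URIs are stored with index.html appended,
    // mirroring the $store_extra variable in the server block.
    if (substr($requestUri, -1) === '/') {
        $requestUri .= 'index.html';
    }
    return rtrim($cacheRoot, '/') . $requestUri;
}

// Hypothetical helper: deletes the cached copy of one page, if present.
// Returns true only when a file was actually removed.
function invalidateCachedPage($cacheRoot, $requestUri)
{
    // Refuse ".." so a crafted URI cannot point outside the cache root.
    if (strpos($requestUri, '..') !== false) {
        return false;
    }
    $path = cachePathForUri($cacheRoot, $requestUri);
    if (!is_file($path)) {
        return false;
    }
    return unlink($path);
}
```

A call such as invalidateCachedPage('/var/www/www.example.com/cache', '/blog/') after saving a post would force the next request to go through the proxy again. This only works if the PHP process has write access to the cache directory, which is why Nginx runs as the apache user in this setup.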

The patch files I have created are for the latest version (0.6.29). To patch your copy do:

cd /usr/src/nginx-0.6.29
patch -p0 < /path/to/patch.diff

Zend_Cache_Frontend_Page - FAIL

Posted by on the 6th April 2008 @ 3:14pm

In my attempts to create the most Digg resilient website, I decided there was only one thing for it, full page-based caching on the server.

My first foray into this was trying to implement Zend's Zend_Cache_Frontend_Page.

However, I found a few show-stopping bugs in it involving browser cache validation (HTTP status 304 Not Modified) and HTTP redirects. The problem is that the frontend caches the page whatever HTTP status is sent. For instance, when notifying the browser that its copy is still valid:

header('HTTP/1.1 304 Not Modified');
exit;

If this ran while the cache entry was invalid, the cache would store a blank page. Any request after this would be served HTTP/1.1 200 OK with a blank page.

It seems that the cache should ideally only store the page on HTTP status 200 OK, otherwise the example above, and any other statuses (stati?), will be cached.

So I had a look into fixing the bug. It turns out that, as far as I know, there is no way to retrieve the HTTP status that has been sent (headers_list() doesn't appear to return it). A fix would have to involve an addition to the PHP binary.

Alternatively, there would have to be another way to invoke the HTTP status, whilst storing it in a PHP variable for future lookup.
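A minimal sketch of that alternative might look like the following. The HttpStatus class is a hypothetical name, not part of PHP or Zend Framework, and it only helps if every status change in the application goes through it.

```php
<?php
// Hypothetical wrapper: every status change goes through this class so the
// code can be read back later, e.g. by a page cache deciding whether to store.
class HttpStatus
{
    private static $code = 200; // PHP sends 200 unless told otherwise

    public static function send($code, $reason)
    {
        self::$code = (int) $code;
        // Guard against "headers already sent" warnings once output has begun.
        if (!headers_sent()) {
            header('HTTP/1.1 ' . self::$code . ' ' . $reason);
        }
    }

    public static function get()
    {
        return self::$code;
    }

    // A page cache would consult this before storing the buffered output.
    public static function isCacheable()
    {
        return self::$code === 200;
    }
}

HttpStatus::send(304, 'Not Modified');
var_dump(HttpStatus::isCacheable());
```

One limitation: a direct header('Location: ...') call changes the status to 302 behind the wrapper's back, so redirects would need to go through a similar method to be tracked.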

Having said that, Zend_Cache is a very nice and easy to use system for caching data through abstracted backends. Page caching is only one of their frontends.

Hierarchical data in MySQL using nested sets

Posted by on the 1st April 2008 @ 9:39pm

As you may know, MySQL is a relational database system. It consists of flat tables, which can be joined together in queries. Relations between these tables can only be specified in a way that is one-to-one/one-to-many.

This suits most situations, but when you start getting to hierarchical data, such as multiple level categories (as used on this site), these types of relations start to become non-optimal.

Take for instance the adjacency list method. Each entry in the table is given a column with the id of its parent node:

id  name         parent_id
1   parent       0
2   child 1      1
3   sub-child 1  2

In order to find all descendants of a node, there would need to be a query once for the parent, then once per node:

function getNodes($db, $parentId) {
    $sql = 'SELECT * FROM nodes WHERE parent_id = ?';
    $nodes = array();
    foreach ($db->fetchAll($sql, array($parentId), Zend_Db::FETCH_OBJ) as $node) {
        $node->subnodes = getNodes($db, $node->id);
        $nodes[] = $node;
    }
    return $nodes;
}

An alternative method is to use nested sets. This takes advantage of MySQL's index range selection by using two columns to specify a range in which children exist:

id  name        left  right
1   parent      1     6
2   child 1     2     5
3   subchild 1  3     4

In this case all descendants of a node can be found between the parent node's left and right columns.
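The range test can be illustrated in plain PHP on the three example rows, without a database. The helper names below are hypothetical. The same comparison is what a WHERE clause on the left column would perform.

```php
<?php
// Rows copied from the nested set example table.
$nodes = array(
    array('id' => 1, 'name' => 'parent',     'left' => 1, 'right' => 6),
    array('id' => 2, 'name' => 'child 1',    'left' => 2, 'right' => 5),
    array('id' => 3, 'name' => 'subchild 1', 'left' => 3, 'right' => 4),
);

// Hypothetical helper: a node is a descendant when its left value falls
// strictly between the parent's left and right values.
function descendantsOf(array $nodes, array $parent)
{
    $found = array();
    foreach ($nodes as $node) {
        if ($node['left'] > $parent['left'] && $node['left'] < $parent['right']) {
            $found[] = $node['name'];
        }
    }
    return $found;
}

// Each node consumes two numbers in the sequence, so the number of
// descendants follows from the parent's own row without any scan.
function descendantCount(array $node)
{
    return (int) (($node['right'] - $node['left'] - 1) / 2);
}

print_r(descendantsOf($nodes, $nodes[0]));
```

A leaf node always has right equal to left plus one, which is why descendantCount() gives zero for it.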

Combining this with the adjacency list method, you can optimise the conversion of a MySQL result set into a multi-level array of objects:

function getNodes($db, $parentId) {
    $root = (object) array(
        'id' => $parentId,
        // start with an empty list so a node without descendants returns array()
        'subnodes' => array(),
    );

    $stack = array($root);

    $sql = 'SELECT  nodes.*
FROM nodes, nodes AS parent
WHERE nodes.left > parent.left AND nodes.left < parent.right
AND parent.id = ?
ORDER BY nodes.left';

    foreach ($db->fetchAll($sql, array($parentId), Zend_Db::FETCH_OBJ) as $node) {

        while ($node->parent_id != $stack[count($stack)-1]->id) {
            array_pop($stack);
        }

        $stack[count($stack)-1]->subnodes[] = $node;
        array_push($stack, $node);
    }
    return $root->subnodes;
}

Of course this method is at the expense of insert/modification of nodes in the table, which would need more code to do, but usually these happen much less often than selects in production environments.
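As a rough idea of what that extra code involves, the sketch below recomputes every left and right value from the parent_id column with a depth-first walk. This is the simplest approach rather than the most efficient one: it renumbers the whole tree on each change. The function name and the array shapes are assumptions made for illustration.

```php
<?php
// Hypothetical helper: assigns left/right values by walking the adjacency
// list depth-first. $rows is a list of array('id' => ..., 'parent_id' => ...).
function buildNestedSet(array $rows, $rootParentId = 0)
{
    // Group child ids under their parent id.
    $children = array();
    foreach ($rows as $row) {
        $children[$row['parent_id']][] = $row['id'];
    }

    $bounds = array();  // id => array('left' => n, 'right' => m)
    $counter = 1;

    $walk = function ($parentId) use (&$walk, &$children, &$bounds, &$counter) {
        if (!isset($children[$parentId])) {
            return;
        }
        foreach ($children[$parentId] as $id) {
            // left is assigned on the way down, right on the way back up
            $bounds[$id] = array('left' => $counter++);
            $walk($id);
            $bounds[$id]['right'] = $counter++;
        }
    };
    $walk($rootParentId);

    return $bounds;
}

$rows = array(
    array('id' => 1, 'parent_id' => 0),
    array('id' => 2, 'parent_id' => 1),
    array('id' => 3, 'parent_id' => 2),
);
print_r(buildNestedSet($rows));
```

The returned values would then be written back with one UPDATE per node, ideally inside a transaction so readers never see a half-renumbered tree.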

Googlebot going crazy over Trac

Posted by on the 15th March 2008 @ 11:30pm

I was having a glance at the website's stats (I'm using AWStats to compile the Apache access logs), and noticed that Google had been hammering the site in the last couple of days:

Googlebot 2360+3 44.22 MB 18 Mar 2008 - 22:35

Needless to say I was a bit shocked to see that kind of activity considering I don't have much up on the site yet. I quickly found the source of the problem.

Apparently Google had been crawling my entire svn repository, which was fair enough: there it should only really be able to index the latest version.

It was also crawling my project area. The problem was that Trac's svn browser also gives access to the svn repository, so Google was crawling revisions, sort orders, annotations, the whole works.

66.249.72.105 - - [18/Mar/2008:22:32:28 +0000] "GET /projects/framework/browser/trunk/config/optimizer.defaults.ini?rev=9 HTTP/1.1" 200 10481 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.105 - - [18/Mar/2008:22:35:16 +0000] "GET /projects/framework/browser/trunk/library/Webtatic/Optimizer/Plugin/File.php?annotate=1&rev=2 HTTP/1.1" 200 49997 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Obviously, I should have put in a robots.txt rule to block search engines from indexing that part of the site. I have rectified that now with the following in /robots.txt:

User-agent: *
Disallow: /projects/framework/browser/

No doubt Google would penalise a site if this isn't done, as it would pick up a lot of what it would think of as duplicate content.  

Not just another WordPress blog

Posted by on the 15th March 2008 @ 7:12pm

... it's been written by me instead.

Welcome to Webtatic.com, my new technical blog and site for [L]GPL projects.

I currently have a backlog of posts which I've been planning to write, mostly on the subject of web development, which I will try to post here.

Some of the projects I have in store are:

  • Webtatic Optimizer - a set of extensible scripts which can optimise static files, such as JavaScript and CSS, combining and compressing them to lower bandwidth use and speed up website load times.
  • Webtatic Spider - a library for data mining sites using as few resources and as little time as possible, using asynchronous and pipelined TCP connections.