
Googlebot going crazy over Trac

Posted on the 15th March 2008 @ 11:30pm

I was having a glance at the website's stats (I'm using AWStats to compile the Apache access logs), and noticed that Google had been hammering the site over the last couple of days:

Googlebot 2360+3 44.22 MB 18 Mar 2008 - 22:35

Needless to say, I was a bit shocked to see that kind of activity, considering I don't have much up on the site yet. I quickly found the source of the problem.

Apparently Google had been crawling my entire SVN repository, which was fair enough, since through that route it should only really be able to index the latest version.

It was also crawling my project area. The problem with this is that Trac's repository browser also gives access to the SVN repository, so Googlebot was crawling every revision, sort order and annotation view too, the whole works.

66.249.72.105 - - [18/Mar/2008:22:32:28 +0000] "GET /projects/framework/browser/trunk/config/optimizer.defaults.ini?rev=9 HTTP/1.1" 200 10481 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.105 - - [18/Mar/2008:22:35:16 +0000] "GET /projects/framework/browser/trunk/library/Webtatic/Optimizer/Plugin/File.php?annotate=1&rev=2 HTTP/1.1" 200 49997 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
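To get a rough idea of how much of the crawl goes to those generated views, the access log can be tallied by the query parameters Googlebot requested. The following is a minimal Python sketch, assuming the log uses Apache's combined format as in the two lines above. The names googlebot_query_params and SAMPLE are made up for illustration, and matching on the user-agent string alone does not prove a request really came from Google.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Apache combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# The two log lines quoted in the post, used as sample input.
SAMPLE = [
    '66.249.72.105 - - [18/Mar/2008:22:32:28 +0000] '
    '"GET /projects/framework/browser/trunk/config/optimizer.defaults.ini?rev=9 HTTP/1.1" '
    '200 10481 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.72.105 - - [18/Mar/2008:22:35:16 +0000] '
    '"GET /projects/framework/browser/trunk/library/Webtatic/Optimizer/Plugin/File.php?annotate=1&rev=2 HTTP/1.1" '
    '200 49997 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]


def googlebot_query_params(lines):
    """Count Googlebot requests, grouped by the set of query parameter names."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        # Skip lines that do not parse or that come from other user agents.
        if match is None or "Googlebot" not in match.group("agent"):
            continue
        query = urlsplit(match.group("target")).query
        names = sorted({name for name, _ in parse_qsl(query, keep_blank_values=True)})
        counts[",".join(names) or "(none)"] += 1
    return counts


if __name__ == "__main__":
    for params, hits in googlebot_query_params(SAMPLE).most_common():
        print(params, hits)
```

On the two sample lines this counts one request under rev and one under annotate,rev. To run it on a real log, pass an open file object in place of SAMPLE.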

Obviously, I should have put in a robots.txt rule to stop search engines from crawling that part of the site. I have rectified that now with the following in /robots.txt:

User-agent: *
Disallow: /projects/framework/browser/
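A rule like this can be sanity-checked before relying on it. Here is a small sketch using Python's standard urllib.robotparser, assuming the two-line robots.txt above is the whole file. The helper name is_allowed is hypothetical, and Python's parser only approximates how a real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the post, assumed to be the complete file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /projects/framework/browser/
"""


def is_allowed(path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # One of the browser URLs from the access log, then the project root.
    print(is_allowed("/projects/framework/browser/trunk/config/optimizer.defaults.ini?rev=9"))
    print(is_allowed("/projects/framework/"))
```

The first check should come back False and the second True, since only paths under /projects/framework/browser/ are disallowed.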

I'd expect Google to penalise a site where this isn't done, as it would pick up a lot of what it would regard as duplicate content.