robots.txt

13 August, 2007 (22:02) | Search Engines, Security | By: Nick Dalton

Back in the days around 3 B.G. (Before Google), AltaVista was the new search engine on the block. In an effort to show off the power of their minicomputers, the AltaVista team at Digital decided to crawl and index the entire web, which was a new concept at the time. Many webmasters didn’t relish the idea of a “robot” program accessing every page on their web site, as this would add load to their web servers and increase their bandwidth costs. So in 1996 the Robots Exclusion Standard was created to address these concerns.

Using a simple text file called robots.txt you can instruct web crawlers (a.k.a. robots) to stay out of certain directories. Here is a very simple robots.txt which disallows all robots (User-agents) access to the /images directory.

User-agent: *
Disallow: /images

By disallowing /images you are also implicitly disallowing all subdirectories under /images, such as /images/logos, as well as any path that simply begins with /images, such as /images.html.
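
If you want to sanity-check this prefix behaviour yourself, Python’s standard urllib.robotparser module follows the same rule. This is just a minimal sketch; the example.com URLs and the AnyBot name are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /images",
])

# Every path that starts with /images is blocked, including /images.html.
for url in ("http://example.com/images/logos/logo.png",
            "http://example.com/images.html",
            "http://example.com/index.html"):
    print(url, "->", "allowed" if rp.can_fetch("AnyBot", url) else "blocked")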

Curiously there was no “Allow” directive in the first draft of the standard. It was added later, but it’s not guaranteed to be supported by all robots. So anything that is not specifically disallowed should be considered fair game for web crawlers.

To disallow access to your entire web site use a robots.txt like this:

User-agent: *
Disallow: /

If the User-agent is * then the following lines apply to all search engine robots. By specifying the signature of a particular web crawler as the User-agent you can give instructions to just that robot.

User-agent: Googlebot
Disallow: /google-secrets
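
To sketch how a robot-specific section plays out, here is the same check with urllib.robotparser. SomeOtherBot is a made-up name; the rule above binds only requests made on Googlebot’s behalf.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /google-secrets",
])

# Only Googlebot is held to this rule; other robots see no matching section.
print(rp.can_fetch("Googlebot", "http://example.com/google-secrets/page.html"))     # False
print(rp.can_fetch("SomeOtherBot", "http://example.com/google-secrets/page.html"))  # True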

Since the original spec was published, several search engines have extended the protocol. One popular extension is support for wildcards.

User-agent: Slurp
Disallow: /*.gif$

This prevents Yahoo! (whose web crawler is called Slurp) from indexing any file on your site that ends with “.gif”. Keep in mind that wildcards are not supported by all search engines, so you have to preface these lines with the appropriate User-agent line.
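
As far as I know the standard-library parser sticks to the original prefix-only matching, so here is a rough sketch of how a wildcard rule like /*.gif$ can be evaluated with a regular expression. The rule_matches helper is hypothetical, not part of any library.

import re

def rule_matches(pattern, path):
    """Check a robots.txt pattern against a URL path: '*' matches any
    run of characters, a trailing '$' anchors the match to the end,
    and anything else is a prefix match."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the robots.txt '*' back into '.*'
    regex = "^" + re.escape(pattern).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.search(regex, path) is not None

print(rule_matches("/*.gif$", "/images/photo.gif"))  # True  -> blocked for Slurp
print(rule_matches("/*.gif$", "/photo.gif.html"))    # False -> allowed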

You can combine several of the above techniques in one robots.txt file. Here’s a theoretical example. Note that a robot obeys only the most specific User-agent section that matches it, so Googlebot follows its own section and ignores the * section entirely; that is why Disallow: /bar has to be repeated in the Googlebot block.

User-agent: *
Disallow: /bar
User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /

This would result in the following access results for a few URLs:

URL                         Googlebot   Other robots
example.com/foo.html        Allowed     Allowed
example.com/food.html       Allowed     Allowed
example.com/foo/            Allowed     Allowed
example.com/foo/index.html  Allowed     Allowed
example.com/foo.gif         Allowed     Allowed
example.com/fu.html         Blocked     Allowed
example.com/bar.html        Blocked     Blocked
example.com/bar/index.html  Blocked     Blocked
example.com/img.gif         Blocked     Allowed

Computer programs are pretty good at following instructions like these. But for a human brain it can quickly get overwhelming, so I highly encourage you to keep it simple. One of the longer robots.txt files I’ve encountered is from www.seobook.com – it’s over 300 lines long. The site owner Aaron Wall is the author of the excellent SEO Book; he knows what he’s doing.

For us mortals there is a robots.txt analysis tool in Google’s webmaster tools. Highly recommended. Another good resource for more information on the Robots Exclusion Standard is www.robotstxt.org.
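
If you prefer to script the check yourself, the sketch below fetches a live robots.txt with Python’s urllib.robotparser and tests a couple of URLs against it. It implements the original standard rather than every vendor extension, and the example.com URLs are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # download and parse the live file

for url in ("http://example.com/", "http://example.com/images/logo.gif"):
    print(url, "->", "allowed" if rp.can_fetch("Googlebot", url) else "blocked")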

Today when companies are spending a lot of money to be included in search engine listings, the idea of excluding your content may seem quaint. But from a security perspective there are many valid reasons for limiting what a search engine indexes on your site. See my Digital Security Report for more information.
