The robots.txt file is a form of communication between your web site and the visiting robots (spiders) that index the content of its pages. Every search engine has a spider: Google has one, and so do Yahoo!, MSN and Ask. A well-written robots.txt file will improve your chances of ranking in the search engines.
Location, Location, Location
The robots.txt file resides in your root directory; for instance, mine is at http://BonsaiJon.com/robots.txt. It is just a text file: you can create it in Notepad and then upload it to your server via FTP.
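If you ever need to work out where a site's robots.txt lives from any page address, here is a minimal Python sketch. The function name robots_url is my own invention for illustration; it simply keeps the scheme and host and swaps the path for /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://BonsaiJon.com/feed/"))
```

Whatever page you start from, the result always points at the root of the host, which is the only place bots look for the file.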
Why Should I use a robots.txt file?
It’s important to create a robots.txt file so you can tell the bots what they can crawl and what they shouldn’t. This may come as a surprise to some of you but not all files on your website should be crawled. Some shouldn’t be crawled for security and some for search engine optimization and common sense.
For instance, if you have a Content Management System that you access from your /admin/ folder, you don’t want that appearing in the search engines, right? Google may crawl any website you access, especially if you have the Google Toolbar installed.
How do I do it and How Difficult is It?
It’s simpler than you think. There are 2 commands that are used most frequently in a robots.txt file, and you should get familiar with both: User-agent: and Disallow:
I will explain what each one means, and then we’ll skip to some examples.
One thing worth saying about commands: in your robots file, use only 1 command per line.
What is a User-agent?
Simple. Each bot that comes to visit your site has a name, that is the user-agent. Google’s bot is called Googlebot, MSN’s bot is called msnbot and Yahoo’s bot is called Slurp.
Therefore, if I want to communicate with one of the bots in particular, I just type the User-agent command followed by the bot name.
Ex. User-agent: Googlebot or User-agent: Slurp.
You can also pass a command to all bots at once by using the Asterisk (*) Wild Card.
Ex. User-agent: *
What should I tell the robot and Why?
As you might have guessed, there is no point in shouting someone’s name and then staying quiet; they’ll just think you’re not ok in the head. So let’s communicate with the bots.
Example 1. If we don’t want Google to Crawl our admin folder we would have the 2 lines below in our robots.txt file.
User-agent: Googlebot
Disallow: /admin/
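If you want to see how a bot would read these 2 lines, here is a small sketch using Python's built-in urllib.robotparser module. The test paths (/admin/login.php and /index.html) are made up for illustration, and real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The two lines from Example 1.
rules = """\
User-agent: Googlebot
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is kept out of /admin/ but can fetch everything else.
print(parser.can_fetch("Googlebot", "/admin/login.php"))
print(parser.can_fetch("Googlebot", "/index.html"))
# Slurp is not named in the file, so nothing is disallowed for it.
print(parser.can_fetch("Slurp", "/admin/login.php"))
```

Notice that the rule only applies to the bot named on the User-agent line; every other bot is still free to crawl the folder.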
Example 2. If you are using Wordpress and you don’t want any bots to spider your wp-admin folder, put the following 2 lines in your robots file:
User-agent: *
Disallow: /wp-admin/
Example 3. You can have more than one folder that shouldn’t be spidered, like your image folder, your rss feeds folder and a file called home2.php. The robots.txt in that scenario would be:
User-agent: Googlebot
Disallow: /feeds/
Disallow: /images/
Disallow: /home2.php
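If you manage several sites, you may prefer to generate the file rather than type it by hand. The sketch below is only one way to do it: build_robots is a hypothetical helper name, and it writes one User-agent line followed by one Disallow line per path, keeping to the one-command-per-line rule.

```python
def build_robots(groups: dict[str, list[str]]) -> str:
    """Build robots.txt text from a {bot name: [paths to disallow]} mapping."""
    lines = []
    for agent, paths in groups.items():
        lines.append(f"User-agent: {agent}")
        lines.extend(f"Disallow: {path}" for path in paths)
        lines.append("")  # a blank line separates one bot's group from the next
    return "\n".join(lines).rstrip("\n") + "\n"


print(build_robots({"Googlebot": ["/feeds/", "/images/", "/home2.php"]}))
```

Called with the folders from Example 3, it prints the same four lines shown above.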
Example 4. I don’t want Google To Crawl ANY of my pages. (This is only recommended for development servers)
User-agent: Googlebot
Disallow: /
Example 5. Disallowing files by extension. Let’s say we want to exclude all JavaScript .js files and Style Sheets .css files.
User-agent: Googlebot
Disallow: /*.js$
Disallow: /*.css$
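The * is a wild card and means “any sequence of characters” (letters, numbers, slashes and so on), and $ means “ending with”. Keep in mind that these wild cards are an extension that Google and the other major search engines understand; simpler bots may ignore them.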
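To get a feel for how * and $ behave, here is a rough Python sketch that turns a Disallow pattern into a regular expression. It is my own approximation of the matching described above, not any search engine's actual code, and the names rule_to_regex and is_blocked are hypothetical.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a Disallow pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any run of characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Rules always match from the start of the path; $ also pins the end.
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if the path is matched by the Disallow pattern."""
    return rule_to_regex(pattern).search(path) is not None


print(is_blocked("/*.js$", "/scripts/app.js"))      # True
print(is_blocked("/*.js$", "/scripts/app.js?v=2"))  # False
```

The second call shows why the $ matters: with a version string on the end, the address no longer ends in .js, so the rule does not match it.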
So Where does Search Engine Optimisation get into it?
SEO is all about getting rewarded for making it easier for the search engines to find the relevant content. By telling them not to crawl irrelevant content, we are essentially making them more efficient (not wasting their time and bandwidth). We will also be helping ourselves by not having duplicate content, especially when it comes to blogs.
Here are some tips on what you shouldn’t let bots crawl.
- Images
There is no major benefit to letting Google index your images. You might argue that it’s good to have an image from your site show up on the first page of a Google Images search. Ask yourself this: when was the last time you SOLD a product because someone was searching for an image in Google Images? My guess is that the answer is either very rarely or, most likely, never.
[Update 2010] Disallowing images doesn’t work with Google anymore; your robots file is simply ignored. I have tested this on multiple websites with no success.
- RSS and Atom Feeds
Here’s the dealio. You have optimized your site and made sure that irrelevant outbound links have the rel=nofollow attribute. You publish the document, it goes into your RSS feed, and what happens?
RSS Feeds are not capable of handling nofollow attributes. Feeds will also appear as duplicate content. If you have a look at my index page http://BonsaiJon.com and http://bonsaijon.com/feed/ you will notice a lot of similar content, right? That’s bad for SEO.
- Style Sheets and JavaScript
There is no logical reason why you should let any bot crawl these. Save your bandwidth.
- Wordpress Duplicate Content
We can eliminate loads of duplicate content from Wordpress by disallowing the /page/, /tag/ and /category/ folders. For Wordpress I also recommend disallowing /trackback/.
Here is an example that puts all of the above together:
User-agent: *
Disallow: /tag/
Disallow: /feed/
Disallow: /archives/
Disallow: /category/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /category/
Disallow: /page/
Disallow: /pages/
Disallow: /feed/
Disallow: /feed
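A list this long tends to pick up repeated lines as it grows. Repeats are harmless to bots, but they make the file harder to read. As a rough sketch, the hypothetical helper below drops repeated lines within each User-agent group while keeping the original order.

```python
def dedupe_rules(robots_text: str) -> str:
    """Remove repeated lines inside each User-agent group, keeping order."""
    seen = set()
    kept = []
    for line in robots_text.splitlines():
        rule = line.strip()
        if rule.lower().startswith("user-agent:"):
            seen = set()  # a new group starts, so its rules are tracked afresh
        elif rule in seen:
            continue  # exact repeat within the same group
        if rule:
            seen.add(rule)
        kept.append(line)
    return "\n".join(kept) + "\n"


sample = "User-agent: *\nDisallow: /feed/\nDisallow: /category/\nDisallow: /category/\nDisallow: /feed/\nDisallow: /feed\n"
print(dedupe_rules(sample))
```

Note that /feed/ and /feed are different rules, so both survive; only exact repeats are removed.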
How Can I Be Sure I Did It Right?
Easy, you can do this by using this Robots Text file Validator.
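If you would rather do a quick check on your own machine first, here is a small Python sketch that looks for the most common slips: a misspelled command, a missing colon, or a Disallow line that appears before any User-agent line. It is not a replacement for a proper validator. The function name lint_robots is hypothetical, and the list of accepted commands is my own assumption; it includes Allow, Sitemap and Crawl-delay, which this article does not cover.

```python
# Commands this sketch accepts (an assumption, not an official list).
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> list[tuple[int, str]]:
    """Return (line number, problem) pairs for suspicious robots.txt lines."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        field, sep, _value = line.partition(":")
        field = field.strip().lower()
        if not sep:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_agent = True
        elif field in {"allow", "disallow"} and not seen_agent:
            problems.append((number, "rule appears before any User-agent line"))
    return problems


print(lint_robots("User-agent: *\nDisalow: /wp-admin/\n"))
```

An empty list means the sketch found nothing to complain about; it says nothing about whether the rules block what you intended.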


Comments
As always Jon, an impeccable article!
Great Post!
‘Style Sheets and JavaScript’ was surprising for me but it’s true!
Thanks for the interesting post!
Very helpful and well explained. But is there not a duplicate in there? - Disallow: /category/
Dont agree with blocking the images - nonsense - it is like playing a lottery, why not to play and let them find you in Images index?
So outdated about the RSS feeds - so when u submit a XML sitemap - this is also a duplicate content?
If u blocking CSS and JS - how Google can know that u are a legitimate guidelines obeying webmaster? they will dl these only once if u dont change them…
anyway - have a good one!
good site