What is a robots.txt File?
Sometimes we need to let search engine robots know that certain information should not be retrieved and stored by them. The most common method for defining which information is to be "excluded" is the "Robots Exclusion Protocol", which most search engines support. Furthermore, it is possible to direct these instructions at specific engines, while allowing other engines to crawl the same content.
Should you have material which you feel should not appear in search engines (such as .cgi files or images), you can instruct spiders to stay clear of such files by deploying a "robots.txt" file, which must be located in your "root directory" and be of the correct syntax. Robots are said to "exclude" files defined in this file.
Using this protocol on your website is very easy and only calls for the creation of a single file which is called "robots.txt". This file is a simple text formatted file and it should be located in the root directory of your website.
So, how do we define what files should not be crawled by search engines? We use the "Disallow" statement!
Create a plain text file in a text editor (e.g. Notepad or WordPad) and save it in your root directory with the filename "robots.txt".
The URL for your robots.txt file should be:
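Using "www.yourdomain.com" as a placeholder for your own domain:

```
http://www.yourdomain.com/robots.txt
```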
This file will now become your index of files that may not be crawled by spiders. Let's say for example you have a file called "filename.html" on your website which you'd rather did not appear in search engines. You may instruct search engines to stay away from this file by adding the following line to your "robots.txt" file.
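A Disallow rule must sit beneath a "User-agent" line; "User-agent: *" applies the rule to all robots, and paths begin with a slash. A minimal robots.txt excluding that one file would be:

```
User-agent: *
Disallow: /filename.html
```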
Now, let's say you have 2 files which you wish to exclude, "filename1.html" and "filename2.jpg". You can use the following:
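Each excluded file gets its own Disallow line:

```
User-agent: *
Disallow: /filename1.html
Disallow: /filename2.jpg
```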
Furthermore, you can choose to block entire directories by appending a "trailing slash" to the folder name. The following will tell ALL robots to exclude ALL files located in the "directoryname" folder, while still excluding the aforementioned files:
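Combining the directory rule with the two file rules above:

```
User-agent: *
Disallow: /filename1.html
Disallow: /filename2.jpg
Disallow: /directoryname/
```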
Instructing Specific Engines
Should you wish to instruct only specific engines to exclude certain files, you can do so by specifying the "User Agent" of the robot in question. The "User Agent" value varies by spider / robot. Examples are "Googlebot", which is the User Agent used by Google, and "Slurp", which is the identifying User Agent of Inktomi. Here is an example which will force Google ONLY to exclude all aforementioned files and directories, while instructing Inktomi to exclude two separate files named "slurp.html" and "imac.jpg":
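Each engine gets its own "User-agent" block followed by its Disallow rules, with a blank line separating the blocks:

```
User-agent: Googlebot
Disallow: /filename1.html
Disallow: /filename2.jpg
Disallow: /directoryname/

User-agent: Slurp
Disallow: /slurp.html
Disallow: /imac.jpg
```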
There are several important caveats concerning the use of the "Robots Exclusion Protocol". Firstly, the "robots.txt" filename is case sensitive and must not contain uppercase letters, and the file and directory paths you list are case sensitive as well. Secondly, not all robots and spiders honour the protocol. Lastly, be aware that your robots.txt file is visible to everybody, so no sensitive information should be specified in it. While legitimate robots generally do adhere to the protocol, there is technically nothing to prevent a robot from fetching the files listed, and some hackers actually read this file to find links to administration areas or database files. Do not list anything sensitive here unless it is also password protected through other means.
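Before deploying a robots.txt file, you can sanity-check your rules locally. One way (a sketch, using Python's built-in urllib.robotparser module and the example rules from above) is to feed the rules straight into a parser, with no website needed:

```python
from urllib.robotparser import RobotFileParser

# Example rules from this article, supplied directly as lines of text.
rules = """\
User-agent: *
Disallow: /filename1.html
Disallow: /directoryname/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, url) reports whether the rules permit a fetch.
print(rp.can_fetch("Googlebot", "/filename1.html"))    # blocked -> False
print(rp.can_fetch("Googlebot", "/index.html"))        # allowed -> True
print(rp.can_fetch("*", "/directoryname/page.html"))   # blocked -> False
```

Because the "User-agent: *" block applies to all robots, Googlebot is refused "/filename1.html" but may still fetch pages that no Disallow line matches.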
robots.txt generator tools:
http://www.yellowpipe.com/yis/tools/robots.txt/
http://tools.seobook.com/robots-txt/generator/
http://www.internetmarketingninjas.com/seo-tools/robots-txt-generator/
http://www.1pagedesign.com/robots.txt_generator/