Using the robots.txt file

The robots.txt file is a set of instructions for visiting robots (spiders) that index the content of your web site. For spiders that obey the file, it maps out what they can and cannot index. The file must be placed in the root directory of your web site, so the URL path of your robots.txt file looks like this:

/robots.txt


In a nutshell, when a robot visits a web site, say http://www.example.com/, it first checks for http://www.example.com/robots.txt. If it finds this document, it analyses its contents for records like:

User-agent: *
Disallow:

Definition of the above robots.txt file:

User-agent: *


The asterisk (*) is a wildcard, a special value that means any robot.

Disallow:


The Disallow: line with no value after it (not even a forward slash) tells robots that they can index the entire site.

An empty value indicates that all URLs can be retrieved. At least one Disallow field must be present in every record.

An empty "/robots.txt" file has no explicit associated semantics; it will be treated as if it were not present, i.e. all robots will consider themselves welcome.


A Disallow: line with a path, on the other hand, tells robots to keep out of that location. If you have a line that looks like this:

Disallow: /private/

it tells robots that they may not index anything under the /private/ directory.
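
You can check how a compliant robot interprets such rules with a few lines of Python. The sketch below uses the standard library's urllib.robotparser module; the robot name "AnyBot" and the example.com URLs are placeholders, not anything defined by the protocol.

from urllib.robotparser import RobotFileParser

# The rules discussed above: one record that applies to every robot.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# A compliant robot asks before fetching each URL.
print(rp.can_fetch("AnyBot", "http://www.example.com/private/page.html"))  # False: /private/ is off limits
print(rp.can_fetch("AnyBot", "http://www.example.com/index.html"))         # True: everything else may be indexed

Changing the rule to an empty "Disallow:" would make both calls return True, matching the allow-everything record described earlier.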


Summarizing the Robots Exclusion Protocol - robots.txt file

To allow all robots complete access:

User-agent: *
Disallow:


To exclude all robots from the server:

User-agent: *
Disallow: /


To exclude all robots from parts of a server:

User-agent: *
Disallow: /private/
Disallow: /images-upload/
Disallow: /test/


To exclude a single robot from the server:

User-agent: NamedBot
Disallow: /


To exclude a single robot from parts of a server:

User-agent: NamedBot
Disallow: /private/
Disallow: /images-upload/
Disallow: /test/
Disallow: /private.html
Disallow: /email.html
Disallow: /error.html
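
The same kind of check works for a record aimed at a single robot. In the sketch below, NamedBot is the placeholder robot name from the recipe above, and the URLs are again illustrative.

from urllib.robotparser import RobotFileParser

# A trimmed version of the "exclude a single robot from parts of a server" record.
rules = [
    "User-agent: NamedBot",
    "Disallow: /private/",
    "Disallow: /email.html",
]

rp = RobotFileParser()
rp.parse(rules)

# The named robot must stay out of the listed paths...
print(rp.can_fetch("NamedBot", "http://www.example.com/email.html"))  # False
# ...but a robot the record does not name is unaffected.
print(rp.can_fetch("OtherBot", "http://www.example.com/email.html"))  # True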

 

Example of robots.txt

# robots.txt for http://www.yoursite.com/
# Last modified: 1999-12-22T08:00:00-0600

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
Disallow: /temp/
Disallow: /myfolder/

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /
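
To see how the three records interact, you can feed the example file to the same parser. Note that www.yoursite.com is just the placeholder host from the comment above, and SomeOtherBot is an invented name standing in for any crawler the file does not mention.

from urllib.robotparser import RobotFileParser

# The example robots.txt above, line by line.
rules = [
    "User-agent: googlebot",
    "User-agent: slurp",
    "User-agent: msnbot",
    "User-agent: teoma",
    "Disallow: /temp/",
    "Disallow: /myfolder/",
    "",
    "User-agent: Mediapartners-Google*",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# The four named crawlers are only kept out of /temp/ and /myfolder/...
print(rp.can_fetch("googlebot", "http://www.yoursite.com/index.html"))     # True
print(rp.can_fetch("googlebot", "http://www.yoursite.com/temp/page.html")) # False
# ...while a crawler the file does not name falls through to the final record and is excluded entirely.
print(rp.can_fetch("SomeOtherBot", "http://www.yoursite.com/index.html"))  # False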



Check also:
Using Apache to Stop bad Robots and Spambots
The Robots META tags

More Links:
The Web Robots Pages
Robots Database

Robot Control Code Generation Tool