The first step in our war against the Spiders is to identify them. There are many techniques to find out who the bad bots are, from manually searching your access_logs to using a maintained list and picking which ones you want to exclude.
At the end of the day it's getting the robot's name - its User-Agent string - that matters, not how you get it. That said, here's a method I like that targets the worst offenders.
Add a line like this to your robots.txt file:
Disallow: /email-addresses/
where 'email-addresses' is not a real directory. Wait a decent amount of time (a week to a month), then go through your access_log and pick out the User-Agent strings that requested the /email-addresses/ directory. Since the directory doesn't exist and is only ever mentioned in robots.txt, any robot that requests it has read your rules and deliberately ignored them - exactly the kind of Spider you want to shut out.
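If your server keeps a combined-format access_log, a few lines of Python are enough to pull those names out. This is only a rough sketch - the log file name, its location and its exact format are assumptions, so adjust them to match your setup:

# Minimal sketch: list the User-Agents that requested the trap directory,
# assuming an Apache/Nginx "combined" log where the User-Agent is the last
# double-quoted field and the file is named access_log.
import re
from collections import Counter

TRAP = "/email-addresses/"
agents = Counter()

with open("access_log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if TRAP not in line:
            continue
        # Quoted fields in the combined format: "request", "referer", "user-agent"
        quoted = re.findall(r'"([^"]*)"', line)
        if len(quoted) >= 3:
            agents[quoted[-1]] += 1

for agent, hits in agents.most_common():
    print(f"{hits:5d}  {agent}")

Run it from the directory holding the log and you get a ranked list of offenders, ready to be excluded in robots.txt (or blocked outright).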
Once you know who is misbehaving, spell the rules out in robots.txt. The example below gives the search robots you want to keep full access with a 10 second pause between page requests, while every other robot gets a 20 second pause and is kept out of the folders and files you don't want crawled.
# For Googlebot, 10 is the time in seconds between page requests
User-agent: Googlebot
Crawl-delay: 10
Disallow:
User-agent: MSNBot
Crawl-delay: 10
Disallow:
User-agent: Slurp
Crawl-delay: 10
Disallow:
User-agent: Teoma
Crawl-delay: 10
Disallow:
User-agent: Gigabot
Crawl-delay: 10
Disallow:
User-agent: Scrubby
Crawl-delay: 10
Disallow:
User-agent: Robozilla
Crawl-delay: 10
Disallow:
User-agent: KBroker
Crawl-delay: 10
Disallow:
User-agent: Ultraseek
Crawl-delay: 10
Disallow:
# For all other robots, 20 is the time in seconds between page requests
# (to shut them out completely, use a single "Disallow: /" here instead)
User-agent: *
Crawl-delay: 20
# Disallow folders
Disallow: /email-addresses/
Disallow: /mydata/
# Disallow files
Disallow: /test.php
Disallow: /test.htm
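If you want to double-check how rules like these will actually be read, Python's standard urllib.robotparser module can answer questions about them. A minimal sketch (Python 3.6+ for crawl_delay; the trimmed-down rules, the bot name SomeBadBot and the example.com URLs are made up for illustration):

# Feed a trimmed-down copy of the rules above to the standard parser and
# see how different robots are treated.
from urllib import robotparser

rules = """\
User-agent: Googlebot
Crawl-delay: 10
Disallow:

User-agent: *
Crawl-delay: 20
Disallow: /email-addresses/
Disallow: /mydata/
Disallow: /test.php
Disallow: /test.htm
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot gets the run of the site, 10 seconds between requests.
print(rp.can_fetch("Googlebot", "http://example.com/mydata/"))    # True
print(rp.crawl_delay("Googlebot"))                                 # 10

# Everyone else is slowed down and kept away from the trap directory.
print(rp.can_fetch("SomeBadBot", "http://example.com/email-addresses/"))  # False
print(rp.crawl_delay("SomeBadBot"))                                        # 20

Remember that robots.txt is purely advisory - the whole point of the trap above is that the worst offenders don't honour it - so treat this file as a way to manage the polite robots, not as a security measure.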
Links:
The Web Robots Pages
Robots Database