Using Apache to Stop Bad Robots and Spambots

The first step in our war against the spiders is identifying them. There are plenty of ways to find out who the bad bots are, from manually searching your access_log to working from a maintained list and picking the ones you want to exclude.

At the end of the day, what matters is getting the robot's name - its User-Agent string - not how you get it. That said, here's a method I like that targets the worst offenders.

Add a line like this to your robots.txt file:

Disallow: /email-addresses/

where 'email-addresses' is not a real directory on your site. Wait a decent amount of time (a week to a month), then go through your access_log and pick out the User-Agent strings that requested the /email-addresses/ directory. Since that path appears nowhere except as a Disallow line in your robots.txt, anything asking for it has read the file and deliberately gone where it was told not to - exactly the kind of robot you want to shut out.
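You can do the picking by hand, or let a small script do it. Here's a minimal sketch in Python - it assumes Apache's "combined" log format and a log at /var/log/apache2/access.log, so adjust the path and the honeypot directory to match your own setup.

#!/usr/bin/env python3
"""List the User-Agents that requested the robots.txt honeypot directory."""
import re
from collections import Counter

ACCESS_LOG = "/var/log/apache2/access.log"  # adjust to your server
HONEYPOT = "/email-addresses/"              # the fake Disallow'd directory

# Matches Apache "combined" format: the quoted request line, then the
# final quoted field on the line, which is the User-Agent.
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*".*"(?P<agent>[^"]*)"$')

hits = Counter()
with open(ACCESS_LOG, errors="replace") as log:
    for line in log:
        m = LINE_RE.search(line)
        if m and HONEYPOT in m.group("path"):
            hits[m.group("agent")] += 1

# Worst offenders first - these are the User-Agents to exclude.
for agent, count in hits.most_common():
    print(f"{count:6d}  {agent}")

Once you have your list of offenders you know exactly which User-Agents to go after, and you can also be kinder to the robots you do want visiting, as in the robots.txt examples below.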

This first example robots.txt lets the named crawlers in, asks each of them for a 10-second pause between page requests, and shuts every other robot out completely:

# Crawl-delay is the time in seconds between page requests
User-agent: Googlebot
Crawl-delay: 10
Disallow:

User-agent: MSNBot
Crawl-delay: 10
Disallow:

User-agent: Slurp
Crawl-delay: 10
Disallow:

User-agent: Teoma
Crawl-delay: 10
Disallow:

User-agent: Gigabot
Crawl-delay: 10
Disallow:

User-agent: Scrubby
Crawl-delay: 10
Disallow:

User-agent: Robozilla
Crawl-delay: 10
Disallow:

User-agent: KBroker
Crawl-delay: 10
Disallow:

User-agent: Ultraseek
Crawl-delay: 10
Disallow:

User-agent: *
Crawl-delay: 10
Disallow: /

If shutting every unknown robot out completely is too harsh for your site, a gentler catch-all block slows the other crawlers down and keeps them out of specific folders and files (including the honeypot directory) instead:

# All other robots: 20 seconds between page requests

User-agent: *
Crawl-delay: 20

# Disallow folders
Disallow: /email-addresses/
Disallow: /mydata/

# Disallow files
Disallow: /test.php
Disallow: /test.htm
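
Before you deploy either version, it's worth checking that the rules do what you think they do. Python's standard urllib.robotparser module can tell you how a given User-Agent would be treated. A minimal sketch, testing an abbreviated copy of the stricter example above ("SomeRandomBot" is just a made-up stand-in for an unlisted crawler); you can point set_url() and read() at your live robots.txt instead of parsing a string:

#!/usr/bin/env python3
"""Check how a robots.txt treats different User-Agents."""
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the stricter example above.
RULES = """\
User-agent: Googlebot
Crawl-delay: 10
Disallow:

User-agent: *
Crawl-delay: 10
Disallow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

for agent in ("Googlebot", "SomeRandomBot"):
    for path in ("/", "/email-addresses/"):
        print(f"{agent:14s} {path:18s} allowed: {rp.can_fetch(agent, path)}")
    # crawl_delay() needs Python 3.6 or newer
    print(f"{agent:14s} Crawl-delay: {rp.crawl_delay(agent)}")

Keep in mind that robots.txt is purely advisory - the polite crawlers listed above will respect it, but the robots you caught with the honeypot trick are exactly the ones that won't, which is why the real blocking has to happen in Apache itself.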


Links:
The Web Robots Pages
Robots Database