Block spam harvesters

On my site, I have a few “special” pages whose sole purpose is to ban bots that ignore robots.txt (or worse, use it as a hint for where “the good stuff” is!). Here’s how I do that:

robots.txt is a file that tells bots and search engines which pages they may crawl and which are off limits. Some bots, however, never bother checking this file. Worse than that, others have been known to deliberately hit the Disallowed pages, on the assumption that anything blocked must have good content! (Of course, this is false: anything truly private will usually be secured by authentication, but since when has logic stopped a spammer?)

On my site, I have several hidden pages. To prevent well-behaved bots from being blocked, they’re all listed in robots.txt as:

User-agent: *
Disallow: /path/to/page

I have a special page there that actually blocks anyone who accesses it. The first trap is at /cgi-bin/guestbook.cgi, which has no hyperlinks pointing to it; this catches spam bots that are hard-coded to probe a common location for an exploitable script. I also have a link on my main page (with no link text, so a user won’t see it) to /site/S.P.A.M.T.R.A.P/. Since a user with a screen reader could potentially reach that page, there’s a doorway page there that warns the user not to go any further. Beyond it are two more links to other files (symlinked to the original .cgi) which will block an IP.
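As an illustration, the invisible link on the main page can be nothing more than an anchor with an empty body (this markup is a sketch of the idea; the actual page may differ):

```html
<!-- No link text: human visitors never see or click this,
     but harvesters that follow every href will. -->
<a href="/site/S.P.A.M.T.R.A.P/"></a>
```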
So how does it actually work?
The page runs a Perl script that prepends a line of the form SetEnvIf Remote_Addr ^<IP ADDRESS>$ denied to the .htaccess file for each blocked address, along with a comment recording the date and the offender’s user agent.
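For example, if the address 192.0.2.1 (a documentation placeholder here, not a real offender) trips the wire, the two prepended lines look like this:

```apache
SetEnvIf Remote_Addr ^192\.0\.2\.1$ denied
#Fri Jan 26 15:49:14 2007 Python-urllib/1.16
```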

I also have the following in my .htaccess file, which tells Apache to actually use those environment variables:

<Files *>
order deny,allow
deny from env=denied
allow from env=allowed
</Files>
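On Apache 2.4 and later, where the old order/deny/allow directives live in the optional mod_access_compat module, the same effect can be had with Require directives. A sketch of the equivalent block (assuming the same denied environment variable):

```apache
<Files "*">
    <RequireAll>
        Require all granted
        Require not env denied
    </RequireAll>
</Files>
```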

The actual Perl script looks like this:

use strict;
use warnings;
use Socket;

print "Content-type: text/html\n\n";

my $sendmail     = '/usr/sbin/sendmail -i -t';
my $htaccess     = "/home/mikeage/public_html/.htaccess";
my $domain       = "";
my $warning_to   = "ip_ban";
my $warning_from = "ip_ban";

my $date         = scalar localtime(time);
my $remote_agent = $ENV{'HTTP_USER_AGENT'};
my $remote_addr  = $ENV{'REMOTE_ADDR'};
my $inetaddr     = inet_aton($remote_addr);
my $remote_host  = gethostbyaddr($inetaddr, AF_INET);

# Strip any embedded newlines from the user agent; a multi-line
# entry in .htaccess is a syntax error and causes a 500.
$remote_agent =~ s/[\r\n]+/ /g;

my $abuse1 = 'abuse@' . $remote_addr;
my $abuse2 = 'abuse@' . $remote_host;

# Backslash-escape the dots so they match literally in the regex
(my $escaped_addr = $remote_addr) =~ s/\./\\./g;

(-w $htaccess) or do {
    print "Not writable!";
    die $!;
};

open(HTACCESS, "+<", $htaccess) || die $!;
my @contents = <HTACCESS>;
unshift(@contents, "SetEnvIf Remote_Addr ^$escaped_addr\$ denied\n#$date $remote_agent\n");
seek(HTACCESS, 0, 0);    # rewind before writing the file back out
print HTACCESS @contents;
close(HTACCESS);

print <<__WARNING__;
<title>Die Spammer!</title>
<p>You have triggered a trip-wire. This script exists solely to catch people doing things
they shouldn't be.</p>
<p>As a result, your IP address ($remote_addr) has been blocked from this entire site. You will
no longer be able to browse $domain. In addition, I have been alerted to your presence and will
be reviewing the records for possible action with your service provider.</p>
<p>If you have stumbled here by accident, you can email me at ip_ban at this domain to unblock
yourself. Be sure to paste in the network address in parentheses above so that I can unblock you.
If you don't mail me, you will <strong>NOT</strong> be able to get back to this screen again -
you are <strong>BANNED</strong>.</p>
<p>However, should you feel like spamming some people, try the following email addresses:<br />
<a href="mailto:$abuse1">$abuse1</a> or <a href="mailto:$abuse2">$abuse2</a></p>
__WARNING__

open(MAIL, "|$sendmail") || die $!;
print MAIL "To: $warning_to\@$domain\n";
print MAIL "From: $warning_from\@$domain\n";
print MAIL "Subject: [Alert] Bot Blocked\n\n";
print MAIL "IP address $remote_addr ($remote_host) has been blocked from accessing $domain
because it called $0 on $date. The agent was $remote_agent.\n\n";
close(MAIL);

Note that I do return two email addresses: abuse@spammers.domain and abuse@spammers.ip. Maybe they’ll wind up reporting themselves!

9 responses to “Block spam harvesters”

  1. Hi,
    I installed your script and a bot got trapped.

    But the result is my server showing a 500 error.
    When I check the .htaccess, the script created three lines. The first line denies the IP; the second shows the time it occurred and starts with #.

    The problem is the third line: it shows the user agent but does not start with #.

    How do I fix this problem?


  2. Hrmm… there should only be two lines, that look something like this:

    SetEnvIf Remote_Addr ^$ denied
    #Fri Jan 26 15:49:14 2007 Python-urllib/1.16

    Perhaps the user agent included an embedded newline?

    The 500 is probably related, as a bad entry in .htaccess can cause a 500.

    If you visit the page yourself, does it successfully block you?

  3. Well, there’s no way to know how much spam isn’t sent to me. I can tell you that I average about 10 sites a week blocked; usually that’s one or two search bots (mostly from China) that ignore robots.txt, and the rest appear to be spam-harvesting PCs, probably virus-infested home machines judging by their IP addresses.

  4. We need help to reduce our spam. I had in mind using something like your “ixitan” symbols on our website, as it would certainly reduce the spam bots. How do we get it? What would it cost?

    I’m reluctant to screen out all that appears to be spam as we wish to be able to communicate with folks in need — but that may be wishful thinking. We don’t know a whole lot about what we need, but we know what we don’t need. Any help?
