Block spam harvesters

On my site, I have a few “special” pages whose only purpose is to ban bots that ignore robots.txt (or worse, use it as a hint for where “the good stuff” is!). Here’s how I do that:

robots.txt is a file that tells bots and search engines which pages they may crawl and which are off limits. Some bots, however, do not bother checking this file first. Worse, others have been known to specifically hit the Disallow pages, on the assumption that anything blocked must have good content! (Of course, this is false: anything private will usually be secured by authentication. But since when has logic stopped a spammer?)

On my site, I have several hidden pages. To keep well-behaved bots from being blocked, they’re all listed in robots.txt as

User-agent: *
Disallow: /path/to/page

At those locations I have a special page that actually blocks anyone who accesses it. The first is at /cgi-bin/guestbook.cgi, and no hyperlinks point to it. It is there to catch spam bots that are hard-coded to look for a common location of an exploitable script. I also have a link on my main page (with no link text, so a user won’t see it) to /site/S.P.A.M.T.R.A.P/. Since someone using a screen reader could potentially follow that link, there’s a doorway page there that warns the user not to go any further. Beyond it are two more links to other files (symlinked to the original .cgi), which will block an IP.

So how does it actually work?

The page is a Perl script that adds a line to the .htaccess file for each blocked address, of the form: SetEnvIf Remote_Addr ^<IP ADDRESS>$ denied
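Put together, a robots.txt covering the two trap locations named above might look like the following. This is only a sketch built from the paths mentioned in this post; the real file may list more entries.

```text
User-agent: *
Disallow: /cgi-bin/guestbook.cgi
Disallow: /site/S.P.A.M.T.R.A.P/
```

A well-behaved crawler that reads this file never requests either path, so only clients that ignore it (or mine it for targets) end up at the trap.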

I also have

<Files *>
order deny,allow
deny from env=denied
allow from env=allowed
</Files>

in my .htaccess file, which tells Apache to actually use the environment variables!
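The order/deny/allow directives are Apache 2.2-era syntax; on Apache 2.4 they are only available through mod_access_compat. A rough 2.4 equivalent is sketched below. It assumes mod_authz_core is loaded and that the denied and allowed variables are still set with SetEnvIf. It is meant to mirror the same logic (an allowed client always gets in, a denied one is refused, everyone else is let through), not to be a drop-in tested configuration.

```apache
<RequireAny>
    # Explicitly whitelisted clients are always let in
    Require env allowed
    <RequireAll>
        # Everyone else is allowed unless flagged by a SetEnvIf line
        Require all granted
        Require not env denied
    </RequireAll>
</RequireAny>
```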

The actual Perl script looks as follows:

#!/usr/bin/perl
use Socket;
print "Content-type: text/html\n\n";
$sendmail = '/usr/sbin/sendmail -i -t';
$htaccess = "/home/mikeage/public_html/.htaccess";
$domain = "mikeage.net";
$warning_to = "ip_ban";
$warning_from = "ip_ban";
$date = scalar localtime(time);
$remote_agent = $ENV{'HTTP_USER_AGENT'} || '';
$remote_agent =~ s/[\r\n]+/ /g; # keep the .htaccess comment on a single line
$remote_addr = $ENV{'REMOTE_ADDR'};
$inetaddr = inet_aton("$remote_addr");
$remote_host = gethostbyaddr($inetaddr, AF_INET);
$ip_pattern = quotemeta($remote_addr); # escape the dots for the SetEnvIf regex
$abuse1="abuse@" . $remote_addr;
$abuse2="abuse@" . $remote_host;
(-w $htaccess) or do {
print "Not writable!";
die $!;
};
open(HTACCESS,"+< $htaccess") || die $!;
flock(HTACCESS,2); # exclusive lock while the file is rewritten
seek(HTACCESS,0,0);
@contents = <HTACCESS>;
unshift(@contents,"SetEnvIf Remote_Addr ^$ip_pattern\$ denied\n#$date $remote_agent\n");
seek(HTACCESS,0,0);
print HTACCESS @contents;
truncate(HTACCESS,tell(HTACCESS));
close(HTACCESS);
print <<__WARNING__;
<html>
<head>
<title>Die Spammer!</title>
</head>
<body>
<p>You have triggered a trip-wire. This script exists solely to catch people doing things
they shouldn't be.</p>
<p>As a result, your IP address ($remote_addr) has been blocked from this entire site. You will
no longer be able to browse $domain. In addition, I have been alerted to your presence and will
be reviewing the records for possible action with your service provider.</p>
<p>If you have stumbled here by accident, you can email me at ip_ban at this domain to unblock
yourself. Be sure to paste in the network address in parentheses above so that I can unblock you.
If you don't mail me, you will <strong>NOT</strong> be able to get back to this screen again -
you are <strong>BANNED</strong>.</p>
<p>However, should you feel like spamming some people, try the following email addresses:<br />
<a href="mailto:$abuse1">$abuse1</a> or <a href="mailto:$abuse2">$abuse2</a></p>
</body>
</html>
__WARNING__
open (MAIL, "|$sendmail");
print MAIL "To: $warning_to\@$domain\n";
print MAIL "From: $warning_from\@$domain\n";
print MAIL "Subject: \[Alert\] Bot Blocked\n\n";
print MAIL "IP address $remote_addr ($remote_host) has been blocked from accessing $domain
because it called $0 on $date. The agent was $remote_agent.\n\n";
close (MAIL);
exit;

Note that I do return two email addresses: abuse@ followed by the visitor’s own IP address, and abuse@ followed by their hostname. Maybe they’ll wind up reporting themselves!
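Because the script writes request data (the IP address and the User-Agent header) into .htaccess, both values are worth normalizing first. An unescaped dot in the pattern matches any character, and a newline in the User-Agent would spill onto a line of its own, which Apache then tries to parse as a directive. Here is a minimal sketch of that step as a standalone helper. The name build_ban_entry is hypothetical and not part of the original script, and the sketch assumes IPv4 addresses only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): build the two lines
# that get prepended to .htaccess for one blocked client.
sub build_ban_entry {
    my ($addr, $agent, $date) = @_;

    # Refuse anything that is not a plain dotted-quad IPv4 address.
    return unless defined $addr && $addr =~ /\A\d{1,3}(?:\.\d{1,3}){3}\z/;

    # Escape the dots so they match literally in the SetEnvIf regex.
    (my $pattern = $addr) =~ s/\./\\./g;

    # Collapse control characters (including CR/LF) into a space so the
    # comment line cannot wrap onto a new, uncommented line.
    $agent = '' unless defined $agent;
    $agent =~ s/[[:cntrl:]]+/ /g;

    return "SetEnvIf Remote_Addr ^$pattern\$ denied\n# $date $agent\n";
}

# Usage: a User-Agent with an embedded newline stays inside the comment.
print build_ban_entry('67.43.156.66', "Python-urllib/1.16\nevil",
                      'Fri Jan 26 15:49:14 2007');
```

The helper returns nothing for an address that does not look like IPv4, so the caller can skip writing to .htaccess instead of adding a malformed line.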

‍‍כ״ח תשרי תשס״ז – October 19, 2006

Mike Miller

Uncategorized

Spam, Technical, Tip

9 responses to “Block spam harvesters”

dennyhalim.com says:

‍‍י״א שבט תשס״ז – January 30, 2007 at 05:08

hi,
i installed your script and some bot gets trapped.

but, the result is my server showing error 500.
when i check the htaccess, the script create 3 line. first line is to deny the ip. second line showing the time it occured and started with #.

the problem is with the third line. it show the user agent but do not start with #

how do i fix this problem?

tia
dny

Mike Miller says:

‍‍י״א שבט תשס״ז – January 30, 2007 at 05:15
Hrmm… there should only be two lines, that look something like this:
```
SetEnvIf Remote_Addr ^67.43.156.66$ denied
#Fri Jan 26 15:49:14 2007 Python-urllib/1.16
```
Perhaps the user agent included an embedded newline?

The 500 is probably related, as a bad entry in .htaccess can cause a 500.

If you visit the page yourself, does it successfully block you?
dennyhalim.com says:

‍‍ד׳ אדר ב’ תשס״ז – February 22, 2007 at 10:18

i fix it by adding a # just before $remote_agent

dennyhalim.com says:

‍‍ד׳ אדר ב’ תשס״ז – February 22, 2007 at 10:24

btw. tnx for making this great yet simple script free.

i symlink formmail.cgi to this script and seems like it catch quite some bad bot
even that i never have any link to formmail.cgi anywhere.

Mike Miller says:

‍‍ד׳ אדר ב’ תשס״ז – February 22, 2007 at 12:04

Glad it’s now working.

A lot of bots are hardcoded to look for a formmail.cgi; in the old days, I also had mine symlinked to guestbook.cgi.

BD says:

‍‍י״ד ניסן תשס״ז – April 1, 2007 at 23:42

After months of using it, are you finding this technique successful?

Mike Miller says:

‍‍י״ד ניסן תשס״ז – April 2, 2007 at 06:33

Well, there’s no way to measure how much spam isn’t sent to me. I can tell you that I average about 10 sites a week blocked; usually that’s one or two search bots (mostly from China) that ignore robots.txt, and the rest appear to be spam-harvesting machines, probably virus-infested home PCs based on their IP addresses.

Perry A. Chapdelaine, Sr. says:

‍‍ז׳ תמוז תשס״ח – July 9, 2008 at 23:45

We need help to reduce our spam. I had in mind using something like your “ixitan” symbols on our website, as it would certainly reduce the spam bots. How do we get it? What would it cost?

I’m reluctant to screen out all that appears to be spam as we wish to be able to communicate with folks in need — but that may be wishful thinking. We don’t know a whole lot about what we need, but we know what we don’t need. Any help?

adapsvalm says:

‍‍כ״ב כסלו תשס״ט – December 19, 2008 at 09:05

Hi all!

As a fresh mikeage.net user i only wanted to say hi to everyone else who uses this board 😎

