Power User Monday Tip of the Week
Keep Bad Bots Out of Your Site
Whether or not you followed our Create a Website for Free tutorial, your site or blog may be hit by a wide variety of spammers or bandwidth-wasting bots. Fortunately, blocking this scourge is far easier than you think.
As you know, there are plenty of comment spam bots, referer spam bots, and email-harvesting bots out there. Most of them can be easily blocked via .htaccess. What we're going to do today is set up a file named .htaccess that flags certain visitors by setting an environment variable (env value) named "spammer". We will then tell .htaccess to block any visitor that has the "spammer" env value set. If you already have a .htaccess file in the main directory of your web site, then you can skip the next paragraph.
First, you'll need a text editor, like Smultron. If you are not a Mac user, see this list. Once you have entered your data into the new file (we'll do that in a minute), you should save it as ".htaccess". Now, the dot in front of the file name will render the file invisible, but there are plenty of ways to deal with that. See Managing Invisible Files for more info. Now, upload this file to your server. .htaccess files can be placed in any directory on your server. Each .htaccess file protects both the directory that it's in and any subdirectories below that. So, a .htaccess file in your main directory will protect your entire site.
Now for the important part. Setting up your .htaccess file to block spam referers, spam user agents, spam proxies, and spam IPs is actually very easy. To start off, this is what my .htaccess file looks like:
SetEnvIfNoCase Via pinappleproxy spammer=yes
SetEnvIfNoCase User-Agent "lwp-trivial/1.38" spammer=yes
SetEnvIfNoCase Referer spamdomain.com spammer=yes
Order allow,deny
allow from all
deny from env=spammer
deny from 12.163.72.13
As you can see, we have singled out the proxy named "pinappleproxy", the user agent named "lwp-trivial/1.38", and the referer "spamdomain.com". We have then given each of those an env value of "spammer". We have set the .htaccess to check the allow rules before the deny rules (so that no innocent bystanders get caught), which lets in every visitor unless its "spammer" env value is set or its IP address is 12.163.72.13. Did I lose anyone? Let's break it down.
If you know what proxy your annoying spammer is coming from, and you can see that no legitimate users use the same proxy, then use the following string:
SetEnvIfNoCase Via evilspamproxy spammer=yes
If you know what user agent your annoying spammer is visiting with, and you can see that no legitimate users visit you with the same user agent, then use the following string:
SetEnvIfNoCase User-Agent "evil spam user agent" spammer=yes
If you know what referer your annoying spammer is coming from, and you can see that no legitimate users come from the same referer, then use the following string:
SetEnvIfNoCase Referer spamdomain.com spammer=yes
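One thing worth knowing: the pattern in each SetEnvIfNoCase rule is treated as a regular expression, so an unescaped dot matches any character and the pattern can match anywhere inside the header. If you want to be a bit stricter, a sketch along these lines (using the same made-up referer and the user agent from my file) should do it:

```apache
# Escape the dots so "." only matches a literal dot
SetEnvIfNoCase Referer "spamdomain\.com" spammer=yes
# Anchor with ^ so the user agent must start with this string
SetEnvIfNoCase User-Agent "^lwp-trivial/1\.38" spammer=yes
```

The looser versions above will still work; the stricter ones just lower the odds of catching a look-alike by accident.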
Now, this is the important part, and it must follow your SetEnvIfNoCase rules:
Order allow,deny
allow from all
deny from env=spammer
If you also know what IP address your annoying spammer is visiting with, and you can see that no legitimate users visit with the same IP, then add the following string right after "deny from env=spammer":
deny from 12.163.72.13
Keep in mind that spammers often hide behind open proxies, hijacked computers, and dynamically assigned addresses, so the IP you see may not really be "theirs". If you do block spammer IPs, you could be blocking legitimate users as well. Please limit your IP blocks to 24 hours.
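A note for newer servers: Order, allow, and deny are the Apache 2.2 way of doing things. On Apache 2.4 they only work through the mod_access_compat module, and the Require directives are the preferred replacement. Assuming your host runs 2.4 and lets .htaccess files set access rules (check with them if you are not sure), a rough equivalent of the allow/deny lines would be:

```apache
# Apache 2.4 sketch: replaces the Order/allow/deny lines,
# NOT the SetEnvIfNoCase lines, which stay as they are
<RequireAll>
    # Let everyone in...
    Require all granted
    # ...except visitors flagged by a SetEnvIfNoCase rule
    Require not env spammer
    # ...and this one IP address
    Require not ip 12.163.72.13
</RequireAll>
```

It's best not to mix the two styles in the same file; pick whichever one matches your server.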
If that was a little bit too much for you, or you're looking for something that's a bit more automated, you're in luck. Bad Behavior is a PHP script designed to stop bad bots from visiting your site. In short, Bad Behavior blocks any bots that behave badly and any humans that act like badly behaving bots. As well as being a general PHP script, which can be run from any other PHP script or page, Bad Behavior is also available as a WordPress, MediaWiki, and Geeklog plugin. For more information on installing and using Bad Behavior, see this guide.
A Note About Robots.txt:
Legitimate robots can easily be controlled by creating a robots.txt file. This file should be created with a plain text editor and uploaded only to the root directory of your site. The syntax is very simple. The following example tells all bots to stay away from the "this" directory and the "that.html" file in the "other" directory:
User-agent: *
Disallow: /this/
Disallow: /other/that.html
Unfortunately, most bad bots completely ignore robots.txt.
I hope you enjoyed today's Power User Monday Tip at least as much as I enjoyed writing it! Y'all come back now! Y'hear? ^_^
MacMerc.com is not responsible for lost or damaged websites.