Defending Against Spam Bots

Chris

Head Coder

19,519

∞

24

June 2005

Defending Against Spam Bots Jul 30, 2009 6:00:17 GMT

(I want to apologize ahead of time for this tutorial being a wall of text. My bad. Also, all code present in this tutorial was written on the spot and not tested. I am discussing theories and examples of implementations, not directly explaining a script.)

Introduction
Spam bots are one of the biggest annoyances on the web. Whether you run a forum or any other kind of site, they cause numerous problems and irritate nearly everyone. You have probably encountered them in your time on the internet. I'm here today to discuss various methods for defending against and defeating spam bots so that you can keep your website safe and clean. I'll also show some basic implementations as examples. All my examples are in PHP, but they should be easy to port to other languages such as Perl or ASP.

Method One: CAPTCHA Images
Let's begin with what a CAPTCHA is. CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. Most people use the term for only one form of this test, though: the image-reading method. (From here on out, "CAPTCHA" will mean this method.) The server generates a random image containing a string of characters, typically random. For example, this string could be "N7GAGAS". However, if the string appeared plainly in the image, spam bots could read it. Thus, most of these images are distorted, which also makes them difficult for people to read.

Below is a sample CAPTCHA image with the characters above:

As you can see, it poses a slight challenge when a person attempts to read it. This brings forth two major (and related) consequences: multiple attempts and user frustration. A user who thinks they entered the right code the first time and submits the form may be quickly proven wrong by your system. There are a few common pitfalls as to what they (or you) might have done wrong:

1. The letter O and the number 0 are easily confused
2. Lower case L, upper case i, and the number 1 are easily confused
3. Upper and lower case letters may not be validated as the same, causing issues
4. V and U may be confusing depending on the font
5. Two V's and a W may appear to be the same

All these pitfalls are easily resolved by you, the programmer. In the code that randomly generates the string, you can include only one character from each confusable group above. Or you can leave all of them out, which may prove to be an even better solution. As for upper and lower case letters, a simple $str = strtolower($str) on both the user's input and the expected string does wonders. Just keep in mind that it is up to you to make sure this method is usable.
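
To make those fixes concrete, here is a minimal, untested-in-production sketch of a string generator that leaves out the confusable characters, plus a comparison that ignores case. The names captcha_random_string and captcha_matches and the exact alphabet are made up for illustration; they are not part of reCAPTCHA or any other library.

```php
<?php
// Alphabet with the confusable characters left out: no O/0, no I/L/1, no U/V/W.
// (This exact character set is an assumption; tune it to the font you render with.)
define("CAPTCHA_ALPHABET", "ABCDEFGHJKMNPQRSTXYZ23456789");

// Build a random string of $length characters from the safe alphabet.
// mt_rand() is fine for a sketch, but it is not a cryptographically secure source.
function captcha_random_string($length = 7){
	$out = "";
	$max = strlen(CAPTCHA_ALPHABET) - 1;
	for($i = 0; $i < $length; $i++){
		$out .= substr(CAPTCHA_ALPHABET, mt_rand(0, $max), 1);
	}
	return $out;
}

// Compare the user's input to the expected string, ignoring case and stray spaces.
function captcha_matches($expected, $input){
	return strtolower(trim($input)) === strtolower($expected);
}
```

You would store the result of captcha_random_string() in the session when drawing the image, then call captcha_matches() with that stored value and the submitted text.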

Anyway, back to the two major consequences. The first, multiple attempts, feeds directly into the second: user frustration. Users tend to dislike CAPTCHAs in general, whether they are easy to read or not, so a particularly difficult one makes them angrier still, and each failed attempt irritates them further. It gets worse if the form data is cleared after each failed submission and has to be reentered, or if password fields are deliberately not refilled for safety reasons.

The CAPTCHA image is one of the most commonly used methods of spam bot protection. It is fairly easy to implement and fairly effective, though it can be just as annoying. Implemented properly, it serves both the users and the owners of the system. Your best bet is to find a CAPTCHA that is easy to read yet still protective. One I recommend is reCAPTCHA. It works well and is free. You just need to register and get an API key.

(I didn't feel it was necessary to show how to implement a CAPTCHA image since reCAPTCHA would be my example and it provides examples in its source.)

Difficulties with this method: Unable to read letters/numbers. Can't differentiate between similar characters.

Method Two: Can You Pick The Puppy?
This method is by far one of the easiest for the user to use. It provides a slight challenge in setting up, but it is quite rewarding and efficient. The basic setup for this system is that you give a user a question, asking them to identify something, and then provide them with images of possible answers. The example I'm using is a list of about 20 different things. (If you implement this method, you'll want to create a list of probably well over 100 or 150 items, if not more. It also may be worthwhile to use a DB table.) The user will be asked randomly to pick one from this list, except only 8 will be shown each time. So, let's take this random list of 20 different things:

<?php
$items = array(
	// SYNTAX: array("IMG FILE NAME", "OBJECT NAME"),
	array("horse.gif",    "Horse"),
	array("dog.gif",      "Dog"),
	array("cat.gif",      "Cat"),
	array("sheep.gif",    "Sheep"),
	array("spider.gif",   "Spider"),
	array("goat.gif",     "Goat"),
	array("dragon.gif",   "Dragon"),
	array("squirrel.gif", "Squirrel"),
	array("bug.gif",      "Bug"),
	array("rocket.gif",   "Rocket Ship"),
	array("smile.gif",    "Smiley Face"),
	array("human.gif",    "Human Being"),
	array("gun.gif",      "Gun"),
	array("knife.gif",    "Knife"),
	array("hole.gif",     "Hole"),
	array("shovel.gif",   "Shovel"),
	array("purse.gif",    "Purse"),
	array("wallet.gif",   "Wallet"),
	array("book.gif",     "Book"),
	array("ball.gif",     "Ball")
);
?>

So now we have our list of 20 items in the $items array. The first thing we need to do is randomly select the 8 we'll use. So, let's shuffle our array and then take the first 8 items. (Please note that in versions of PHP prior to 4.2.0 you need to seed the random number generator. I didn't feel it was necessary to include that in this tutorial.)

<?php
shuffle($items); // Shuffle into a random order
array_splice($items, 8); // Splice off everything after the key 7 (8th key)
?>

$items now holds our list of 8 random items. Now we need to select one item to be our valid item. Also, while we're at it, we should save these 8 selected items into the $_SESSION variable so that our program that generates the images will actually know which one is which.

<?php
session_start(); // Start the session. This may cause an error if you have called it earlier, so don't call it twice in a program.
$_SESSION["selected_items"] = $items; // Save them in the session
$success_key = rand(0,7); // Selects a random whole number between 0 and 7, inclusive. This will be our valid key.
$_SESSION["success_key"] = $success_key;
?>

Now that we have our random item, we'll write some code that outputs a list of the images, another file called "img.php" which will load the files from our folders and output them through a hidden URL, and then write the form that will submit the results. I'm going to skip most of this since you should be able to handle it. I will mention that it is necessary to use some form of an "img.php" program which will create a changing URL for the items' image files. If you didn't, a simple check of the file name would give away which one is the correct answer. Below is the validation code from the submitted form. The field name for the user's answer is called "answer" and was submitted via GET instead of POST. Assume that $items is also defined in this code.

<?php
session_start(); // Load the session first, before any output is sent

// First check for a valid answer
$raw = isset($_GET["answer"]) ? $_GET["answer"] : "";
if(!is_string($raw) || !ctype_digit($raw) || intval($raw) > 7){
	// Anything that is not a plain whole number from 0 to 7 is probably a bot attack.
	// (Comparing intval() against the raw value with != is unreliable, because
	// older PHP versions treat 0 and "abc" as equal.)
	die("Error: Invalid answer given.");
}
$answer = intval($raw); // Safe to convert now

// Compare to session value
$expected = isset($_SESSION["success_key"]) ? $_SESSION["success_key"] : null;
unset($_SESSION["success_key"]); // Each challenge can be answered only once
if($expected === null || $answer !== $expected){
	die("Invalid answer provided.");
}

// Has passed all the checks. Let them access whatever they want from here on out
?>

So now that we've validated their answer, we can let them do whatever they were trying to do before.
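
For the "img.php" program mentioned earlier, a rough sketch of the lookup might be the following. It assumes the form requests images as img.php?slot=0 through img.php?slot=7, so the URL never contains the file name and the image behind each slot changes with every session. The parameter name "slot", the function name image_path_for_slot, and the "images" folder are all hypothetical.

```php
<?php
// Map a requested slot number (0-7) to the image file saved in the session.
// Returns the file path, or null if the request is not valid.
// $selected is the array stored in $_SESSION["selected_items"];
// $imageDir is a hypothetical folder that holds the .gif files.
function image_path_for_slot($selected, $slot, $imageDir){
	if(!is_string($slot) || !ctype_digit($slot)){
		return null; // Not a plain whole number
	}
	$slot = intval($slot);
	if(!is_array($selected) || !isset($selected[$slot])){
		return null; // No such slot in this session
	}
	// basename() guards against a path sneaking into the file name
	return rtrim($imageDir, "/") . "/" . basename($selected[$slot][0]);
}

// How img.php might use it (left commented out so the function can be read on its own):
// session_start();
// $path = image_path_for_slot($_SESSION["selected_items"], $_GET["slot"], "images");
// if($path === null || !is_file($path)){ header("HTTP/1.0 404 Not Found"); exit; }
// header("Content-Type: image/gif");
// header("Cache-Control: no-store"); // Keep browsers from caching slot => image
// readfile($path);
```

The form would then show eight img tags pointing at slots 0 to 7 and ask for the name stored at the success key.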

This is the basic outline of this method. Some people might not like that a guessing bot has a 12.5% chance of getting the right answer. 1/8 is a pretty good chance, especially if they keep guessing: with a fresh random challenge each time, it takes a bot only 8 tries on average to get through. However, this is still an improvement over the 100% chance with no protection at all. To lower the odds, you could increase the number of items displayed from 8 to 16. Keep in mind that a spam bot can collect information on your full list of items more quickly if 16 are shown as opposed to 8.
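
To put numbers on the guessing problem, here is a small sketch that works out the chance of a random guesser getting through at least once. It assumes every attempt is an independent challenge with exactly one correct item; the function name chance_of_passing is made up for this example.

```php
<?php
// Chance that a bot guessing at random gets through at least once in $tries
// attempts, when $shown items are displayed and one of them is correct.
// Assumes a fresh, independent challenge on every attempt.
function chance_of_passing($shown, $tries){
	return 1 - pow(($shown - 1) / $shown, $tries);
}

printf("8 shown, 1 try:    %.1f%%\n", 100 * chance_of_passing(8, 1));
printf("8 shown, 8 tries:  %.1f%%\n", 100 * chance_of_passing(8, 8));
printf("16 shown, 8 tries: %.1f%%\n", 100 * chance_of_passing(16, 8));
```

With 8 items a single guess succeeds 12.5% of the time, and 8 guesses succeed at least once roughly 66% of the time. Showing 16 items brings that down to roughly 40%, which helps but does not stop a persistent bot, so limiting the number of attempts matters as well.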

There are also other barriers to consider. Someone who has just learned English might not have a wide enough vocabulary to identify the item you're asking for. Because of cultural barriers, a person may not know what a "purse" or a "wallet" is, or even a "squirrel". If you include obscure objects, some people will not recognize them. For example, while a picture of Albert Einstein is very distinctive, some may not recognize him. This would cause frustration when they get the answer wrong. On the other hand, with only 8 choices they can often use the power of deduction to work out the correct answer.

Difficulties with this method: Language and culture barriers might prevent objects from being known. Obscure objects may not be known. It is time consuming to compose a list of items. Bots might catch on quickly. Somewhat high probability of guessing the correct answer.

Method Three: Finish This Sentence
This method is fairly similar to Method Two. This method is a "complete the sentence" game, similar to all those annoying standardized tests students have to take. The basic method behind this is to compile a large list of common phrases and then have the user complete the phrase. For example, you could have this setup:

"To be, or not to (BLANK INPUT), that is the question."

In this case, the answer is "be." Once again, it poses a challenge to a spam bot to even attempt to fill in the blank. However, it also backfires. Not everyone in the world knows that quotation, so not everyone is able to answer it. This again causes more frustration for the user, and far more than the previous method, since a user who does not know the answer has nothing to deduce it from. For you, the programmer, there is also difficulty when it comes to creating the list. It is hard to come up with a large list of well-known phrases, so a spammer could easily collect your whole list and bring in a human (at a fairly cheap rate) to complete the phrases.

I did not provide any PHP example code since you could easily use the coding from Method Two to implement this defense.
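
That said, the one part that differs from Method Two is comparing free text instead of a number, so here is a minimal sketch of that comparison. The array layout and the function names normalize_answer and phrase_answer_matches are assumptions for illustration only.

```php
<?php
// Hypothetical phrase list.
// SYNTAX: array("TEXT BEFORE BLANK", "ANSWER", "TEXT AFTER BLANK"),
$phrases = array(
	array("To be, or not to", "be", ", that is the question."),
);

// Strip case, surrounding spaces and stray punctuation so that "Be." still counts.
function normalize_answer($text){
	return strtolower(trim($text, " \t\n\r.,!?\"'"));
}

// True if the user's input matches the stored answer for the given phrase.
function phrase_answer_matches($phrase, $input){
	return normalize_answer($input) === normalize_answer($phrase[1]);
}
```

As in Method Two, you would keep the index of the chosen phrase in $_SESSION and look it up again when the form comes back.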

Difficulties with this method: Language barriers prevent common phrases from being known. Different cultures and different countries have different well-known phrases, thus some may not know the one you use.

Method Four: It Had Better Be Blank
This method is a good method to combine with others. It can be easily applied to just about any form.

First, let's take a look at a basic posting form:

<form method="post" action="message.php" name="sendmessage">
	Username: <input type="text" name="username" maxlength="30" /><br />
	Message: <textarea name="message" style="width: 400px; height: 100px"></textarea><br />
	<input type="submit" name="submitit" value="Send Message" />
</form>

That is the form we're going to edit. Below, I added another text input that is hidden. Take a look at it carefully.

<form method="post" action="message.php" name="sendmessage">
	<input type="text" style="display: none" name="title" maxlength="50" />
	Username: <input type="text" name="username" maxlength="30" /><br />
	Message: <textarea name="message" style="width: 400px; height: 100px"></textarea><br />
	<input type="submit" name="submitit" value="Send Message" />
</form>

As you can clearly see, this new field is hidden. So what's its use? Normally, a spam bot goes through and just fills random details into every blank in a form. If there is ANY content in this field, we can assume it's a spam bot. Why? Because a user can't even -see- the field, let alone fill in information. Our PHP check would look something like this:

<?php
if(isset($_POST["title"]) && strlen($_POST["title"])){
	die("SPAM BOT");
}
?>

Easy enough, right? It's simple to implement and fairly effective as a side measure.

Difficulties with this method: If this is your only reliance for anti-spam, it is very weak.

Method Five: JavaScript Submission
This method is fairly straightforward. It is based on the fact that most spam bots do not run JavaScript. So, to combat them, we require JavaScript to be enabled in order to submit the form in question. Below is an example form:

<form method="post" action="message.php" name="sendmessage">
	Username: <input type="text" name="username" maxlength="30" /><br />
	Message: <textarea name="message" style="width: 400px; height: 100px"></textarea><br />
	<input type="button" name="submitit" value="Send Message" onclick="this.form.submit()" />
</form>

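Now, this form will only submit in a browser if the person clicks on the button with JavaScript enabled. This stops bots that rely on text-based page viewing and simply submit the forms they find. Keep in mind that a bot which sends its POST request straight to message.php never touches the button at all.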
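
Since the button only changes what happens in the browser, the server has no way of telling whether JavaScript actually ran. One possible hardening, sketched here as an assumption rather than as part of the form above, is to have the script copy a per-session value into a hidden field and check that value in message.php. The field name js_token and both function names are hypothetical.

```php
<?php
// Create a per-session token that the page's JavaScript will copy into the form.
// The session array is passed by reference so this can be tried without a real session.
// md5(uniqid()) is unpredictable enough for a sketch, not for anything security critical.
function issue_js_token(&$session){
	$session["js_token"] = md5(uniqid((string) mt_rand(), true));
	return $session["js_token"];
}

// True only if the submitted token matches the one issued for this session.
function js_token_is_valid($session, $post){
	return isset($session["js_token"], $post["js_token"])
		&& is_string($post["js_token"])
		&& $post["js_token"] === $session["js_token"];
}
```

The form page would call issue_js_token($_SESSION) after session_start(), include an empty hidden input named js_token, and have the button's onclick set this.form.js_token.value to the token before calling this.form.submit(). message.php would then reject any request where js_token_is_valid($_SESSION, $_POST) is false. A bot that runs JavaScript will still get through, so this is a side measure like Method Four.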

Difficulties with this method: Users with older browsers may not have support for JavaScript. Some browsers may have errors with your scripts, which may need updating as new browsers appear. Some bots could catch on and submit it this way.

Conclusion
After studying different ways to prevent spam bots, I've decided to use different combinations of the above methods. The easiest approach I've discovered is, well, a combination. Combining Method One, Two, or Three with both Four and Five is fairly effective and can stop most spam bots from getting through. So far, nothing I've tried beats this combination for preventing spam.

If you're an experienced developer, integrating Akismet into your website is another helpful tool. For more information, visit Akismet's website.

Last Edit: Jul 30, 2009 6:00:28 GMT by Chris

newfieldgrafix
Guest

Defending Against Spam Bots Jul 30, 2009 7:00:41 GMT

Nice resource. But you can get bots that can get through the less complex CAPTCHA. I prefer the hidden field combined with the "What is Britnni Speeres".

I take a more forward approach to it in the form of Project Honey Pot. Personally, I think malicious bots should be illegal!

Michael

*Has a custom title*

1,462

1

October 2007

Defending Against Spam Bots Jul 30, 2009 7:11:04 GMT

I'd name the hidden field as 'email' or something Chris!

As I think it's only select fields that bots fill in.

Chris

Defending Against Spam Bots Jul 30, 2009 15:02:24 GMT

Jul 30, 2009 7:11:04 GMT Michael said:

I'd name the hidden field as 'email' or something Chris!

As I think it's only select fields that bots fill in.

"email" would be a good name for it as well. I think "title" would be a good example if you're using it on a forum though. Either way, it's still something you want to implement since it takes little-to-no work.

---

Zell: Yep. They can read some of the simpler CAPTCHA images. That's why people are constantly developing ones that are harder for bots to read... and thus harder for humans to read.