This blog was created with the BlogFile software, written by Samuel Levy.


Captcha-less spam protection

I've been running BlogFile for a fair while now. I've got over 20 blog posts here, some of which have gotten (and continue to get) reasonable amounts of attention. The result is that I've also had several thousand comments.

If someone were to go through the posts though, they may be tempted to call me a liar. "I can only count about 100 comments!" they might say. "Several thousand sounds like a massive stretch!"

Well, it's not. But luckily for you, you can't actually see most of the comments. That's because they're spam.

"You must be very vigilant, quickly flagging or deleting these spam comments, Sam!"

Nope. I'm pretty lazy. I prefer to just let the spam comments catch themselves. I also know that other people are pretty lazy, and don't want to discourage discussion by forcing users to enter in stupid captchas.

"So you have to analyse the text and figure out what is and isn't spam? That's a road that leads to false positives, and general confusion."

Yes, it is, which is why I don't do that. What I use instead is a captcha-less honey pot. It works on a couple of basic assumptions:

1. Spammers are running simple HTML-scraping scripts to find comment fields.
2. Once the form fields are captured, spammers will often make posts without using the actual loaded form.
3. Regular users don't pay attention to field names, or the underlying HTML content.

The simple rundown is this. A bot scrapes the HTML and comes to the comment form. The first fields it sees are 'name', 'email', and 'url'. Astute observers will notice at this point that I don't actually ask for an email address to leave a comment. In fact I don't ask regular users for any of these fields. They're not even visible to you. They're hidden with CSS, though not by setting display: none; or visibility: hidden;, but by using/abusing overflow rules. Any content posted to these fields automatically tells me that whoever posted was looking at the HTML, not the browser.
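As a rough sketch of the server-side half of this check (not the actual code behind this blog), the idea can be written in a few lines of Python. The decoy field names come from the description above. The HONEYPOT_HTML markup, the zero-size wrapper, and the honeypot_tripped helper are all assumptions made for illustration.

```python
# Decoy field names as described in the post; real visitors never see them.
HONEYPOT_FIELDS = ("name", "email", "url")

# One possible way to hide the decoys with overflow rules instead of
# display: none (an assumption, not the blog's real markup): a wrapper
# with no size that clips everything inside it.
HONEYPOT_HTML = """
<div style="height: 0; width: 0; overflow: hidden;">
  <input type="text" name="name" tabindex="-1" autocomplete="off">
  <input type="text" name="email" tabindex="-1" autocomplete="off">
  <input type="text" name="url" tabindex="-1" autocomplete="off">
</div>
"""


def honeypot_tripped(form: dict) -> bool:
    """Return True if any decoy field arrived with non-blank content."""
    return any(form.get(field, "").strip() for field in HONEYPOT_FIELDS)
```

A submission for which honeypot_tripped returns True can be filed as spam quietly, without telling the sender that anything went wrong.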

Next, I pre-fill a couple of other hidden 'verification' fields which capture a unique value that is generated for the user. If I can't find these in the post, or can't re-generate the same value based on user details that I've received in the post, then the user probably didn't visit the actual page where they supposedly posted the comment from.
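The post doesn't say how the verification value is generated, so the following is only one plausible way to do it: an HMAC over details the server sees again when the comment is posted. The secret key, the choice of inputs (post ID, IP address, user agent), and the field names verify_token and post_id are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical server-side secret; it must never appear in the page.
SECRET_KEY = b"replace-with-a-private-random-value"


def make_token(post_id: str, ip: str, user_agent: str) -> str:
    """Build the value that gets pre-filled into the hidden field."""
    message = "\n".join((post_id, ip, user_agent)).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def token_is_valid(form: dict, ip: str, user_agent: str) -> bool:
    """Re-generate the token from the request and compare it to the form."""
    submitted = form.get("verify_token", "")
    post_id = form.get("post_id", "")
    if not submitted or not post_id:
        return False  # the field was dropped, so the form was never loaded
    expected = make_token(post_id, ip, user_agent)
    # Compare as bytes in constant time to avoid leaking partial matches.
    return hmac.compare_digest(submitted.encode("utf-8"),
                               expected.encode("utf-8"))
```

A script that replays captured field names from a different machine, or without loading the page first, either omits the token or sends one that no longer matches.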

Finally, I keep a (hashed) record of spammers, and check against it. This is used as an internal "karma" measure to catch repeat offenders.
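What exactly gets hashed and how the karma is scored isn't spelled out here, so this is a minimal sketch under assumptions: an in-memory ledger that stores salted SHA-256 hashes of some identifier (an IP address, say) and counts strikes. The SpammerLedger class, its salt, and its threshold are invented for the example.

```python
import hashlib
from collections import Counter


class SpammerLedger:
    """Counts spam strikes per identifier, storing only salted hashes."""

    def __init__(self, salt: bytes, threshold: int = 1) -> None:
        self._salt = salt
        self._threshold = threshold
        self._strikes = Counter()

    def _key(self, identifier: str) -> str:
        # Hash the identifier so the raw value is never kept on record.
        data = self._salt + identifier.encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def record_spam(self, identifier: str) -> None:
        """Add one strike against the identifier."""
        self._strikes[self._key(identifier)] += 1

    def is_known_spammer(self, identifier: str) -> bool:
        """True once the identifier has reached the strike threshold."""
        return self._strikes[self._key(identifier)] >= self._threshold
```

A real version would persist the counts somewhere instead of holding them in memory, but the lookup would work the same way.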

So how well does it work? Very well. I've not yet had a single spam comment get through the filters, and not yet had a single false positive.

Is this long-term viable? That depends on a few factors. If my blog eventually gets to a level of popularity where people would specifically write scripts to target my comment fields, then no. It probably wouldn't work any more. That's not to say that all is lost, because I have plenty of other tricks up my sleeve, and if it's that popular then I can afford to spend some time on refining my system.

What about accessibility? To be honest, this plan may not fully work with screen readers, but I'm sure that there are some pretty simple changes that I could make to let it happen. I don't think that I get much traffic from people with screen readers, though, so for the moment the time and cost of making the adaptations isn't worth the payoff.

So that's how I've been protecting my blog from spam. It's worked very well, and I think it will continue to work for a fair while into the future. When it stops working, then some trivial changes should make it effective again.

Anonymous

Nice

 
Anonymous

Thanks for the interesting read, and having easily viewable code. Going to have to try this on future forms.

 

Interesting. Thanks.

 

It looks like the last 3 comments before me were spam comments, so apparently this tactic doesn't work on this blog (if you actually use it on this one.)

Sean's comment was spam because he just dropped a link while writing two words. The other two spammers are spammers for obvious reasons.

 
Samuel Levy

There have been a few comments people have added manually, which are meant to mimic regular spam. They're not being added by bots or automated scripts; they're being added by people.

What I've done wouldn't prevent that, nor would adding a captcha, or pretty much any other method of protection.

The idea isn't to protect against people who are adding fake spam because this is a post about stopping spam; the idea is to protect against the thousands of automated spam comments that my blog gets hit with.