All in the <head>

– Ponderings & code by Drew McLellan –

– Live from The Internets since 2003 –

Preventing Comment Spam

22 March 2004

Spam in blog comments is a very real problem for a lot of bloggers, and in order to keep their sites spam-free, we’re seeing a good number of people take steps to prevent spam being posted. Some have taken to switching comments off after a set time period, others require registration, and some have turned comments off altogether. More behind-the-scenes techniques involve complete comment moderation, shared blacklists and such. Nearly all methods restrict either the freedom of the site owner in running their site how they want to, or the interaction of those who visit it.

The ‘smart’ spammers have figured out that popular blogging tools like MovableType use the same comment field names on every site, so writing a bot to post using those field names is pretty straightforward. Less advanced (or more authentic, depending on how you see it) spammers simply cruise and post manually.

Although this may be a recent phenomenon for blogs, the problem is a combination of two old friends – email spam and forum trolls. Surely then we can reuse what we already know about these two problems to help devise solutions for comment spam.

Something that comment spam often has in common with email spam is its subject matter. For email spam we use keyword filters to pick up likely spam and flag it for attention. So how about we do the same for comment spam? If a comment triggers certain keywords, flag it for moderation and hide it until it’s approved.
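As a rough illustration of the keyword idea, here is a minimal Python sketch. The keyword list and the function names (`needs_moderation`, `comment_status`) are made up for the example and are not taken from any particular blogging tool; a real list would be tuned to the spam a site actually receives.

```python
import re

# Hypothetical keyword list, purely for illustration.
SPAM_KEYWORDS = {"viagra", "casino", "poker", "mortgage"}


def needs_moderation(comment_text, keywords=SPAM_KEYWORDS):
    """Return True if the comment contains any flagged keyword as a whole word."""
    words = set(re.findall(r"[a-z0-9]+", comment_text.lower()))
    return not words.isdisjoint(keywords)


def comment_status(comment_text):
    """Hide flagged comments until approved; publish everything else at once."""
    return "held" if needs_moderation(comment_text) else "published"
```

Matching whole words keeps the check cheap, but it also means obfuscated spellings slip straight through, which is one reason to pair it with a second technique.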

Of course, not all comment spam has a direct message. A lot of it just says stuff like “I agree” and then links to the site the spammer is trying to promote. Keyword matching is no use here, as we’re looking at the quality of the post rather than the words used. This is a problem solved in many discussion forums, mailing lists and other online communities by moderating all new users until they are proven trustworthy. This is usually applied to some sort of user account or list subscription, which isn’t desirable for a blog. But so long as you require the user to comment with an email address, and don’t publish commenters’ email addresses on the site (not a bad idea in itself), you can simply tie the moderation to the email address. The first time an address is used, the comment gets moderated; once approved, later comments from that address don’t need to be checked again.
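The first-comment moderation rule could look something like the Python sketch below. The class and method names are hypothetical, and the in-memory set and list stand in for whatever storage a real CMS would use.

```python
class EmailModerationQueue:
    """Hold the first comment from each email address until the owner approves it."""

    def __init__(self):
        self.trusted = set()   # addresses with at least one approved comment
        self.pending = []      # (address, comment text) pairs awaiting review

    @staticmethod
    def _normalise(email):
        # Treat "A@Example.com " and "a@example.com" as the same commenter.
        return email.strip().lower()

    def submit(self, email, text):
        """Publish straight away for trusted addresses, otherwise hold for review."""
        key = self._normalise(email)
        if key in self.trusted:
            return "published"
        self.pending.append((key, text))
        return "held"

    def approve(self, email):
        """Trust the address and return its held comments so they can be published."""
        key = self._normalise(email)
        self.trusted.add(key)
        released = [text for addr, text in self.pending if addr == key]
        self.pending = [(addr, text) for addr, text in self.pending if addr != key]
        return released
```

Because the addresses are never shown on the site, a spammer has no easy way to discover one that is already trusted.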

Both these techniques (ideally used together) might give the site owner the moderation options they need without forcing moderation on all comments, killing conversation and adding extra admin overhead.

- Drew McLellan

Comments

  1. § Vinay Venkatesh: The main problem with this approach is that semantic matching is a very complex thing to do well. But I don’t know that it can be implemented well, without a LOT of work.

    But it’s a good idea. The way it works for email however is that there is usually another process running on the server. Most web hosts do not allow users to do this, so it would mean writing it into the perl/php/whatever-language-your-cms-is-in and tying it into your cms. This can only mean one thing. Performance takes a drastic hit.

    But I may not be seeing a better way to implement this.
  2. § Drew: As the hit would only occur at time of comment submission, this shouldn’t be too much of a concern. Unless you have an insanely busy site with a constant stream of comments, the performance hit shouldn’t be a problem at all.

    By the time performance was an issue, you’d be looking at a more serious forum setup than a simple comments feature, I reckon.
  3. § Michael: A rudimentary version of Captcha could be used to help reduce spam. WWdN uses something like this, and I believe (I don’t know for sure) this is how it works:

    1. Images with five or six digit numbers are created and obscured so a human could read them and a computer couldn’t.

    2. Each image filename (something random) and a “key” number are correlated (maybe in a database).

    3. A PHP script calls a file with a random key as an argument in the query string (imager.php?key=1). The imager file returns the specified image.

    4. The commenter types in the number they see.

    5. The number is checked and the comment is accepted or flagged.

    Something simple like that could be done. You could also use words or letters and number or whatever. Of course, it could be changed to use other methods, etc.
  4. § Drew: A drawback of Captchas is that they can create an accessibility barrier, because of course they can’t be understood by a screenreader or similar assistive devices.

    Another impracticality is that this doesn’t stop the manual spammers. As I use a (currently) obscure CMS, the only spam I’ve had through so far is manually submitted.
  5. § Jesse Rodgers: Cheap Viagr4! D1r3ct from Canada! M4ke her scream!

    oops sorry.
  6. § Chris Vincent: I read someone suggesting the Captcha method with the addition of audio recognition. I could imagine this solving most of the accessibility problem. A low-quality MP3 of less than a second wouldn’t be a bandwidth burden, either.
  7. § Drew: Another alternative I’ve seen suggested is the use of a short comprehension question – which would be another possibility.
  8. § DD: I’d like to see a registration system being used by MovableType so one has to register and confirm their account via email. If they feel like causing chaos on blogs then their account and IP can be banned.
  9. § DarkBlue: I’ve just come across this post via a link on asterisk* (http://www.7nights.com/asterisk/links.php). An amazing coincidence since I have just implemented a couple of comment spam defenses on my own site along with a short article describing the different mechanisms (http://urbanmainframe.com/folders/blog/20040323/page_1.htm).

    I have used a variety of devices to prevent spamming, including Captchas (with a nod to accessibility requirements) and blacklisting.

    Seems to be working well so far.
  10. § Dave: Why not keep a blacklist of sites instead? Any comment containing a link to a known porn/viagra/whatever site could be stopped until reviewed or discarded.
  11. § rotoass: I read all my spam and I’ve found most of my porn from my spam.. sometimes my mouse goes all over the place but I like it..
  12. § neri: In case you have your own server running, it might be possible to involve Bayesian filters like bogofilter. Its performance needs are small and it does not require the text it checks to be in plain email format. You can train it with any text file you like and you can check any plain text file you want. Bogofilter returns just 0 for no spam, 1 for spam and 2 for “not sure”. It shouldn’t be so difficult to write a server for it, could be even done by a php file. So the spam-detect service could be shared by several bloggers.
    Such short statements with links to other sites won’t be filtered. But at least it can throw all that regular stuff/shit.
  13. § Anthony K. Valley: For my MovableType weblogs, I use Jay Allen’s MT-Blacklist, fully equipped with a community-supported blacklist of known MT spammers.

    As TypeKey Authentication releases with MT 3.0, it’s possible that Jay’s app may diminish. So I have been working/researching/porting a similar system for Dean Allen’s Textpattern. But time escapes me.
  14. § david: In order to get deeper you need to raise discussions regarding various aspects of the subjects including its pros and cons.

    No Download Casino
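
The key-and-number scheme outlined in comment 3 could be sketched roughly as follows. This is only an assumption about how such a system might work, not a description of any real site. It covers the key and answer bookkeeping only, and it generates a fresh number per key instead of drawing on pre-made images; rendering the obscured image is left out. `ChallengeStore` and its methods are hypothetical names.

```python
import secrets


class ChallengeStore:
    """Map opaque keys to the numbers shown in challenge images."""

    def __init__(self):
        self._answers = {}

    def new_challenge(self, digits=5):
        """Create a random number and return the key to embed in the comment form."""
        key = secrets.token_hex(8)
        answer = "".join(secrets.choice("0123456789") for _ in range(digits))
        self._answers[key] = answer
        return key

    def answer_for(self, key):
        """The number the image script would draw for this key (server side only)."""
        return self._answers[key]

    def check(self, key, typed):
        """Accept only a matching number; each key can be used once."""
        expected = self._answers.pop(key, None)
        return expected is not None and expected == typed.strip()
```

The key travels in the query string and the form, while the number itself never leaves the server except as an image, so a bot that can only read the markup has nothing to copy.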

About Drew McLellan

Drew McLellan has been hacking on the web since around 1996 following an unfortunate incident with a margarine tub. Since then he’s spread himself between both front- and back-end development projects, and now is Director and Senior Web Developer at edgeofmyseat.com in Maidenhead, UK (GEO: 51.5217, -0.7177). Prior to this, Drew was a Web Developer for Yahoo!, and before that primarily worked as a technical lead within design and branding agencies for clients such as Nissan, Goodyear Dunlop, Siemens/Bosch, Cadburys, ICI Dulux and Virgin.net. Somewhere along the way, Drew managed to get himself embroiled with Dreamweaver and was made an early Macromedia Evangelist for that product. This led to book deals, public appearances, fame, glory, and his eventual downfall.

Picking himself up again, Drew is now a strong advocate for best practices, and stood as Group Lead for The Web Standards Project 2006-08. He has had articles published by A List Apart, Adobe, and O’Reilly Media’s XML.com, mostly due to mistaken identity. Drew is a proponent of the lower-case semantic web, and is currently expending energies in the direction of the microformats movement, with particular interests in making parsers an off-the-shelf commodity and developing simple UI conventions. He writes here at all in the head and, with a little help from his friends, at 24 ways.