They do their best to clog up every public information exchange with their porn, drugs, financial scams, and other such junk. It's a classic tragedy of the commons: it costs them almost nothing to post those messages, and even though the response rate must be ridiculously low, there's enough to bring them some benefit, though it ruins things for everybody else.
I actually long for the days when spam was just an email problem. But today, in addition to email, spambots post to every blog and forum they can get their grubby little zombie hands on. That includes not just the common software like phpBB; even our own blog and forum get many dozens of spam posts per day, despite the fact that this is brand new software just written a couple months ago. When we add a user-comment feature to the documentation wiki, no doubt spammers will attack that too. And if you ever develop any web app with a public comment/message system, be prepared for the zombie hordes to start posting spam to your app as well.
But enough grousing. What can we do about this infestation? Here are some things we've tried, and some that we have not (yet):
1. Require email confirmation of posts and account registrations. If you've commented on this blog, or on the forum (without signing in), then you've seen this. Spambots almost always post with a fake email address, and will never see your confirmation message, much less reply to it. (But not always: one day we had a half-dozen spam comments posted and actually confirmed... so see item 2.)
2. Add the ability to ban (blacklist) by email address. If you do get a spammer dumb enough to confirm his email address, then you add it to the blacklist, and reject any future postings automatically.
3. Add other administrator features that make it quick and easy to filter out the spammers. I'll illustrate this by counterexample: I run a forum on polywell fusion, based on phpBB (this predates Yuma). I get a couple hundred spammers signing up for accounts on that per day, and phpBB provides no easy way to weed them out. Its interface is so awful that I wrote a desktop RB app that connects to the database (over a rather convoluted shell-and-SSH pipeline) and lets me quickly sort and ban these bozos. In your own apps, try to build that functionality in.
4. Even better, registrations or posts should automatically expire if they're not confirmed by email within a short time. We're not doing this here yet, but we should, because frankly I'm getting tired of going into the comment list several times a day and deleting all the spam, which is almost always unconfirmed.
5. Finally, another trick we haven't tried yet: change the names of your form fields so that they don't look so inviting. For example, the comment form you see below currently has inputs named "name", "url", and "content". It doesn't take a very bright bot to figure out what that is. What if we called them "x", "y", and "z"? Or "phone", "disk_size", and "card_number"? Of course changing them might interfere with the auto-fill function many web browsers provide, since that is also based on field name. But if it confuses the spambots, perhaps it's worth it.
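To make items 1, 2, and 4 above a bit more concrete, here is a minimal Python sketch of a store that holds posts until they are confirmed by email, rejects banned addresses, and expires anything left unconfirmed. It is not how this blog is implemented; the class name `PendingPosts`, its methods, and the 24-hour window are all assumptions for illustration, and actually sending the confirmation email is left to the caller.

```python
import secrets
import time

# Assumed window; the post only says "a short time".
CONFIRM_WINDOW_SECONDS = 24 * 60 * 60


class PendingPosts:
    """Holds unconfirmed posts until their author confirms by email."""

    def __init__(self, window=CONFIRM_WINDOW_SECONDS):
        self.window = window
        self.banned_emails = set()  # item 2: the blacklist
        self.pending = {}           # token -> (email, content, submitted_at)
        self.published = []         # list of (email, content)

    def ban(self, email):
        """Blacklist an address (stored lowercased and stripped)."""
        self.banned_emails.add(email.strip().lower())

    def submit(self, email, content, now=None):
        """Return a confirmation token, or None if the address is banned."""
        email = email.strip().lower()
        if email in self.banned_emails:
            return None
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(16)
        self.pending[token] = (email, content, now)
        return token  # the caller would email a link containing this token

    def confirm(self, token, now=None):
        """Publish the post if the token is known, fresh, and not banned."""
        now = time.time() if now is None else now
        entry = self.pending.pop(token, None)
        if entry is None:
            return False
        email, content, submitted_at = entry
        if now - submitted_at > self.window or email in self.banned_emails:
            return False
        self.published.append((email, content))
        return True

    def expire(self, now=None):
        """Item 4: drop posts not confirmed within the window; return the count."""
        now = time.time() if now is None else now
        stale = [t for t, (_, _, ts) in self.pending.items()
                 if now - ts > self.window]
        for token in stale:
            del self.pending[token]
        return len(stale)
```

Calling `expire()` from a periodic job would take the place of deleting unconfirmed spam by hand several times a day.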
If you have any other ideas, war stories to share, etc., please share them below. And if you have drugs, porn, or pirated software to sell, please go away!
I run a phpBB forum for our neighborhood. Spambots loved to register for the site. I then *added* a field to the registration form requiring them to enter their street address. On validation, I checked that the street address contained one of the four valid streets in our neighborhood. That shut down the registration spambots, let our neighbors continue to register, and had an added benefit of helping to identify neighbors when they registered with the name of "CookieMonster123". {grin}
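A check like the one described above could look roughly like this in Python. The street names are made up for the example (the comment does not list the real ones), and `is_neighborhood_address` is a hypothetical helper, not part of phpBB.

```python
# Hypothetical street names; substitute the real ones for your neighborhood.
VALID_STREETS = ("maple street", "oak avenue", "elm court", "birch lane")


def is_neighborhood_address(address, valid_streets=VALID_STREETS):
    """Return True if the address mentions one of the known streets."""
    # Lowercase and collapse runs of whitespace so "12  Maple   Street" matches.
    normalized = " ".join(address.lower().split())
    return any(street in normalized for street in valid_streets)
```

The same idea works for any question whose answer your real users know and a generic bot does not.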
That's a clever solution, Kevin! A neighborhood forum is a neat idea, too.
I should also have mentioned using a Captcha. That's a very popular approach these days, though like almost anything else, it engages you in a technological arms race with the spammers: there is software these days that can break most popular captchas most of the time. Still, in combination with other techniques, I bet it reduces the spam quite a bit.
On the subject of captchas, any plans to implement graphics to replicate some of PHP's GD functions? imagestring springs to mind.
Sorry, I decided the question was more appropriate on the forums.
Don't forget about the content of the messages themselves. For my own forums I have implemented some basic filtering based on content.
I check for standard spam words and their variants that are unlikely to occur in a legitimate posting and assign a numerical weight for each of them. If the "score" of a posting is higher than a certain threshold it is rejected.
I also limit a poster to a maximum of two URLs, and for any posting that gets banned I add any URLs it contains to a blacklist so that any post which contains those URLs will also be rejected.
That last trick alone has reduced spam on my site to basically zero, and yet I still allow anonymous posting, which is much more convenient for visitors.
Don't be afraid to set a low threshold for content filters - as long as you tell a user why their post has been rejected, it's easy for them to modify it so that it gets through - a bot is unlikely to be able to do that though.
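Here is a rough Python sketch of the filtering described in this comment: weighted spam words, a cap of two URLs, and a blacklist of URLs harvested from banned posts. The word list, weights, and threshold are invented for illustration, and `check_post` and `ban_post` are hypothetical names. The comment doesn't say how a post gets banned, so the sketch assumes `ban_post` is called separately (by a moderator, say).

```python
import re

# Hypothetical words and weights; tune these for your own forum.
SPAM_WEIGHTS = {"viagra": 5, "casino": 3, "pills": 2, "cheap": 1}
THRESHOLD = 5   # posts scoring higher than this are rejected
MAX_URLS = 2    # limit mentioned in the comment

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

banned_urls = set()


def extract_urls(text):
    """Find URLs, trimming trailing punctuation. Lowercasing is a simplification."""
    return [u.rstrip(".,;:!?)").lower() for u in URL_RE.findall(text)]


def check_post(text):
    """Return (accepted, reason); reason is shown to the poster on rejection."""
    urls = extract_urls(text)
    if any(u in banned_urls for u in urls):
        return False, "post contains a blacklisted URL"
    if len(urls) > MAX_URLS:
        return False, "post contains more than %d URLs" % MAX_URLS
    words = re.findall(r"[a-z0-9']+", text.lower())
    score = sum(SPAM_WEIGHTS.get(w, 0) for w in words)
    if score > THRESHOLD:
        return False, "spam score %d exceeds threshold %d" % (score, THRESHOLD)
    return True, ""


def ban_post(text):
    """Add every URL in a banned post to the blacklist."""
    banned_urls.update(extract_urls(text))
```

Returning the reason along with the verdict is what makes a low threshold tolerable: a human can read it and reword the post, while a bot most likely cannot.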
Those are some great ideas! Maybe we can wrap up some of that functionality in an open-source post-filtering module, to make it easier to add such tricks to any system that involves user comments/posts.
Incidentally, it looks like the reCAPTCHA project could be used from a Yuma app. This is a free captcha (including both visual and audio, which is important for visually impaired users) that you can embed into pretty much any web app.
As you can see, we've now added a captcha (care of reCAPTCHA) to our blog. Prior to this, we'd been getting a couple dozen spam comments per day. We'll see how this cuts it down, and let you know!
On the subject of reCAPTCHA, it should also be mentioned that it serves to OCR hard-to-read words in scanned books, so it's a win-win situation.
Also, I noticed that for commenters' URLs you just output a normal link. It is much better (and suggested by Google) to use rel="nofollow" (see info here: http://en.wikipedia.org/wiki/Nofollow), as it takes away the payoff for bots that are out to do dirty SEO tricks.
You guys should add that to the forum too. By the way, is the source for the forum or blog available online?
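For anyone wondering what that looks like in practice, here is a small Python sketch that renders a commenter-supplied URL as a link with rel="nofollow". The function name `render_user_link` is hypothetical, and this is not taken from the blog's actual source.

```python
from html import escape
from urllib.parse import urlparse


def render_user_link(url, label):
    """Render a user-supplied URL as an HTML link marked rel="nofollow"."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        # Refuse javascript: and other odd schemes; show the plain label only.
        return escape(label)
    return '<a href="{}" rel="nofollow">{}</a>'.format(
        escape(url, quote=True), escape(label)
    )
```

Escaping both the URL and the label matters as much as the nofollow attribute, since both come straight from the person posting.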

