In addition to working on potential solutions to referrer log spam, I’ve been investigating the problem of comment spam, with a view to producing a plugin for Textpattern when the inevitable happens.
My goal is to implement a comment spam blocker that is stand-alone and automated. That means it can’t rely on moderation, keyword lists or blacklists.
The plan was to use a series of hurdles designed to stop automated submission scripts, but allow real web browsers with human users to pass. The hurdles would include an extension of the nonce system already employed by Textpattern; a simple javascript-based proof of work; and (where javascript is unavailable) a visual CAPTCHA.
I’ve been aware of Captchas for a while. I studied AI at university, so I know first-hand how difficult it can be for a machine to pass a Turing test. Fraudsters have been using OCR-based crackers to break the Captchas used by large, rich targets like Ebay for some time now; but Captchas have been getting progressively better, and multi-word Captchas can probably be tailored to reduce automated exploits to a manageable level for most purposes.
Unfortunately, it turns out a very clever hacker discovered a virtually infallible way of breaking Captchas without resorting to an AI arms race: copy the image and present it as a Captcha on a bogus ‘free porn’ web site. An unsuspecting human solves the problem for them, and the spammers return the answer to the original site. There is no shortage of people seeking free porn, so it’s done very quickly.
The “free porn” method can be used to break any Captcha, not just visual ones. It doesn’t matter how cleverly designed it is or how hard it is for a computer to solve; if a human can use it, it’s broken [2].
In fact, the attack isn’t limited only to Captchas. It could probably be used with any proof-of-work system, like the Hashcash-style methods that are starting to emerge. The spammers simply copy the problem, present it on their bogus porn site, let someone else do the work, then relay the results back to the original site.
A related problem is commonly considered in cryptography: the man in the middle attack, in which an adversary intercepts, modifies and relays messages between two systems. Most cryptographic solutions use some form of public-key authentication, which typically relies on a trusted third party or web of trust.
Secure proof-of-work systems like Hashcash for email use server authentication to eliminate MITM replay attacks. A Hashcash client ensures that the token it generates is valid only for the email host it intends to communicate with. That’s not possible in a web application, because the client doesn’t run trusted code. The bad guys can simply serve up their own Javascript that calculates a token for their target, and the user won’t know any different.
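To make the binding concrete, here is a minimal Python sketch of a Hashcash-style token tied to one recipient. It is a simplification and not the real Hashcash stamp format: the names `mint`, `verify` and `BITS`, the `resource:counter` layout and the 12-bit difficulty are all assumptions chosen for illustration.

```python
import hashlib
from itertools import count

BITS = 12  # assumed difficulty: required number of leading zero bits


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def mint(resource: str, bits: int = BITS) -> str:
    """Try counters until sha1("resource:counter") has `bits` leading zero bits."""
    for counter in count():
        token = f"{resource}:{counter}"
        if leading_zero_bits(hashlib.sha1(token.encode()).digest()) >= bits:
            return token


def verify(token: str, expected_resource: str, bits: int = BITS) -> bool:
    """Accept a token only if it names this recipient and carries enough work."""
    resource, _, _counter = token.rpartition(":")
    if resource != expected_resource:
        return False
    return leading_zero_bits(hashlib.sha1(token.encode()).digest()) >= bits


token = mint("blog.example")
print(verify(token, "blog.example"))   # the intended recipient accepts it
print(verify(token, "other.example"))  # any other recipient rejects it
```

The check in `verify` only helps when the code calling `mint` is trusted to name the right recipient. In a browser the script comes from whoever served the page, so a bogus site can simply call the equivalent of `mint("blog.example")` on a visitor’s machine.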
Back in the blogosphere, the trusted third party approach is exactly the one used by Sixapart to tackle MovableType comment spam: Typekey. A proof-of-identity authentication system may in fact be the only other alternative to proof-of-work challenges.
Now comes the disheartening realisation.
If I’m right, and proof-of-work challenges are ultimately unworkable for comment spam, then authentication systems (centralized or otherwise) are also broken. After all, spammers can simply sign up for a throwaway account, use it until it gets blacklisted, then sign up for another. So how does an authentication system like Typekey prevent spammers from automatically signing up for multiple accounts?
Any authentication system that attempts to prevent spam based on comment authors’ identity will suffer the same problem. The only way to slow spammers from signing up for many throwaway accounts is to use a proof-of-work challenge, a method that we already know is broken, or at least doomed. Indeed, concentrating the point of attack in a single place (the registration page) probably makes the spammers’ job easier, since they don’t need to worry about the different kinds of captcha and proof-of-work challenges in use on different blogs.
It doesn’t really make any difference whether the authentication system is centralized, like Typekey, or distributed, like Identity Commons. That’s just an implementation detail that determines where the database of identities is stored. The problem is that the only thing an identity authentication scheme proves is that someone registered using that particular login name or email address. And knowing someone’s name or pseudonym doesn’t solve the security problem we have, because we don’t know who the spammers are. Proof of identity doesn’t help us determine someone’s intentions, which is exactly what we need to know in order to decide whether or not to allow them to post a comment.
The only solution I can see at present [1] is an authentication system based not on the idea of identity, but on reputation. A web of trust: I know the user who authenticates herself as so-and-so is an acceptable risk, because 27 other people I trust vouch for her. It reduces to a different kind of proof-of-work; one that can’t be delegated to an unsuspecting porn seeker, because it involves convincing a bunch of people that you are trustworthy.
Unlike Ebay’s reputation system, which is based on the notion of an absolute, global measure of trustworthiness for each person, a Web of Trust uses a relative measure of trust. Creating a bunch of bogus identities in order to artificially inflate someone’s score won’t work, because what matters is not how many people vouch for them, but who endorses them. Ebay (and Google’s Pagerank) answer the question, “what is this person’s relationship to the population in general?”. A web of trust is quite different; it answers the question, “what is this person’s relationship to me?”.
Fortunately, the blogosphere already has its own web of trust mechanism. It’s called a blogroll.
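The difference between a global score and a relative one can be shown with a small Python sketch. Everything in it is hypothetical: the `vouches` data, the names in it, and the three-hop limit are assumptions, and a real web of trust would need signed, verifiable endorsements rather than a dictionary.

```python
from collections import deque


def global_score(vouches, target):
    """Ebay-style measure: count everyone who vouches for `target`."""
    return sum(1 for endorsed in vouches.values() if target in endorsed)


def trust_distance(vouches, me, target, max_hops=3):
    """Relative measure: hops from `me` to `target` along vouch links, or None."""
    seen = {me}
    queue = deque([(me, 0)])
    while queue:
        person, hops = queue.popleft()
        if person == target:
            return hops
        if hops == max_hops:
            continue  # do not follow chains longer than max_hops
        for endorsed in vouches.get(person, ()):
            if endorsed not in seen:
                seen.add(endorsed)
                queue.append((endorsed, hops + 1))
    return None


# Hypothetical data: my blogroll is the set of people I vouch for directly.
vouches = {
    "me": {"alice", "bob"},
    "alice": {"carol"},
    "sybil1": {"spammer"},
    "sybil2": {"spammer"},
    "sybil3": {"spammer"},
}

print(global_score(vouches, "spammer"), global_score(vouches, "carol"))
print(trust_distance(vouches, "me", "spammer"), trust_distance(vouches, "me", "carol"))
```

The spammer wins on the global count, with three bogus endorsements against Carol’s one, but has no path from me at all, while Carol is two hops away through someone on my blogroll.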
—-
[1] Other than a centralized system that charges a fee for registering identities, or attempts to use “official” ID papers to map online identities to meatspace persons – both of which have been tried unsuccessfully in the past, and both of which create all sorts of privacy and security problems.
[2] As several people have pointed out in comments, there’s another reason to consider Captchas broken: accessibility problems. I haven’t covered those issues here because they’re well documented elsewhere.
18 December 2004, 10:01 by Alex ·
Alex is a software developer from Melbourne, Australia. Threshold State is his consulting business.
If Google rank denial (where a url is redirected through a localhost script) was implemented ‘out of the box’ in TXP’s default install, I doubt very much if we’d ever see much TXP spam.
For example, comment provider HaloScan gets very little spam because URLs in the comments are not (generally) indexed by Google, and there’s virtually no comment spam on Blogger for similar reasons.
The ‘web of trust’ is an interesting idea but I’m not sure if I completely understand it. Could each bogus identity vouch for other bogus identities, thus artificially boosting their collective score? And what stops SEOs with blogs validating their own bogus identities… and so on…
Essentially the web-of-trust is what google does with parts of page rank, and I think we’ve seen how that system can be exploited.
— nixon Dec 19, 06:28 am #
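The Google rank denial mentioned in the comment above amounts to rewriting every commenter-supplied link so that it points at a local redirect script instead of the target site. A rough Python sketch follows; the `/out` path and the function name `denied_link` are assumptions for illustration and have nothing to do with Textpattern’s actual code.

```python
from urllib.parse import quote, urlsplit


def denied_link(url, redirect_path="/out"):
    """Rewrite an outbound URL so it passes through a local redirect script.

    Returns None for anything that is not plain http(s), so that
    javascript: and similar schemes never reach the redirect script.
    """
    if urlsplit(url).scheme.lower() not in ("http", "https"):
        return None
    return f"{redirect_path}?url={quote(url, safe='')}"


print(denied_link("http://example.com/a?b=1"))
# /out?url=http%3A%2F%2Fexample.com%2Fa%3Fb%3D1
```

For the denial to work, the redirect script itself would presumably have to be kept out of search engine crawls, for instance by excluding it in robots.txt.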
In a web of trust, pointers from unknown sources are ignored. It doesn’t matter if a spammer creates thousands of bogus identities or blogs to vouch for him (as is common on Ebay); only affirmations of trust from known sources are measured in a WoT.
An Ebay-like reputation system tries to calculate a single, global reputation score for each identity. A web of trust doesn’t do that; it answers the question, “what is this person’s relationship to me?”
Google’s Pagerank isn’t really a WoT either; it’s essentially the same as Ebay, in that it calculates a single Pagerank score for each “identity” (web site, or page). Such schemes are easily exploitable, but the same exploit doesn’t work with a web of trust.
— Alex Dec 19, 07:02 am #
— Bill Kearney Dec 19, 07:08 am #
That was my first thought, but unfortunately it doesn’t really help. The spammers can simply use a spider to copy the Captcha image (or Javascript hash challenge, or whatever else) and serve it up on their own page. The original site can’t detect this. (If you could detect the spammer’s spider, you wouldn’t need Captchas in the first place.)
A short lifespan probably won’t help either – you’re playing against a porn site, which has no shortage of users at any given moment. There’d be no problems solving it in a matter of seconds.
The only measure I can think of that might help is to include some text in the Captcha image identifying the originating web site. That won’t prevent the problem, but it might give the porn site’s users a clue that something is wrong.
— Alex Dec 19, 07:28 am #
A spammer could register many TypeKey accounts; however, MT gives each weblog the option of moderating first-time commenters who use TypeKey. A would-be spammer would have to post one intelligent, in-context comment that fools the weblog publisher into believing they are a legitimate commenter. It seems to me that this requires a level of human intelligence above a captcha or other proof-of-work methods, and would be virtually impossible to program, or too time-intensive for the spammer or even the masses seeking “free porn.”
Also, since TypeKey requires an identity (however bogus, and whatever the intentions), it makes identifying and defending against their activity easier than IP banning or throttling. While far from perfect, it is an improvement.
I like the idea of a reputation system element to these systems quite a bit. Combined with some of the concepts of Blacklist, it could also be used as a means of rapidly responding to serial comment spamming in a distributed fashion. If a few weblogs report getting flooded with spam (perhaps the throttle mechanism is the trigger) by a TypeKey identity (or group), other users’ systems can be alerted (also in an automated fashion) to suspend that account and shut down its ability to continue.
Need to think about it some more.
— Timothy Appnel Dec 19, 07:40 am #
I also share your concerns about captchas but for slightly different reasons: I’m dyslexic and have a huge problem with image-based Turing tests. Some I can see, whereas others take me ages to decode, if I’m able to do it at all. My dyslexia is quite mild so I guess others have real problems with them.
There are other forms of captchas but I’ve no idea how you’d do an audio-based one in PHP on a standard LAMP system. I guess you could use some form of logic puzzle but I don’t know how you’d make it accessible to visually impaired folks, and of course, all these methods would be ‘crackable’ with the methodologies you’ve discussed.
**
What are your thoughts on page-rank denial? If a blogging platform adopted it as standard, don’t you think spam attacks on those platforms would substantially decrease?
— nixon Dec 19, 07:47 am #
We need something new—perhaps this is it.
— Elliott Back Dec 19, 07:56 am #
Perhaps I’m biased, but I think moderation systems (like keyword, URL and IP blacklists) just shift the problem from one place to another. To paraphrase Al Sessions, the time saved managing spam is spent managing spam management software.
On the other hand, it might be possible to turn something like Typekey into a Web of Trust by publishing moderation decisions for others to use; and fetching moderation decisions from other (known, trusted) blogs. i.e. when someone on my blogroll deletes a comment by EvilSpammyPerson, they’re automatically blocked on my site also. (Such a system should only trust moderation decisions from known sites, to prevent poisoning with bogus or malicious moderation)
— Alex Dec 19, 07:59 am #
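The idea of sharing moderation decisions between trusted blogs, raised in the comment above, could look something like the following Python sketch. The function name, the site names and the shape of `published_decisions` are all assumptions; how the decisions would actually be published and fetched is left open.

```python
def blocked_commenters(my_blogroll, published_decisions):
    """Collect identities blocked by sites on my blogroll.

    Decisions from any site that is not on the blogroll are ignored, so a
    stranger cannot poison my block list with bogus or malicious reports.
    """
    blocked = set()
    for site, identities in published_decisions.items():
        if site in my_blogroll:
            blocked.update(identities)
    return blocked


# Hypothetical moderation feeds: site -> identities whose comments it deleted.
published_decisions = {
    "friend.example": {"EvilSpammyPerson"},
    "stranger.example": {"InnocentCommenter"},
}

print(blocked_commenters({"friend.example"}, published_decisions))
```

Only the decision published by the blogroll site takes effect here; the report from the unknown site is never consulted.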
Pagerank denial can’t hurt, but I don’t think it will have much of an effect. Most people don’t publish referrer links for exactly that reason, but referrer spam is on the rise – even if only 1% of sites pass pagerank, spamming still pays off.
Same deal with email spam. As more people employ better filtering, spammers just counter by increasing the size of their target population.
— Alex Dec 19, 08:03 am #
I’m trying to get a handle on your web of trust idea, and I’m wondering how it would handle someone like my Dad, the guy with the AOL address, the caps lock key engaged, and no history of commenting anywhere else.
Because a goodly portion of the problem isn’t with high traffic high Page Rank blogs, it’s with the guy who just started a blog a few months ago, and his readers/commenters mostly consist of his even less web-literate family and friends. Trust me, they won’t do Typekey. They won’t jump any hurdle you place in front of a commenter, in hopes of thwarting a spambot. They’ll just not comment.
I still think that there needs to be a triad of solutions, with options for the blog owner (option to moderate, use Typekey, Blacklist, whatever), newly configured code in the apps themselves for this specific issue (forced two-way browser communication to comment, code unique to each install, plugins, etc.), and server level solutions by the web hosts.
— Reid Dec 19, 09:02 am #
Thanks for the comments. Yes, you’re right that a web of trust would create a hurdle similar to that of Typekey. People would have to authenticate themselves somehow, which means some kind of login or key system. From the user’s point of view it’s not really any different to identity authentication like Typekey – it’s what goes on behind the scenes after you’ve logged in that’s different.
I'm not even convinced that a web of trust is workable at all. It's been tried before with limited success, mostly because of usability problems. I mentioned it because, in the long term, it seems to be the only measure that actually tackles the problem of spam directly, by answering the question "who do I trust to write comments on my blog?"
And yes, if a web of trust were used, it ought to be one of several measures in place. A highly trusted user (e.g. someone who can prove they’re on my blogroll) might be able to comment directly, without any other hurdles. Someone who is untrusted, or removed by many degrees of separation, might have to pass manual moderation or a Captcha or something else.
— Alex Dec 19, 09:52 am #
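The tiered approach in the comment above, where more distant commenters face more hurdles, reduces to a small policy function. This Python sketch assumes the web of trust has already produced a number of hops between the blog owner and the commenter (or `None` when there is no connection); the thresholds and the hurdle names are arbitrary choices for illustration.

```python
def hurdle_for(hops):
    """Pick a comment hurdle from the commenter's distance in the web of trust."""
    if hops is not None and hops <= 1:
        return "post directly"      # e.g. someone on my blogroll
    if hops is not None and hops <= 3:
        return "captcha"            # a friend of a friend
    return "manual moderation"      # unknown, or many degrees removed


for hops in (1, 3, 7, None):
    print(hops, "->", hurdle_for(hops))
```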
Yes, it’s an arms race, but consider that it also develops evidence for patterns of abuse. Theft of computing services is a crime and it’s high time folks started getting pinched for it.
— Bill Kearney Dec 19, 09:12 pm #
An IP address and/or referrer check will help defend against simple inlining of the image. But a quick google turns up anecdotal evidence that spammers have already progressed past that, and are now copying and re-serving the images from their own web site.
Bear in mind also that an IP or referrer check will block a few legitimate users who are stuck behind proxies (AOL in particular) and firewalls.
A short timeout probably won’t help, and will also deter some legitimate users. Some people will take longer than 60 seconds to preview and submit their comment. The Boing Boing article points out that porn sites get more than enough traffic to solve captchas in a few seconds.
— Alex Dec 20, 06:37 am #
I’ll shoot down the idea of verifying people are who they claim to be via email as a bad one, for two reasons: Spammers can just take one of their fake domains, forward *@domain to a single bot whose job is to follow links in the email it gets. Secondly, it creates a gateway to an open relay for spammers to email people with.
If any solution exists in fooling the bots, I think it will require masking the captcha/Turing test so that the bot doesn’t know where it is. That is, make finding the Turing Test a Turing Test in and of itself.
Bots can look at HTML source for your comments form and see “Name”, “Email”, “Botfilter” etc fields, and fill them in appropriately. But what if the “Botfilter” field changed its name/location with every page load?
Humans could still find it, because they just see a field waiting to be filled in and never see the names of the form fields (especially if site authors start using CSS to tint the background of required fields a certain color, say, red), but how would a bot go about spotting that?
I’ve been playing around with some custom web publishing software written for Rails and tackling this sort of problem is tricky, but I don’t see any completely obvious attack, or conversely, any completely obvious discouragement to normal people.
What do you think, Alex?
— Matt Lyon Dec 21, 02:32 am #
Randomizing field identifiers is one of the measures I’ve already been working on, including tinkering with ways of using CSS to hide things from humans. It hadn’t occurred to me to think of those measures as Captchas themselves, though. That’s quite an insight.
I should also point to Odinsdream’s post in this thread:
”..And, as part of registering to view porn… you have to examine someone’s comment and mark it as “Spam” or “Human” !!!”
..which is a very clever solution. Maybe not quite practical in the form described there, but I think it’s the counterexample needed to disprove my argument that Captchas are broken. They’re only broken when it’s people vs. machine.
Following on from your comments on the Textdrive forum: I’m less enthusiastic about the idea of rating systems. For one, they can create almost as much work as they save, and they’re not hugely successful in other areas (e.g. Slashdot, Cloudmark). They also achieve a very different thing: Captchas and the like are objective and content-neutral, but ratings are subjective and concerned primarily with the quality of the content. What happens when the Vast Left Wing Conspiracy and the Vast Right Wing Conspiracy both start tagging each others comments as spam?
Oh, and regarding word math puzzles: there’s already an automated exploit right under our noses.
— Alex Dec 21, 04:26 pm #
Oh, Jeez. And I thought it couldn’t get worse. I was just worried about someone like my Dad getting caught in a “web of trust” where his credit report was one big empty. I hadn’t thought about non-spammers manipulating it to their whims.
Thanks for the nightmare, Alex. Let’s try to avoid that one, eh?
— Reid Dec 21, 09:24 pm #
I forget who said it, but someone somewhere said, “Time spent managing spam software is time spent managing spam.” I couldn’t agree more. With email, I spend less than a minute each week training SpamAssassin, which I find an acceptable trade-off. But I do not expect similarly sophisticated comment-spam software for some time to come.
One of my favorite snippets off of this disc set is the soundtrack to an old IBM commercial from the 50s, whose mantra is, “Machines should work. People should think.”
At any rate, the random form field thing is definitely a good idea; there are just a few gotchas I’ve been able to come up with:
* There must be some way of validating the random fieldnames. Otherwise, the bot will just generate its own. In Rails, I’ve so far been storing this in the session cookie Rails gives you for free, but a way to do this without cookies would be ideal… I think I’ve got an idea on how to do this with its “undefined_action” controller actions, but I’d have to spend some time on it.
* It might be a good idea to avoid complete randomness in favor of permuting the strings off of something that will be unique to each install. Say, the name of the site. On the other hand, maybe not, as that could make it easier for bots to figure out what is where.
* For those who care about accessibility, tags will become the next target, if the form field names are incomprehensible. Perhaps these should be customized on a per-install basis.
...It sounds like this is the start of a really good idea, and hopefully it will take many years for the bot-writers to catch on.
— Matt Lyon Dec 22, 02:03 am #
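One cookie-free way to validate randomized field names, as wished for in the comment above, is to derive them from a per-install secret with an HMAC, so the server can recompute them on submission without storing anything. The Python sketch below is an illustration under stated assumptions and is not the Rails code mentioned in that comment: `SECRET`, `MAX_AGE`, `FIELDS`, the hidden `nonce` field and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets
import time

SECRET = b"change-me-per-install"  # assumption: a secret unique to each install
MAX_AGE = 3600                     # assumed lifetime of a rendered form, in seconds
FIELDS = ("name", "email", "message")


def field_name(nonce: str, logical: str) -> str:
    """Derive the obscured form-field name for one logical field."""
    mac = hmac.new(SECRET, f"{nonce}:{logical}".encode(), hashlib.sha256)
    return "f" + mac.hexdigest()[:16]


def render_fields(now=None):
    """Return (nonce, {logical: obscured}) for one page load."""
    issued = int(time.time() if now is None else now)
    nonce = f"{issued}-{secrets.token_hex(8)}"
    return nonce, {logical: field_name(nonce, logical) for logical in FIELDS}


def decode_submission(form: dict, now=None):
    """Map obscured names back to logical ones, or return None if invalid."""
    nonce = form.get("nonce", "")
    issued, _, _ = nonce.partition("-")
    current = time.time() if now is None else now
    if not issued.isdigit() or current - int(issued) > MAX_AGE:
        return None  # missing, malformed or expired nonce
    decoded = {}
    for logical in FIELDS:
        obscured = field_name(nonce, logical)
        if obscured not in form:
            return None  # the submitter used names the server never issued
        decoded[logical] = form[obscured]
    return decoded


nonce, names = render_fields()
form = {"nonce": nonce, names["name"]: "Matt",
        names["email"]: "matt@example.org", names["message"]: "Hello"}
print(decode_submission(form))
print(decode_submission({"nonce": nonce, "name": "bot",
                         "email": "x", "message": "spam"}))
```

A bot that invents its own nonce cannot compute matching field names without the secret, and a copied form stops working once it is older than `MAX_AGE`. A bot that fetches a fresh form and reads the visible labels can still fill it in, which is the accessibility gotcha raised in the last bullet of that comment.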
I’ve also seen that exploit for CAPTCHAs and thought it was very nefarious.
I think mechanisms like hashcash are the way to go and, with a bit of tweaking, they could be made much more resilient to a “man in the middle” attack.
The trick would be to have the Javascript that must be run to solve the challenge also take into account the domain name of the blogger’s site. If the javascript was run on a spammer’s site the domain name would resolve differently and so the challenge’s response will be incorrect.
If the domain name can be spoofed more sophisticated techniques could be used like having the hashcash script request information from a secondary PHP script installed on the blogger’s site. This script just returns a hash of the referer which, once again, would be incorrect for a spammer hosting the script in order to trick it.
Interesting times.
— Mark Jun 13, 07:49 pm #