I've a website where I'm providing email encryption to users and I'm trying to figure out if there's a way to detect if a user is human or a bot. I've been digging into $_SESSION in php but it's easy to bypass, I'm also not interested in captcha, useragent or login solutions, any idea of what I need ?
There are other questions very similar to this one in SO but I couldn't find any straight answer...
Any help will be very welcome, thank you all !
This is a hard problem, and no solution I know of is going to be 100% perfect from a bot-defending and usability perspective. If your attacker is really determined to use a bot on your site, they probably will be able to. If you take things far enough to make it impractical for a computer program to access anything on your site, it's likely no human will want to either, but you can strike a good balance.
My point of view on this is partially as a web developer, but more so from the other side of things, having written numerous web crawler programs for clients all over the world. Not all bots have malicious intent, and can be used for things from automating form submissions to populating databases of doctors office addresses or analyzing stock market data. If your site is well designed from a usability standpoint, there should be no need for a bot that "makes things easier" for a user, but there are cases where there are special needs you can't plan for.
Of course there are those who do have malicious intent, which you definitely want to protect your site against as well as possible. There is virtually no site that can't be automated in some way. Most sites are not difficult at all, but here are a few ideas off the top of my head, from other answers or comments on this page, and from my experience writing (non-malicious) bots.
First I should mention that there are two different categories I would put bots into:
Usually a general-purpose bot is going to be something like a search engine's indexer, or possibly some hacker's script that looks for a form to submit, uses a dictionary attack to search for a vulnerable URL, or something like this. They can also attack "engine sites", such as Wordpress blogs. If your site is properly secured with good passwords and the like, these aren't usually going to pose much of a risk to you (unless you do use Wordpress, in which case you have to keep up with the latest versions and security updates).
Special purpose "personalized" bots are the kind I've written. A bot made specifically for your site can be made to act very much like a human user of your site, including inserting time delays between form submissions, setting cookies, and so on, so they can be hard to detect. For the most part this is the kind I'm talking about in the rest of this answer.
Captchas are probably the most common approach to making sure a user is humanoid, and generally they are difficult to automatically get around. However, if you simply require the captcha as a one-time thing when the user creates an account, for example, it's easy for a human to get past it and then give their shiny new account credentials to a bot to automate usage of the system.
I remember a few years ago reading about a pretty elaborate system to "automate" breaking captchas on a popular gaming site: a separate site was set up that loaded captchas from the gaming site, and presented them to users, where they were essentially crowd-sourced. Users on the second site would get some sort of reward for each correct captcha, and the owners of the site were able to automate tasks on the gaming site using their crowd-sourced captcha data.
Generally the use of a good captcha system will pretty well guarantee one thing: somewhere there is a human who typed the captcha text. What happens before and after that depends on how often you require captcha verification, and how determined the person making a bot is.
If you don't want to use Captchas, this type of verification is probably going to be pretty effective against all but the most determined bot-writer. While (just as with the captcha) it won't prevent an already-verified user from creating and using a bot, you can verify that a human being created the account, and if abused block that phone number/credit-card from being used to create another account.
Sites like Facebook and Craigslist have started using cell-phone verification to prevent spamming from bots. For example, in order to create apps on Facebook, you have to have a phone number on record, confirmed via text message or an automated phone call. Unless your attacker has access to a whole lot of active phone numbers, this could be an effective way to verify that a human created the account and that he only creates a limited number of accounts (one for most people).
Credit cards can also be used to confirm that a human is performing an action and limit the number of accounts a single human can create.
Analyzing your request logs will often reveal bots doing the same actions repeatedly, or sometimes using dictionary attacks to look for holes in your site's configuration. So logs will tell you after-the-fact whether a request was made by a bot or a human. This may or may not be useful to you, but if the requests were made on a cell-phone or credit-card verified account, you can lock the account associated with the offending requests to prevent further abuse.
Math problems or other questions can be answered by a quick google or wolfram alpha search, which can be automated by a bot. Some questions will be harder than others, but the big search companies are working against you here, making their engines better at understanding questions like this, and in turn making this a less viable option for verifying that a user is human.
Some sites employ a mechanism where parameters such as the coordinates of the mouse when they clicked the "submit" button are added to the form submission via javascript. These are extremely easy to fake in most cases, but if you see in your logs a whole bunch of requests using the same coordinates, it's likely they are a bot (although a smart bot could easily give different coordinates with each request).
Since most bots don't load or execute javascript, cookies set using javascript instead of a set-cookie HTTP header will make life slightly more difficult for most would-be bot makers. But not so hard as to prevent the bot from manually setting the cookie as well, once the developer figures out how to generate the same value the javascript generates.
An IP address alone isn't going to tell you if a user is a human. Some sites use IP addresses to try to detect bots though, and it's true that a simple bot might show up as a bunch of requests from the same IP. But IP addresses are cheap, and with Amazon's EC2 service or similar cloud services, you can spawn a server and use it as a proxy. Or spawn 10 or 100 and use them all as proxies.
This is so easy to manipulate in a crawler that you can't count on it to mark a bot that's trying not to be detected. It's easy to set the UserAgent to the same string one of the major browsers sends, and may even rotate between several different browsers.
The most difficult site I ever wrote a bot for consisted of frames within frames within frames....about 10 layers deep, on each page, where each frame's src was the same base controller page, but had different parameters as to which actions to perform. The order of the actions was important, so it was tough to keep straight everything that was going on, but eventually (after a week or so) my bot worked, so while this might deter some bot makers, it won't be useful against all. And will probably make your site about a gazillion times harder to maintain.
Not all bots are "bad". Most of the crawlers/bots I have made were for users who wanted to automate some process on the site, such as data entry, that was too tedious to do manually. So make tedious tasks easy! Or, provide an API for your users. Probably one of the easiest way to discourage someone from writing a bot for your site is to provide API access. If you provide an API, it's a lot less likely someone will go to the effort to create a crawler for it. And you could use API keys to control how heavily someone uses it.
For the purpose of preventing spammers, some combination of captchas and account verification through cell numbers or credit cards is probably going to be the most effective approach. Add some logging analysis to identify and disable any malicious personalized bots, and you should be in pretty good shape.
My favorite way is presenting the "user" with a picture of a cat or a dog and asking, "Is this a cat or a dog?" No human ever gets that wrong; the computer gets it right perhaps 60% of the time (so you have to run it several times). There's a project that will give you bunches of pictures of cats and dogs -- plus, all the animals are available for adoption so if the user likes the pet, he can have it.
It's a Microsoft corporate project, which puts me in a state of cognitive dissonance, as if I found out that Harry Reid likes zydeco music or that George Bush smokes pot. Oh, wait...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With