If you can identify requests that originate from a crawler's IP range, you are set. There are two common ways of verifying the IP: some search engines publish official IP lists or ranges, so you can verify the crawler by matching its IP against the published list; others can be verified with a reverse-DNS lookup on the requesting IP, followed by a forward lookup to confirm the hostname resolves back to the same address.
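As a rough sketch of the lookup approach in Python, using only the standard library (the hostname suffixes below are examples for Googlebot and Bingbot; swap in whatever crawler you actually care about):

import socket

def is_verified_crawler(ip, allowed_suffixes=(".googlebot.com", ".google.com", ".search.msn.com")):
    """Reverse-DNS the IP, then forward-resolve the hostname and confirm it maps back."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)           # e.g. 'crawl-66-249-66-1.googlebot.com'
    except socket.herror:
        return False
    if not host.endswith(allowed_suffixes):
        return False                                    # hostname doesn't belong to a known crawler domain
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup of the claimed hostname
    except socket.gaierror:
        return False
    return ip in forward_ips                            # hostname must resolve back to the same IP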
Search engines like Google, Bing, and Yahoo use crawlers to index the pages they download so that users can find them quickly and efficiently when searching. Without web crawlers, there would be nothing to tell the search engines that your website has new and fresh content.
Make Some of Your Web Pages Not Discoverable
Adding a "noindex" tag to a landing page keeps that page out of search results. Well-behaved search engine spiders will also skip paths blocked by a "Disallow" rule in robots.txt, so you can use that as well to keep bots and web crawlers away.
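For example, the noindex directive is just a standard robots meta tag in the page's <head>:
<meta name="robots" content="noindex">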
One option to reduce server load from bots, spiders, and other crawlers is to create a robots.txt file at the root of your website. This tells well-behaved crawlers which parts of your site they may and may not crawl.
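A minimal robots.txt might look like the following; the paths are placeholders, and note that not every crawler honors Crawl-delay:

# robots.txt served from the site root
User-agent: *
Disallow: /admin/
Disallow: /search
Crawl-delay: 10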
A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of known-good IP addresses/ranges based on http://www.iplists.com/, which were updated automatically as needed: if a client's user-agent string claimed it was a legitimate spider but its IP was not on the whitelist, the system performed DNS/reverse-DNS lookups to verify that the source IP address actually belonged to the claimed owner of the bot. As a failsafe, these actions were reported to the admin by email, along with links to blacklist or whitelist the address in case of an incorrect assessment.
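This is not that system, but a minimal Python sketch of the same log-scan-and-block idea, assuming a combined-format access log, an iptables-based block, and a whitelist of published crawler ranges (the log path, range, and threshold below are placeholders):

import subprocess
from collections import Counter
from ipaddress import ip_address, ip_network

WHITELIST = [ip_network("66.249.64.0/19")]    # example: a published Googlebot range
HIT_LIMIT = 600                               # max hits per scan interval before blocking

def scan_and_block(logfile="/var/log/nginx/access.log"):
    hits = Counter()
    with open(logfile) as f:
        for line in f:
            if not line.strip():
                continue
            hits[line.split()[0]] += 1        # first field of a combined-format log line is the client IP
    for ip, count in hits.items():
        if count <= HIT_LIMIT:
            continue
        try:
            addr = ip_address(ip)
        except ValueError:
            continue                          # malformed field; skip it
        if any(addr in net for net in WHITELIST):
            continue                          # never block whitelisted crawler ranges
        subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=False)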
I haven't talked to that client in 6 months or so, but, last I heard, the system was performing quite effectively.
Side point: if you're thinking about building a similar detection system based on hit-rate limiting, be sure to use at least one-minute (and preferably at least five-minute) totals. A lot of people discussing these kinds of schemes want to block anyone who tops 5-10 hits in a second. That may generate false positives on image-heavy pages (unless images are excluded from the tally), and it will generate false positives when someone like me finds an interesting site he wants to read all of, so he opens all the links in tabs to load in the background while he reads the first one.
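A minimal in-application sketch of that kind of windowed counter follows; the threshold and window length are illustrative, not tuned values:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300          # five-minute totals, per the advice above
MAX_HITS_PER_WINDOW = 300     # placeholder threshold

hits = defaultdict(deque)     # ip -> timestamps of recent requests

def record_hit(ip, now=None):
    """Return True if this IP has exceeded the window limit."""
    now = now or time.time()
    q = hits[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()           # drop hits that have aged out of the window
    return len(q) > MAX_HITS_PER_WINDOW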
See Project Honeypot - they set up bot traps on a large scale (and provide a DNSBL, http:BL, of the offending IPs).
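A visitor's IP can be checked against http:BL with a plain DNS lookup. This sketch assumes you have an http:BL access key (the key below is a placeholder) and that the response octets follow the published http:BL layout of days-since-last-activity, threat score, and visitor type:

import socket

def httpbl_lookup(ip, access_key="myaccesskey"):
    reversed_ip = ".".join(reversed(ip.split(".")))
    query = f"{access_key}.{reversed_ip}.dnsbl.httpbl.org"
    try:
        answer = socket.gethostbyname(query)        # e.g. "127.3.25.4"
    except socket.gaierror:
        return None                                 # not listed (or lookup failed)
    _, days, threat, visitor_type = (int(x) for x in answer.split("."))
    return {"days_since_last_seen": days, "threat_score": threat, "type": visitor_type}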
Use tricky URLs and HTML:
<a href="//example.com/"> = http://example.com/ on http pages.
<a href="page&#hash"> = page& + #hash
In HTML you can use plenty of tricks with comments, CDATA elements, entities, etc:
<a href="foo<!--bar-->"> (comment should not be removed)
<script>var haha = '<a href="bot">'</script>
<script>// <!-- </script> <!--><a href="bot"> <!-->
An easy solution is to create a link and make it invisible:
<a href="iamabot.script" style="display:none;">Don't click me!</a>
Of course you should expect that some people who look at the source code follow that link just to see where it leads. But you could present those users with a captcha...
Valid crawlers would, of course, also follow the link. But rather than adding rel="nofollow" to it, look for the signs of a valid crawler (like the user agent) and exempt it.
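Here is a rough Python sketch of the server side of that trap; the path, the agent list, and the return values are all illustrative, and a client claiming to be a crawler should still be verified by IP as described earlier:

KNOWN_CRAWLER_AGENTS = ("Googlebot", "Bingbot", "Slurp")   # illustrative list
flagged_ips = set()

def handle_request(path, ip, user_agent):
    if path == "/iamabot.script":
        if any(name in user_agent for name in KNOWN_CRAWLER_AGENTS):
            return "verify-by-ip"        # claimed crawler: verify via reverse DNS instead of blocking
        flagged_ips.add(ip)              # an ordinary client followed the invisible link
        return "serve-captcha"
    if ip in flagged_ips:
        return "serve-captcha"
    return "serve-page"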
One thing you didn't list that is commonly used to detect bad crawlers:
Hit speed: good web crawlers will break their hits up so they don't deluge a site with requests. Bad ones will do one of three things: hit sequential pages one after the other, hit sequential pages two or more at a time, or hit sequential pages at a fixed interval.
Also, some offline browsing programs will slurp up a number of pages; I'm not sure what kind of threshold you'd want to use before you start blocking by IP address.
This method will also catch mirroring programs like fmirror or wget.
If the bot randomizes the time interval, you could check whether the links are traversed in a sequential or depth-first manner, or whether the bot is getting through a huge amount of text (as in words to read) in too short a period of time. Some sites also limit the number of requests per hour.
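As a hedged illustration of the traversal check, this sketch flags clients that walk numeric page IDs in strictly increasing order, which a human reader rarely does; the /article/<id> URL scheme and the run length are assumptions:

from collections import defaultdict

recent_ids = defaultdict(list)    # ip -> last few numeric page IDs requested
SEQUENTIAL_RUN = 20               # how many in-order IDs before we get suspicious

def looks_sequential(ip, path):
    try:
        page_id = int(path.rstrip("/").rsplit("/", 1)[-1])   # e.g. /article/1234 -> 1234
    except ValueError:
        return False
    ids = recent_ids[ip]
    ids.append(page_id)
    del ids[:-SEQUENTIAL_RUN]     # keep only the most recent SEQUENTIAL_RUN IDs
    return len(ids) == SEQUENTIAL_RUN and all(b > a for a, b in zip(ids, ids[1:]))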
Actually, I heard an idea somewhere (I don't remember where) that if a client pulls down too much data, in terms of kilobytes, it can be presented with a captcha asking it to prove it isn't a bot. I've never seen that implemented, though.
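A bare-bones sketch of that idea follows; the threshold is arbitrary, and in practice you would expire the counters periodically:

from collections import defaultdict

bytes_served = defaultdict(int)            # ip -> total response bytes sent
CAPTCHA_AFTER_BYTES = 20 * 1024 * 1024     # e.g. 20 MB of served content

def should_challenge(ip, response_size):
    bytes_served[ip] += response_size
    return bytes_served[ip] > CAPTCHA_AFTER_BYTES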
Update on Hiding Links
As far as hiding links goes, you can put one div under another with CSS (placing it first in the draw order) and possibly setting the z-index. A bot could not ignore that without parsing all of your JavaScript to see whether it is a menu. To some extent, links inside invisible DIV elements also can't be ignored without the bot parsing all of the JavaScript.
Taking that idea to completion, uncalled JavaScript that could potentially show the hidden elements would possibly fool a subset of JavaScript-parsing bots. And it is not a lot of work to implement.
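For illustration, the overlap trick might look something like this; the class names and the trap URL are made up:

<style>
  .stack { position: relative; }
  .stack .decoy, .stack .real { position: absolute; top: 0; left: 0; }
  .stack .real { z-index: 2; }   /* the visible menu sits on top of the trap link */
</style>
<div class="stack">
  <div class="decoy"><a href="/iamabot.script">trap link, covered by the layer above</a></div>
  <div class="real"><a href="/products">Products</a></div>
</div>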