If you can identify requests that originate from a crawler's IP range, you are set. There are two common ways of verifying the IP: some search engines publish official IP lists or ranges, so you can verify the crawler by matching its IP against the published list; others can be verified with a reverse-DNS lookup on the requesting IP, followed by a forward lookup to confirm the hostname resolves back to the same address.
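As a rough sketch of the lookup approach in Python, using only the standard library (the hostname suffixes below are examples for Googlebot and Bingbot; swap in whatever crawler you actually care about):

import socket

def is_verified_crawler(ip, allowed_suffixes=(".googlebot.com", ".google.com", ".search.msn.com")):
    """Reverse-DNS the IP, then forward-resolve the hostname and confirm it maps back."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)           # e.g. 'crawl-66-249-66-1.googlebot.com'
    except socket.herror:
        return False
    if not host.endswith(allowed_suffixes):
        return False                                    # hostname doesn't belong to a known crawler domain
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup of the claimed hostname
    except socket.gaierror:
        return False
    return ip in forward_ips                            # hostname must resolve back to the same IP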
Search engines like Google, Bing, and Yahoo use crawlers to index the pages they download so that users can find them quickly and efficiently when searching. Without web crawlers, there would be nothing to tell the search engines that your website has new and fresh content.
Make Some of Your Web Pages Not Discoverable
Adding a "noindex" tag to a landing page keeps that page out of search results. Well-behaved search engine spiders will also skip paths blocked by a "Disallow" rule in robots.txt, so you can use that as well to keep bots and web crawlers away.
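For example, the noindex directive is just a standard robots meta tag in the page's <head>:
<meta name="robots" content="noindex">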
One option to reduce server load from bots, spiders, and other crawlers is to create a robots.txt file at the root of your website. This tells well-behaved crawlers which parts of your site they may and may not crawl.
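A minimal robots.txt might look like the following; the paths are placeholders, and note that not every crawler honors Crawl-delay:

# robots.txt served from the site root
User-agent: *
Disallow: /admin/
Disallow: /search
Crawl-delay: 10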
A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of known-good IP addresses/ranges based on http://www.iplists.com/, which were updated automatically as needed: if a client's user-agent string claimed it was a legitimate spider but its IP was not on the whitelist, the system performed DNS/reverse-DNS lookups to verify that the source IP address actually belonged to the claimed owner of the bot. As a failsafe, these actions were reported to the admin by email, along with links to blacklist or whitelist the address in case of an incorrect assessment.
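This is not that system, but a minimal Python sketch of the same log-scan-and-block idea, assuming a combined-format access log, an iptables-based block, and a whitelist of published crawler ranges (the log path, range, and threshold below are placeholders):

import subprocess
from collections import Counter
from ipaddress import ip_address, ip_network

WHITELIST = [ip_network("66.249.64.0/19")]    # example: a published Googlebot range
HIT_LIMIT = 600                               # max hits per scan interval before blocking

def scan_and_block(logfile="/var/log/nginx/access.log"):
    hits = Counter()
    with open(logfile) as f:
        for line in f:
            if not line.strip():
                continue
            hits[line.split()[0]] += 1        # first field of a combined-format log line is the client IP
    for ip, count in hits.items():
        if count <= HIT_LIMIT:
            continue
        try:
            addr = ip_address(ip)
        except ValueError:
            continue                          # malformed field; skip it
        if any(addr in net for net in WHITELIST):
            continue                          # never block whitelisted crawler ranges
        subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=False)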
I haven't talked to that client in 6 months or so, but, last I heard, the system was performing quite effectively.
Side point: if you're thinking about building a similar detection system based on hit-rate limiting, be sure to use at least one-minute (and preferably at least five-minute) totals. A lot of people discussing these kinds of schemes want to block anyone who tops 5-10 hits in a second. That may generate false positives on image-heavy pages (unless images are excluded from the tally), and it will generate false positives when someone like me finds an interesting site he wants to read all of, so he opens all the links in tabs to load in the background while he reads the first one.
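A minimal in-application sketch of that kind of windowed counter follows; the threshold and window length are illustrative, not tuned values:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300          # five-minute totals, per the advice above
MAX_HITS_PER_WINDOW = 300     # placeholder threshold

hits = defaultdict(deque)     # ip -> timestamps of recent requests

def record_hit(ip, now=None):
    """Return True if this IP has exceeded the window limit."""
    now = now or time.time()
    q = hits[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()           # drop hits that have aged out of the window
    return len(q) > MAX_HITS_PER_WINDOW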
See Project Honeypot - they set up bot traps on a large scale (and provide a DNSBL, http:BL, of the offending IPs).
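A visitor's IP can be checked against http:BL with a plain DNS lookup. This sketch assumes you have an http:BL access key (the key below is a placeholder) and that the response octets follow the published http:BL layout of days-since-last-activity, threat score, and visitor type:

import socket

def httpbl_lookup(ip, access_key="myaccesskey"):
    reversed_ip = ".".join(reversed(ip.split(".")))
    query = f"{access_key}.{reversed_ip}.dnsbl.httpbl.org"
    try:
        answer = socket.gethostbyname(query)        # e.g. "127.3.25.4"
    except socket.gaierror:
        return None                                 # not listed (or lookup failed)
    _, days, threat, visitor_type = (int(x) for x in answer.split("."))
    return {"days_since_last_seen": days, "threat_score": threat, "type": visitor_type}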
Use tricky URLs and HTML:
<a href="//example.com/"> = http://example.com/ on http pages.
<a href="page&#hash"> = page& + #hash
In HTML you can use plenty of tricks with comments, CDATA elements, entities, etc:
<a href="foo<!--bar-->"> (comment should not be removed)
<script>var haha = '<a href="bot">'</script>
<script>// <!-- </script> <!--><a href="bot"> <!-->
An easy solution is to create a link and make it invisible:
<a href="iamabot.script" style="display:none;">Don't click me!</a>
Of course you should expect that some people who look at the source code follow that link just to see where it leads. But you could present those users with a captcha...
Valid crawlers would, of course, also follow the link. But rather than adding rel="nofollow" to it, look for the signs of a valid crawler (like the user agent) and exempt it.
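Here is a rough Python sketch of the server side of that trap; the path, the agent list, and the return values are all illustrative, and a client claiming to be a crawler should still be verified by IP as described earlier:

KNOWN_CRAWLER_AGENTS = ("Googlebot", "Bingbot", "Slurp")   # illustrative list
flagged_ips = set()

def handle_request(path, ip, user_agent):
    if path == "/iamabot.script":
        if any(name in user_agent for name in KNOWN_CRAWLER_AGENTS):
            return "verify-by-ip"        # claimed crawler: verify via reverse DNS instead of blocking
        flagged_ips.add(ip)              # an ordinary client followed the invisible link
        return "serve-captcha"
    if ip in flagged_ips:
        return "serve-captcha"
    return "serve-page"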
One thing you didn't list that is commonly used to detect bad crawlers:
Hit speed: good web crawlers will break their hits up so they don't deluge a site with requests. Bad ones will do one of three things: hit sequential pages one after the other, hit sequential pages two or more at a time, or hit sequential pages at a fixed interval.
Also, some offline browsing programs will slurp up a number of pages; I'm not sure what kind of threshold you'd want to use before you start blocking by IP address.
This method will also catch mirroring programs like fmirror or wget.
If the bot randomizes the time interval, you could check whether the links are traversed in a sequential or depth-first manner, or whether the bot is getting through a huge amount of text (as in words to read) in too short a period of time. Some sites also limit the number of requests per hour.
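As a hedged illustration of the traversal check, this sketch flags clients that walk numeric page IDs in strictly increasing order, which a human reader rarely does; the /article/<id> URL scheme and the run length are assumptions:

from collections import defaultdict

recent_ids = defaultdict(list)    # ip -> last few numeric page IDs requested
SEQUENTIAL_RUN = 20               # how many in-order IDs before we get suspicious

def looks_sequential(ip, path):
    try:
        page_id = int(path.rstrip("/").rsplit("/", 1)[-1])   # e.g. /article/1234 -> 1234
    except ValueError:
        return False
    ids = recent_ids[ip]
    ids.append(page_id)
    del ids[:-SEQUENTIAL_RUN]     # keep only the most recent SEQUENTIAL_RUN IDs
    return len(ids) == SEQUENTIAL_RUN and all(b > a for a, b in zip(ids, ids[1:]))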
Actually, I heard an idea somewhere (I don't remember where) that if a client pulls down too much data, in terms of kilobytes, it can be presented with a captcha asking it to prove it isn't a bot. I've never seen that implemented, though.
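A bare-bones sketch of that idea follows; the threshold is arbitrary, and in practice you would expire the counters periodically:

from collections import defaultdict

bytes_served = defaultdict(int)            # ip -> total response bytes sent
CAPTCHA_AFTER_BYTES = 20 * 1024 * 1024     # e.g. 20 MB of served content

def should_challenge(ip, response_size):
    bytes_served[ip] += response_size
    return bytes_served[ip] > CAPTCHA_AFTER_BYTES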
Update on Hiding Links
As far as hiding links goes, you can put one div under another with CSS (placing it first in the draw order) and possibly setting the z-index. A bot could not ignore that without parsing all of your JavaScript to see whether it is a menu. To some extent, links inside invisible DIV elements also can't be ignored without the bot parsing all of the JavaScript.
Taking that idea to completion, uncalled JavaScript that could potentially show the hidden elements would possibly fool a subset of JavaScript-parsing bots. And it is not a lot of work to implement.
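For illustration, the overlap trick might look something like this; the class names and the trap URL are made up:

<style>
  .stack { position: relative; }
  .stack .decoy, .stack .real { position: absolute; top: 0; left: 0; }
  .stack .real { z-index: 2; }   /* the visible menu sits on top of the trap link */
</style>
<div class="stack">
  <div class="decoy"><a href="/iamabot.script">trap link, covered by the layer above</a></div>
  <div class="real"><a href="/products">Products</a></div>
</div>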