How many results does Google allow a request to scrape?

The following PHP code works fine, but when it is used to scrape 1,000 Google results for a given keyword, it only returns 100. Does Google limit the number of results it returns per request, or is the problem elsewhere?

<?php
require_once("header.php"); // provides getContent() and get_string_between()

$data2 = getContent("http://www.google.de/search?q=auch&hl=de&num=100&gl=de&ix=nh&sourceid=chrome&ie=UTF-8");

$dom = new DOMDocument();
@$dom->loadHTML($data2); // suppress warnings from Google's non-valid HTML
$xpath = new DOMXPath($dom);

// Every organic result link inside Google's result container
$hrefs = $xpath->evaluate("//div[@id='ires']//li/h3/a/@href");
$j = 0;

foreach ($hrefs as $href) {
    $url = "http://www.google.de/" . $href->value;
    echo "<b>$j ";
    // Strip Google's redirect wrapper to recover the target URL
    echo get_string_between($url, "http://www.google.de//url?q=", "&sa=");
    echo "<br/>";
    $j++;
}
?>
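For reference, `get_string_between()` comes from header.php and is not shown in the question. A minimal sketch of such a helper, assuming it returns the substring between two markers (or an empty string when either marker is missing), could look like this:

```php
<?php
// Hypothetical reimplementation of the get_string_between() helper
// used above; the real one lives in header.php and is not shown.
function get_string_between($string, $start, $end)
{
    $from = strpos($string, $start);
    if ($from === false) {
        return "";
    }
    $from += strlen($start);

    $to = strpos($string, $end, $from);
    if ($to === false) {
        return "";
    }
    return substr($string, $from, $to - $from);
}
```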
asked Sep 05 '25 by Zeid Selimovic
2 Answers

You already accepted an answer, but in case you are still working on your project:

As others noted, Google does not like to be scraped. Their terms of service disallow it, so if you agreed to them, automated access breaks them. That said, Google itself did not ask for permission to access websites when it started out; even Bing was caught copying Google's results, and most other search engines presumably borrow from Google as well.

If you must scrape Google, keep your request rate below their detection threshold. Don't hammer them: that will only get your project blocked, and it makes Google more concerned about automated access, which makes things harder for everyone.

In my experience you can access Google at a rate of 15 to 20 requests per hour (from one IP) long-term without getting blocked. Of course, your code needs to simulate a browser and behave properly. Higher rates will get you blocked, usually first by a temporary captcha; solving the captcha sets a cookie that lets you continue. I have also seen long-term captchas, and permanent blocks of a single IP and even of large subnets. So rule #1: do not get detected, and if you are detected, stop your scraper automatically.
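The "stop as soon as you are detected" rule can be sketched as a simple loop. This is only an illustration: it assumes the `getContent()` fetch helper from the question's header.php (so the loop is guarded and only runs when that helper exists), and the captcha check is a heuristic based on the `/sorry/` block page Google typically serves to suspected bots.

```php
<?php
// Heuristic: does the returned page look like a Google captcha/block page?
function looks_like_captcha($html)
{
    return stripos($html, "/sorry/") !== false
        || stripos($html, "captcha") !== false;
}

// Sketch of a slow, self-stopping scrape loop. getContent() is the
// (assumed) fetch helper from header.php, so the loop is guarded.
if (function_exists("getContent")) {
    $keywords = ["auch", "beispiel"]; // placeholder keyword list
    foreach ($keywords as $keyword) {
        $url  = "http://www.google.de/search?q=" . urlencode($keyword) . "&hl=de&num=100";
        $html = getContent($url);

        // Rule #1: stop automatically as soon as detection is suspected.
        if (looks_like_captcha($html)) {
            fwrite(STDERR, "Captcha detected - stopping scraper.\n");
            break;
        }

        // ... parse $html with DOMDocument/DOMXPath as in the question ...

        sleep(rand(180, 240)); // randomized delay, ~15-20 requests/hour per IP
    }
}
```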

So it is a bit tricky, but if you rely on getting the data out that way, take a look at the open-source PHP project at http://scraping.compunect.com/ It can scrape multiple keywords across multiple pages and manages IP addresses so they do not get blocked. I use that code in my own projects, and it has worked so far.

If you just need to gather a small amount of data from Google and the real ranking is not important, take a look at their API instead. If ranking matters, or if you need a lot of data, you will need a Google scraper like the one linked above.
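For the API route, here is a minimal sketch using Google's Custom Search JSON API. The key and search-engine ID are placeholders you would have to create in the Google developer console, and quotas and availability may differ from when this answer was written.

```php
<?php
// Build a Custom Search JSON API request URL. $key and $cx are
// hypothetical placeholders; "start" pages through the results.
function build_search_url($key, $cx, $query, $start = 1)
{
    return "https://www.googleapis.com/customsearch/v1?" . http_build_query([
        "key"   => $key,
        "cx"    => $cx,
        "q"     => $query,
        "start" => $start,
    ]);
}

// Fetching one page of results would require real credentials:
// $page = json_decode(file_get_contents(
//     build_search_url("YOUR_API_KEY", "YOUR_CX_ID", "auch")), true);
// foreach ($page["items"] as $item) { echo $item["link"], "\n"; }
```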

By the way, PHP is quite well suited to the task, but you should run it as a local command-line script rather than through Apache.
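Running the scraper from the CLI instead of through Apache can be enforced at the top of the script, for example:

```php
<?php
// Refuse to run under a web server; long-running scrape jobs belong
// on the command line (no web-server timeouts, no public exposure).
if (php_sapi_name() !== "cli") {
    exit("Run this script from the command line, e.g.: php scraper.php\n");
}
set_time_limit(0); // allow the job to run as long as it needs
```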

answered Sep 08 '25 by John

How many results does Google allow a request to scrape?

Zero. You're allowed to scrape zero pages.

Please refer to clause 5.3 of the Google Terms of Service:

"You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)..."

You can try to evade their detection mechanisms; googling "scrape google search" turns up several suggested techniques. But this is not something Google supports.

answered Sep 08 '25 by Frank Farmer