
How to retrieve text below titles from google search using rvest

Tags: r, rvest

This is a follow-up to this question:

How to retrieve titles from google search using rvest

This time I am trying to get the text below the titles in a Google search (circled in red):

[Image: Google search results with the text below each title circled in red]

Due to my lack of knowledge of web design, I do not know how to write the xpath that extracts the text below the titles.

The answer by @AllanCameron is very useful but I do not know how to modify it:

library(rvest)
library(tidyverse)

# Search URL
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

# Read the results page and extract the titles
first_page <- read_html(url)
titles <- html_nodes(first_page, xpath = "//div/div/div/a/h3") %>%
  html_text()

Many thanks for your help!

Asked by user007 on Oct 28 '25 22:10



1 Answer

This can all be done with rvest, without Selenium. Unfortunately, Google behaves differently in different locales. In my locale, for example, there is a consent page that has to be navigated before a search request returns results.

This does not seem to be required in the OP's locale, but those of you in the UK might need to run the following code first for the rest to work:

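library(rvest)
library(tidyverse)

url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

# Reuse a single handle so that cookies persist across the requests below
google_handle <- httr::handle('https://www.google.com')

# Visit the home page first to pick up the initial cookies
httr::GET('https://www.google.com', handle = google_handle)

# Post the consent choices to Google's consent endpoint
httr::POST(paste0('https://consent.google.com/save?continue=',
                  'https://www.google.com/',
                  '&gl=GB&m=0&pc=shp&x=5&src=2',
                  '&hl=en&bl=gws_20220801-0_RC1&uxe=eomtse&',
                  'set_eom=false&set_aps=true&set_sc=true'),
           handle = google_handle)

# Run the search with the same handle; `url` now holds the response,
# which read_html() accepts directly
url <- httr::GET(url, handle = google_handle)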

For the OP and those without a Google consent page, the setup is simply:

library(rvest)
library(tidyverse)

url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

Next, we define the xpaths we are going to use to extract the title (as in the previous Q&A) and the text below the title (pertinent to this question):

title <- "//div/div/div/a/h3"
text  <- paste0(title, "/parent::a/parent::div/following-sibling::div")
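
To see how the second xpath walks from each title to the text below it, here is a minimal sketch on a hand-written HTML fragment. The markup is an assumption that only imitates the nesting these xpaths rely on; Google's real markup is more complex and changes over time. The names `html`, `page`, `title_xpath`, `text_xpath`, `titles`, and `snippets` are illustrative.

```r
library(rvest)

# Hypothetical markup: each result has a <div> holding the linked <h3> title,
# followed by a sibling <div> holding the snippet text.
html <- '
<html><body>
<div><div>
  <div>
    <div><a href="https://example.com/a"><h3>First title</h3></a></div>
    <div>First snippet</div>
  </div>
  <div>
    <div><a href="https://example.com/b"><h3>Second title</h3></a></div>
    <div>Second snippet</div>
  </div>
</div></div>
</body></html>'

page <- read_html(html)

title_xpath <- "//div/div/div/a/h3"
# From each <h3>: up to its <a>, up to the enclosing <div>,
# then across to the <div> siblings that follow it.
text_xpath <- paste0(title_xpath, "/parent::a/parent::div/following-sibling::div")

titles   <- page %>% html_nodes(xpath = title_xpath) %>% html_text()
snippets <- page %>% html_nodes(xpath = text_xpath) %>% html_text()

print(titles)
print(snippets)
```

The `parent::` steps climb out of the heading, and `following-sibling::div` then selects the block that sits next to the title's container, which is where the snippet lives in this toy layout.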

Now we can just apply these xpaths to get the correct nodes and extract the text from them:

first_page <- read_html(url)

tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
#> # A tibble: 9 x 2
#>   title                                text                                    
#>   <chr>                                <chr>                                   
#> 1 "Mario García Torres - Wikipedia"    "Mario García Torres (born 1975 in Monc~
#> 2 "Mario Torres (@mario_torres25) • I~ "Mario Torres. Oaxaca, México. Luz y co~
#> 3 "Mario Lopez Torres - A Furniture A~ "The Mario Lopez Torres boutiques are a~
#> 4 "Mario Torres - Player profile | Tr~ "Mario Torres. Unknown since: -. Mario ~
#> 5 "Mario García Torres | The Guggenhe~ "Mario García Torres was born in 1975 i~
#> 6 "Mario Torres - Founder - InfOhana ~ "Ve el perfil de Mario Torres en Linked~
#> 7 "3500+ \"Mario Torres\" profiles - ~ "View the profiles of professionals nam~
#> 8 "Mario Torres Lopez - 33 For Sale o~ "H 69 in. Dm 20.5 in. 1970s Tropical Vi~
#> 9 "Mario Lopez Torres's Woven River G~ "28 Jun 2022 · From grass harvesting to~
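
One caveat with building the table from two independent xpath queries: if a result has no text block, or more than one following sibling, the two vectors can differ in length and `tibble()` will fail. A possible alternative, sketched here on hypothetical markup rather than on a live Google page, is to select the titles first and then look up at most one sibling per title with `html_node()` (singular), which yields `NA` where nothing matches. The names `html`, `page`, `h3s`, `snips`, and `result` are illustrative.

```r
library(rvest)

# Hypothetical markup: the second result deliberately has no snippet <div>.
html <- '
<html><body>
<div><div>
  <div>
    <div><a href="#"><h3>Has snippet</h3></a></div>
    <div>Some text</div>
  </div>
  <div>
    <div><a href="#"><h3>No snippet</h3></a></div>
  </div>
</div></div>
</body></html>'

page <- read_html(html)

# Select every title node once
h3s <- html_nodes(page, xpath = "//div/div/div/a/h3")

# For each title, take the first following sibling <div> of its container.
# html_node() returns one entry per input node, missing where there is no match.
snips <- html_node(h3s, xpath = "parent::a/parent::div/following-sibling::div[1]")

result <- data.frame(
  title = html_text(h3s),
  text  = html_text(snips),
  stringsAsFactors = FALSE
)

print(result)
```

Because the lookup is relative to each title, the rows stay aligned even when a snippet is absent.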
Answered by Allan Cameron on Oct 31 '25 15:10



