Web scraping with Rvest -- Return NA if node is not found?

Tags: r, rvest
I am a bit stuck here. I would like to scrape data from a website, and extract a few things like user ratings, comments etc. I am trying to add the data to a data frame.

Below is the code I have so far:

# Load packages. plyr goes first so that it does not mask dplyr functions.

library(plyr)
library(rvest)
library(dplyr)

# Read the webpage and the number of ratings.

getGame <- function(metacritic_game) {

total_ratings<- metacritic_game %>%
  html_nodes("strong") %>%
  html_text()

total_ratings <- ifelse(length(total_ratings) == 0, NA, 
as.numeric(strsplit(total_ratings, " ") [[1]][1]))

# Get the game title and the platform.

game_title <- metacritic_game %>%
  html_nodes("h1") %>%
  html_text()

game_platform <- metacritic_game %>%
  html_nodes(".platform a") %>%
  html_text()

game_platform <- strsplit(game_platform," ")[[1]][57:58]
game_platform <- gsub("\n","", game_platform)
game_platform<- paste(game_platform[1], game_platform[2], sep = " ")

game_publisher <- metacritic_game %>%
  html_nodes(".publisher a:nth-child(1)") %>%
  html_attr("href") %>%
  strsplit("/company/")%>%
  unlist() 

game_publisher <- gsub("\\W", " ", game_publisher)
game_publisher <- strsplit(game_publisher,"\\t")[[2]][1]

release_date <- metacritic_game %>%
  html_nodes(".release_data .data") %>%
  html_text()


user_ratings <- metacritic_game %>%
  html_nodes("#main .indiv") %>%
  html_text() %>%
  as.numeric()


user_name <- metacritic_game %>%
  html_nodes(".name a") %>%
  html_text()



review_date <- metacritic_game %>%
  html_nodes("#main .date") %>%
  html_text()


user_comment <- metacritic_game %>%
  html_nodes("#main .review_section .review_body") %>%
  html_text()



record_game <- data.frame(game_title = game_title,
                      game_platform = game_platform,
                      game_publisher = game_publisher,
                      username = user_name,
                      ratings =  user_ratings,
                      date = review_date,
                      comments = user_comment)

}

metacritic_home <-read_html("https://www.metacritic.com/browse/games/score/metascore/90day/all/filtered")

game_urls <- metacritic_home %>%
  html_nodes("#main .product_title a") %>%
  html_attr("href")

get100games <- function(game_urls) {
  data <- data.frame()
  for (i in seq_along(game_urls)) {
    metacritic_game <- read_html(paste0("https://www.metacritic.com", 
game_urls[i], "/user-reviews"))
    record_game <- getGame(metacritic_game)
    data <-rbind.fill(data, record_game)
    print(i)
  }
  data
}

df100games <- get100games(game_urls)

Some of the links, though, do not have any user reviews, so rvest cannot find the nodes and I get the following error: Error in data.frame(game_title = game_title, game_platform = game_platform, : arguments imply differing number of rows: 1, 0.

I have tried to include ifelse statements such as:

username = ifelse(length(user_name) !=0 , user_name, NA),
                      ratings =  ifelse(length(user_ratings) != 0, 
user_ratings, NA),
                      date = ifelse(length(review_date) != 0, 
review_date, NA),
                      comments = ifelse(length(user_comment) != 0, 
user_comment, NA))

However, the data frame then returns only one review per game instead of all the reviews. Any thoughts on this?
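The one-review behaviour can be reproduced without any scraping. The following is a minimal sketch in base R, assuming a made-up `user_name` vector in place of the real `html_text()` output. It suggests the cause is that `ifelse()` returns a result shaped like its test, and `length(user_name) != 0` is a single TRUE or FALSE:

```r
# Made-up vector standing in for the scraped user names (an assumption,
# not real Metacritic data).
user_name <- c("ann", "bob", "cy")

# The test has length 1, so ifelse() returns a length-1 result:
# only the first user name survives.
one_value <- ifelse(length(user_name) != 0, user_name, NA)

# A plain if/else evaluates the condition once and returns the whole vector.
all_values <- if (length(user_name) != 0) user_name else NA

print(one_value)
print(all_values)
```

So the `ifelse()` version avoids the error but silently truncates every column to its first element.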

Thanks!

J.C. asked Oct 22 '25 05:10


1 Answer

You can use the function operator possibly from the purrr package:

df100games <- purrr::map(game_urls, purrr::possibly(get100games, NULL)) %>%
  purrr::compact() %>% 
  dplyr::bind_rows()

I believe this will return your desired output.
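To see what `possibly()` does in isolation, here is a minimal sketch that runs without a network connection. The `games` list and the functions `build_record`, `build_record_na` and `na_if_empty` are hypothetical stand-ins for the scraping code, not part of rvest or purrr. It also illustrates one caveat: with `possibly(..., NULL)` a game without reviews is dropped entirely, so if you want it kept as a row of NA values, padding the empty vectors with a plain `if`/`else` helper is one option.

```r
library(purrr)
library(dplyr)

# Hypothetical helper: replace a zero-length vector with a single typed NA.
na_if_empty <- function(x, na) if (length(x) == 0) na else x

# Made-up data standing in for the vectors extracted from each game page.
games <- list(
  list(title = "Game A", users = c("ann", "bob"), ratings = c(9, 7)),
  list(title = "Game B", users = character(0),    ratings = numeric(0))
)

# Hypothetical stand-in for getGame(): fails for "Game B" because the
# columns have lengths 1 and 0.
build_record <- function(game) {
  data.frame(game_title = game$title,
             username   = game$users,
             ratings    = game$ratings,
             stringsAsFactors = FALSE)
}

# Option 1: possibly() turns the error into NULL, compact() drops it,
# so "Game B" disappears from the result.
dropped <- map(games, possibly(build_record, NULL)) %>%
  compact() %>%
  bind_rows()

# Option 2: pad empty vectors so "Game B" is kept as one row of NA values.
build_record_na <- function(game) {
  data.frame(game_title = game$title,
             username   = na_if_empty(game$users, NA_character_),
             ratings    = na_if_empty(game$ratings, NA_real_),
             stringsAsFactors = FALSE)
}
kept <- map(games, build_record_na) %>% bind_rows()

print(dropped)
print(kept)
```

Which option fits depends on whether games without reviews should appear in the final data frame at all.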

dave-edison answered Oct 24 '25 19:10