Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to scrape JavaScript rendered Website by R?

Just wanna ask if there is any good approach to scrape the website below? https://list.jd.com/list.html?cat=737,794,798&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main

Basically I want to get the name and price of all products However, the price info is stored in some JQuery scripts

Is Selenium the only solution? Thought of using V8 / Jsonlite, but it seems that they are not applicable. It'd be great if you can offer some alternatives in R. (Access to exe files is blocked in my computer, I cannot use Selenium / PhantomJS]

like image 726
Terrence Avatar asked Nov 15 '25 16:11

Terrence


1 Answers

Couldn't find any robots.txt or terms/conditions that bar scraping (if someone does find that please flag in a comment so I can delete the answer):

library(rvest)
library(V8)
library(tidyverse)

pg <- read_html("https://list.jd.com/list.html?cat=737,794,798&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main")

Tagging the question with V8 was a 👍🏼 idea.

ctx <- v8()

We need to add two missing global variables, then evaluate the javascript:

paste0(
  c("var window = {}, SEARCH = {};",
    html_nodes(pg, "script")[[1]] %>%
      html_text()
  ),
  collapse = "\n"
) %>%
  ctx$eval()
## [1] "[object Object]"

Now get some data out:

ctx$get("aosList") %>%
  bind_rows(.id = "id") %>%
  tbl_df()
## # A tibble: 175 x 3
##    id      n                     v         
##    <chr>   <chr>                 <chr>     
##  1 1429810 39-45英寸             244_110017
##  2 1429810 全高清(1920×1080)   3613_77848
##  3 1429810 3级                   1200_1656 
##  4 4286570 39-45英寸             244_110017
##  5 4286570 高清(1366×768)      3613_93579
##  6 4286570 3级                   1200_1656 
##  7 4609652 55英寸                244_1486  
##  8 4609652 4k超高清(3840×2160) 3613_77847
##  9 4609652 3级                   1200_1656 
## 10 4609660 65英寸                244_58269 
## # ... with 165 more rows

And, more data:

ctx$get("attrList") %>%
  bind_rows(.id = "id") %>%
  tbl_df()
## # A tibble: 60 x 15
##    id      IsSam    cw factoryShip isCanUseDQ isJDexpress  isJX isOverseaPurchase mcat3Id soldOS  tssp venderType xgzs 
##    <chr>   <int> <int>       <int>      <int>       <int> <int>             <int>   <int>  <int> <int> <chr>      <chr>
##  1 1429810     0     1           0          0           0     0                 0     798     -1     0 0          7.3  
##  2 4286570     0     1          NA          0           0     0                 0     798     -1     0 0          6.2  
##  3 4609652     0     1          NA          0           0     0                 0     798     -1     0 0          7.5  
##  4 4609660     0     1          NA          0           0     0                 0     798     -1     0 0          8.8  
##  5 4620979     0     1          NA          0           0     0                 0     798     -1     0 0          6.4  
##  6 4751739     0     1          NA          1           0     0                 0     798     -1     0 0          8.9  
##  7 4902977     0     1          NA         NA           0     0                 0     798     -1     0 0          9.5  
##  8 5010925     0     1          NA          1           0     0                 0     798     -1     0 0          8.6  
##  9 5102214     0     1          NA          0           0     0                 0     798     -1     0 0          7.8  
## 10 5218185     0     1          NA          1           0     0                 0     798     -1     0 0          <NA> 
## # ... with 50 more rows, and 2 more variables: isFzxp <int>, shipFareTmplId <int>
like image 57
hrbrmstr Avatar answered Nov 18 '25 07:11

hrbrmstr



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!