
base::url reads webpage but xml2::read_html gives 404 error

I've encountered a very strange problem while using rvest. Here is one example: https://politics.raisethemoney.com/cchristiansen. This page opens normally in any web browser and can be opened with base::url.

A connection with                                                              
description "https://politics.raisethemoney.com/cchristiansen"
class       "url-libcurl"                                     
mode        "r"                                               
text        "text"                                            
opened      "closed"                                          
can read    "yes"                                             
can write   "no"  

When xml2::read_html is used, it gives a 404 error.

Error in open.connection(x, "rb") : HTTP error 404.

Tested on both RStudio Cloud and a local machine (Windows 10). I'm baffled. Any ideas why this may be happening?

Kim asked Dec 22 '25 16:12


1 Answer

The server is looking for a specific header in the request, i.e.

'Accept' : ''

This header must be present for the server to respond with a 200. httr, for example, sends it by default, but I assume it is not being sent by the methods you are trying.
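In R, a minimal sketch of this would be to fetch the page with httr, setting the Accept header explicitly, and then hand the response body to xml2::read_html. The helper name read_html_with_accept and the header value "*/*" are assumptions for illustration; the exact value the server expects is not confirmed here.

```r
library(httr)
library(xml2)

# Hypothetical helper: request a page with an explicit Accept header,
# then parse the returned HTML.
# The default value "*/*" is an assumption, not something the server documents.
read_html_with_accept <- function(url, accept = "*/*") {
  # add_headers() attaches the header to this single request
  resp <- GET(url, add_headers(Accept = accept))

  # Raise an R error for 4xx/5xx responses instead of parsing an error page
  stop_for_status(resp)

  # Extract the body as text and parse it as HTML
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access):
# page <- read_html_with_accept("https://politics.raisethemoney.com/cchristiansen")
```

If the missing header really is the cause, the returned document should then work with the usual rvest functions such as html_nodes().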

Here are some quick tests I ran with Python requests (somewhat similar to rvest):

[Image: screenshot of the Python requests tests, not available]

QHarr answered Dec 24 '25 09:12


