As the title states, I'm curious if it is possible for the html_text() function from the rvest package to store an NA value if it is not able to find an attribute on a specific page.
I'm currently running a scrape over 199 pages (which works fine; tested on a few variables already).
Currently, when I search for a value that is present on only some (136) of the 199 pages, html_text() returns a vector of just 136 strings. This is not useful, because without NAs I cannot tell which pages contained the variable in question.
I see that html_attr() accepts a default argument, but html_text() does not. Any tips?
Thank you so much!
If you create a new function to wrap error handling, it'll keep the %>% pipe cleaner and easier to grok for your future self and others:
library(rvest)

# Wrapper around html_text(): returns NA if it errors or nothing matched
html_text_na <- function(x, ...) {
  txt <- try(html_text(x, ...))
  if (inherits(txt, "try-error") || length(txt) == 0) {
    return(NA)
  }
  return(txt)
}
base_url <- "http://www.saem.org/membership/services/residency-directory?RecordID=%d"
record_id <- c(1291, 1000, 1166, 1232, 999)

sapply(record_id, function(i) {
  read_html(sprintf(base_url, i)) %>%  # read_html() replaces the deprecated html()
    html_nodes("#drpict tr:nth-child(6) .text") %>%
    html_text_na() %>%
    as.numeric()
})
## [1] 8 NA 10 27 NA
Also, because sapply() runs over the record_id vector, you automatically get back a vector of the extracted values, one per record, with NA wherever a page had no match.
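To try the wrapper without hitting the live site, here is a minimal offline sketch. The `pages` strings are made-up stand-ins for the real directory pages, the selector is simplified to match them, and `vapply()` is swapped in for `sapply()` only so the result type is checked. It assumes each page yields at most one match.

```r
library(rvest)

# Hypothetical in-memory pages standing in for the real directory pages;
# the second one has no matching cell.
pages <- c(
  "<table id='drpict'><tr><td class='text'>8</td></tr></table>",
  "<p>no table here</p>",
  "<table id='drpict'><tr><td class='text'>10</td></tr></table>"
)

# Same idea as the wrapper above: NA on error or when nothing matched.
html_text_na <- function(x, ...) {
  txt <- try(html_text(x, ...), silent = TRUE)
  if (inherits(txt, "try-error") || length(txt) == 0) {
    return(NA_character_)
  }
  txt
}

# vapply() insists on exactly one numeric per page, so a page with
# several matches would fail loudly instead of changing the result shape.
values <- vapply(pages, function(p) {
  read_html(p) %>%
    html_nodes("#drpict .text") %>%
    html_text_na() %>%
    as.numeric()
}, numeric(1), USE.NAMES = FALSE)

print(values)
```

The page without a match shows up as NA in the second position, so the result stays aligned with the input vector.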
Figured it out.
I just needed to add a line of logic to my loop.
Here's a chunk of the code that worked:
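all_data <- c()  # start empty; assumed to be initialized before the loop in the full script

for (i in record_id) {
  site <- paste("http://www.saem.org/membership/services/residency-directory?RecordID=", i, sep = "")
  site <- read_html(site)  # read_html() replaces the deprecated html()

  this_data <- site %>%
    html_nodes("#drpict tr:nth-child(6) .text") %>%
    html_text() %>%
    as.numeric()

  # No match on this page: store NA so positions stay aligned with record_id
  if (length(this_data) == 0) {
    this_data <- NA
  }

  all_data <- c(all_data, this_data)
}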
Thanks anyway everybody (and @hrbrmstr)! :)
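A possible variation on the loop, assuming a reasonably recent rvest built on xml2: the singular html_node() returns a missing node when a page has no match, and html_text() turns a missing node into NA, so the explicit length check may not be needed. The `pages` strings below are hypothetical stand-ins for the real pages, and the result vector is preallocated rather than grown with c().

```r
library(rvest)

# Hypothetical in-memory pages; the second has no matching cell.
pages <- c(
  "<table id='drpict'><tr><td class='text'>27</td></tr></table>",
  "<p>nothing to see</p>"
)

# Preallocate one slot per page instead of growing the vector in the loop.
all_data <- rep(NA_real_, length(pages))

for (k in seq_along(pages)) {
  all_data[k] <- read_html(pages[k]) %>%
    html_node("#drpict .text") %>%  # singular: first match, or a missing node
    html_text() %>%                 # a missing node becomes NA_character_
    as.numeric()
}

print(all_data)
```

Note that html_node() keeps only the first match per page, so this fits only when one value per page is expected.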