I am working with the R programming language.
For the following website: https://covid.cdc.gov/covid-data-tracker/ - I am trying to get all versions of this website that are available on WayBackMachine (along with the date and time). The final result should look something like this:
date links time
1 Jan-01-2023 https://web.archive.org/web/20230101000547/https://covid.cdc.gov/covid-data-tracker/ 00:05:47
2 Jan-01-2023 https://web.archive.org/web/20230101000557/https://covid.cdc.gov/covid-data-tracker/ 00:05:57
Here is what I tried so far:
Step 1: First, I inspected the html source code (within the "Elements" tab of Developer Tools) and copied/pasted it into a notepad file:

Step 2: Then, I imported this into R and parsed the resulting html for the link structure:
file <- "cdc.txt"
text <- readLines(file)
html <- paste(text, collapse = "\n")
pattern1 <- '/web/\\d+/https://covid\\.cdc\\.gov/covid-data-tracker/[^"]+'
links <- regmatches(html, gregexpr(pattern1, html))[[1]]
But this is not working:
> links
character(0)
Can someone please show me if there is an easier way to do this?
Thanks!
Note:
I am trying to learn how to do this in general (i.e. for any website on WayBackMachine - the Covid Data Tracker is just a placeholder example for this question)
I realize that there might be much more efficient ways to do this - I am open to learning about different approaches!
Archive.org provides the Wayback CDX API for looking up captures; it returns timestamps along with the original URLs in tabular form or JSON. Such queries can be made with read.table() alone, and links to specific captures can then be constructed from the timestamp and original columns plus the base URL.
read.table("https://web.archive.org/cdx/search/cdx?url=covid.cdc.gov/covid-data-tracker/&limit=5",
           col.names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
           colClasses = "character")
#> urlkey timestamp
#> 1 gov,cdc,covid)/covid-data-tracker 20200824224244
#> 2 gov,cdc,covid)/covid-data-tracker 20200825013347
#> 3 gov,cdc,covid)/covid-data-tracker 20200825024622
#> 4 gov,cdc,covid)/covid-data-tracker 20200825042657
#> 5 gov,cdc,covid)/covid-data-tracker 20200825050018
#> original mimetype statuscode
#> 1 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 2 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 3 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 4 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 5 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> digest length
#> 1 APS6SXNXBXCJU3P4N23WH4XCVDVZQYAD 5342
#> 2 XFEMFRGXIPWM4K5F6CBIYDSOFIGCUBQZ 5370
#> 3 TVQKZHRM452CFX4RIORWGSMK5PG3PAPR 5343
#> 4 XZDLPJ6EQIXEO4SUFQTFEX4S6SF7O4GT 5370
#> 5 A4J63TFU7HMZQE5KFTSLBD6EFNZ4IBZ4 5373
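If all you need is the final table from the question, the capture links and times can be assembled from that bare result with base R alone. A minimal sketch, assuming the data frame above was saved as cdx:
cdx <- read.table("https://web.archive.org/cdx/search/cdx?url=covid.cdc.gov/covid-data-tracker/&limit=5",
                  col.names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
                  colClasses = "character")
# a capture link is just base URL + "/web/" + timestamp + "/" + original
cdx$link <- paste("https://web.archive.org/web", cdx$timestamp, cdx$original, sep = "/")
# the time of day sits in digits 9-14 of the 14-digit timestamp
cdx$time <- format(as.POSIXct(cdx$timestamp, format = "%Y%m%d%H%M%S", tz = "UTC"), "%H:%M:%S")
cdx[, c("timestamp", "link", "time")]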
To make it a bit more convenient to work with, we can customize the API request with httr / httr2, for example, and pass the response through a readr / dplyr / lubridate pipeline:
library(dplyr)
library(httr2)
library(readr)

archive_links <- request("https://web.archive.org/cdx/search/cdx") %>%
  # set query parameters
  req_url_query(
    url = "covid.cdc.gov/covid-data-tracker/",
    filter = "statuscode:200", # include only successful captures where the HTTP status code was 200
    collapse = "timestamp:8",  # limit to 1 capture per day by comparing the first 8 digits of the timestamp: <20200824>224244
    limit = 10                 # limit the number of returned values
    # output = "json"          # request json output, includes column names
  ) %>%
  req_perform() %>%
  # pass the http response string to read_table() for parsing
  resp_body_string() %>%
  read_table(col_names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
             col_types = cols_only(timestamp = "c",
                                   original = "c",
                                   mimetype = "c",
                                   length = "i")) %>%
  mutate(link = paste("https://web.archive.org/web", timestamp, original, sep = "/") %>% tibble::char(shorten = "front"),
         timestamp = lubridate::ymd_hms(timestamp)) %>%
  select(timestamp, link, length)
archive_links
#> # A tibble: 10 × 3
#> timestamp link length
#> <dttm> <char> <int>
#> 1 2020-08-24 22:42:44 …4224244/https://covid.cdc.gov/covid-data-tracker/ 5342
#> 2 2020-08-25 01:33:47 …5013347/https://covid.cdc.gov/covid-data-tracker/ 5370
#> 3 2020-08-26 02:37:09 …6023709/https://covid.cdc.gov/covid-data-tracker/ 5371
#> 4 2020-08-27 01:05:48 …7010548/https://covid.cdc.gov/covid-data-tracker/ 5703
#> 5 2020-08-28 02:23:26 …8022326/https://covid.cdc.gov/covid-data-tracker/ 31177
#> 6 2020-08-29 02:01:27 …9020127/https://covid.cdc.gov/covid-data-tracker/ 31237
#> 7 2020-08-30 00:06:31 …0000631/https://covid.cdc.gov/covid-data-tracker/ 31218
#> 8 2020-08-31 00:18:29 …1001829/https://covid.cdc.gov/covid-data-tracker/ 31640
#> 9 2020-09-01 02:30:30 …1023030/https://covid.cdc.gov/covid-data-tracker/ 31257
#> 10 2020-09-02 04:08:31 …2040831/https://covid.cdc.gov/covid-data-tracker/ 31654
# first capture:
archive_links$link[1]
#> <pillar_char[1]>
#> [1] https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/
Created on 2023-07-02 with reprex v2.0.2
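To get all versions rather than just the first 10, drop limit and bound the query with the CDX API's from / to parameters instead (they take timestamp prefixes such as yyyyMMdd). A sketch of just the modified request, reusing the pipeline above:
request("https://web.archive.org/cdx/search/cdx") %>%
  req_url_query(
    url = "covid.cdc.gov/covid-data-tracker/",
    filter = "statuscode:200",
    collapse = "timestamp:8",
    from = "20230101", # first capture date to include
    to = "20231231"    # last capture date to include
  ) %>%
  req_perform() %>%
  resp_body_string()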
There are also Archive.org client libraries for R, e.g. https://github.com/liserman/archiveRetriever and https://hrbrmstr.github.io/wayback/, though the query interface of the first is a bit odd (a rough sketch below), and the second is currently not available through CRAN.
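For reference, archiveRetriever's entry point looks roughly like this; a hedged sketch based on its README (the retrieve_urls() argument names and date format are taken from that documentation, not tested here):
library(archiveRetriever)
# returns the Wayback capture urls for a page within a date range
urls <- retrieve_urls("covid.cdc.gov/covid-data-tracker/",
                      startDate = "2023-01-01",
                      endDate = "2023-01-31")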
This is really two questions. The html is generated client side rather than server side, which is why you cannot simply request the html from R to get what you need and end up copying and pasting from Developer Tools instead. You can automate that step with RSelenium. The docs are extensive, so I won't cover the details in this answer, but a minimal sketch is below.
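A minimal sketch of that automation, assuming a local Firefox install; the calendar URL and the fixed sleep are illustrative, and a proper wait would poll the DOM instead:
library(RSelenium)
# spin up a local browser session (downloads a driver on first run)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client
# load the Wayback calendar page and give its client-side JS time to render
remDr$navigate("https://web.archive.org/web/2023*/https://covid.cdc.gov/covid-data-tracker/")
Sys.sleep(5)
# grab the fully rendered html and save it for the rvest step below
writeLines(remDr$getPageSource()[[1]], "wayback.html")
remDr$close()
driver$server$stop()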
You should also use a proper parser like rvest to parse the html, rather than regular expressions. In this case, getting the output you want would look something like this:
library(rvest)

url <- "wayback.html"
page <- read_html(url)

# Find the archive links among all anchor hrefs
links <- page |>
  html_elements("a") |>
  html_attr("href") |>
  grep("/web/\\d.+/https://covid.cdc.gov/covid-data-tracker/$", x = _, value = TRUE)

# Create dates from the first 8 digits of the timestamp
dates <- as.Date(
  gsub("/web/(\\d{8}).+$", "\\1", links),
  format = "%Y%m%d"
)

# Prepend the base URL (the hrefs already start with "/web/...")
links <- paste0("https://web.archive.org", links)

dat <- data.frame(dates, links)
head(dat)
# dates links
# 1 2020-08-24 https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/
# 2 2023-06-30 https://web.archive.org/web/20230630234650/https://covid.cdc.gov/covid-data-tracker/
# 3 2023-06-29 https://web.archive.org/web/20230629011221/https://covid.cdc.gov/covid-data-tracker/
# 4 2023-01-01 https://web.archive.org/web/20230101/https://covid.cdc.gov/covid-data-tracker/
# 5 2023-01-02 https://web.archive.org/web/20230102/https://covid.cdc.gov/covid-data-tracker/
# 6 2023-01-03 https://web.archive.org/web/20230103/https://covid.cdc.gov/covid-data-tracker/
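And to recover the time column from the question, pull the timestamp back out of each link; calendar hrefs that carry only an 8-digit date (rows 4-6 above) come out as NA:
# digits between /web/ and the original URL
ts <- sub(".*/web/(\\d+)/.*", "\\1", links)
dat$time <- format(as.POSIXct(ts, format = "%Y%m%d%H%M%S", tz = "UTC"), "%H:%M:%S")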