
Extracting a date from a column and adding the year if missing in R

I am trying to extract dates from text and store them in a new column of a dataset. Dates in column A1 are entered in one of two formats, mm/dd/yy or mm/dd. I need to identify the date in column A1 and add the year when it is missing. So far I can extract the date regardless of its format; however, when I call as.Date on the new column A2, the dates in mm/dd format become <NA>. I am aware that there might not be a direct solution for this, but a workaround that generalizes to a larger data set would be great. The dates run from September 2019 to August 2020. Additionally, I am not sure why the format I pass to as.Date does not control how the date is displayed. This latter issue is less important, but the behavior surprised me. A tidyverse solution would be much appreciated.

library(tidyverse)
library(stringr)
    
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3")) 

db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+")) 
#                A1      A2
#1     review 11/18   11/18
#2   begins 12/4/19 12/4/19
#3           3/5/20  3/5/20
#4             <NA>    <NA>
#5 deadline 09/5/19 09/5/19
#6              9/3     9/3
    
db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+")) %>% 
       mutate(A2 = A2 %>% as.Date(., "%m/%d/%y"))

#                A1         A2
#1     review 11/18       <NA>
#2   begins 12/4/19 2019-12-04
#3           3/5/20 2020-03-05
#4             <NA>       <NA>
#5 deadline 09/5/19 2019-09-05
#6              9/3       <NA>
Michael Matta asked Dec 30 '25 17:12



1 Answer

Perhaps:

library(tidyverse)

db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3")) 

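# The dates run from September 2019 to August 2020, so a date without a
# year gets "/19" for months 9-12 and "/20" for months 1-8.

(db <- 
 db %>% 
  mutate(A2 = str_extract(A1, '[\\d/]+'),  # first run of digits and slashes
         # exactly one slash means the year is missing; the first number is the month
         A2 = if_else(str_count(A2, '/') == 1 & as.numeric(str_extract(A2, '\\d+')) > 8, paste0(A2, '/19'), A2),
         A2 = if_else(str_count(A2, '/') == 1 & as.numeric(str_extract(A2, '\\d+')) <= 8, paste0(A2, '/20'), A2),
         A2 = as.Date(A2, "%m/%d/%y")) )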
#>                 A1         A2
#> 1     review 11/18 2019-11-18
#> 2   begins 12/4/19 2019-12-04
#> 3           3/5/20 2020-03-05
#> 4             <NA>       <NA>
#> 5 deadline 09/5/19 2019-09-05
#> 6              9/3 2019-09-03

Created on 2021-11-21 by the reprex package (v2.0.1)
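
On the side question about display: as far as I can tell, the format argument of as.Date only describes how the input string should be read. It does not change how the resulting Date is printed, which is always YYYY-MM-DD. To show a date in another layout, convert it back to text with format(). A minimal sketch, where d and shown are just illustrative names:

```r
# The format passed to as.Date() tells R how to parse the input string
d <- as.Date("12/4/19", format = "%m/%d/%y")
d
#> [1] "2019-12-04"

# format() controls the display, but returns a character string, not a Date
shown <- format(d, "%m/%d/%y")
shown
#> [1] "12/04/19"
```

Since format() returns character, it is probably best to keep A2 as a Date for sorting and arithmetic and only format it when printing or exporting.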

jpdugo17 answered Jan 02 '26 09:01



