
Parse date column with mix of date formats

Tags: date, r, lubridate

I have to combine several spreadsheets for my project and for some reason they all decided to enter the dates differently. Below is a sample of my data with all of the different date formats I got after reading in and combining all of the spreadsheets.

date
13-MAR-18
2018-08-05
43423
11-Mar-2019
10/16/2018

I'm trying to standardize everything and turn it into a yyyy-mm-dd format. This is my attempt below.

library(lubridate)
parse_date_time(x = df$date,
                orders = c("Y m d", "d m y", "d B Y", "m/d/y"),
                locale = "eng")

The problem is that the moment I reach a row with a different format, it stops working and just gives me NA's for the rest of the rows. How do I fix this?

Expected output

date        new_date
13-MAR-18   2018-03-13
2018-08-05  2018-08-05
43423       2018-11-19
11-Mar-2019 2019-03-11
10/16/2018  2018-10-16

Emma N asked Dec 14 '25 22:12


2 Answers

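Update: Row 2 of the output in my first answer was wrong ("11-Mar-2019" came back as 2019-11-20). As I mentioned, one issue is the order of the formats passed to the orders argument of parse_date_time. The version below uses parse_date from the parsedate package instead and handles the integer dates separately.

Now it should work: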

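library(dplyr)
library(stringr)
library(parsedate)

df %>%
  # x: let parse_date() guess the format of each string, then drop the time part
  mutate(x = as.Date(parse_date(date)),
         # y: purely numeric entries are treated as serial day counts from 1899-12-30
         y = as.integer(str_extract(date, '^\\d+$')),
         y = as.Date(y, origin = "1899-12-30"),
         # prefer the serial conversion; fall back to the parsed date otherwise
         y = coalesce(y, x)) %>%
  select(date, new_date = y)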

         date   new_date
1   13-MAR-18 2018-03-13
2  2018-08-05 2018-08-05
3       43423 2018-11-19
4 11-Mar-2019 2019-03-11
5  10/16/2018 2018-10-16

First answer: In my view there are three challenges.

First, it is difficult to choose the right order for the formats in the orders argument of parse_date_time. I think sorting the rows with arrange first gives us some control.

Second, the integer-style dates must be handled separately.

Finally, it is not clear which part of "2018-08-05" is the month: it could be 8 or 5.
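
The second and third points can be checked in isolation with base R. This is only a small sketch: it assumes the integer values are Excel-style serial numbers (days counted from 1899-12-30, the same origin used in the code in this answer), and the variable names are made up for illustration.

```r
# Point 2: a serial number is a day count, so it needs an origin, not a format.
# Assumption: the spreadsheet used the Excel convention (origin 1899-12-30).
serial_date <- as.Date(43423, origin = "1899-12-30")
format(serial_date)   # "2018-11-19"

# Point 3: the same string gives two different dates depending on
# which field is read as the month.
ymd_reading <- as.Date("2018-08-05", format = "%Y-%m-%d")
ydm_reading <- as.Date("2018-08-05", format = "%Y-%d-%m")
format(ymd_reading)   # "2018-08-05"
format(ydm_reading)   # "2018-05-08"
```

The expected output in the question reads "2018-08-05" as year-month-day, so that is the reading to keep.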

library(lubridate)
library(dplyr)
library(stringr)  # needed for str_extract()

df %>% 
  arrange(date) %>% 
  mutate(x = parse_date_time(date, orders=c("mdy", "dmy", "ymd")),
         y = as.integer(str_extract(date, '^\\d+$')),
         y = as.Date(y, origin = "1899-12-30"),
         x = coalesce(x,y)) %>% 
  select(date, new_date=x)

         date   new_date
1  10/16/2018 2018-10-16
2 11-Mar-2019 2019-11-20
3   13-MAR-18 2018-03-13
4  2018-08-05 2018-08-05
5       43423 2018-11-19
Warning message:
Problem while computing `x = parse_date_time(date, orders =
c("mdy", "dmy", "ymd"))`.
i  1 failed to parse. 

TarJae answered Dec 16 '25 15:12


Edit:

I've just realised that this approach converts "43423" to "2022-07-27 16:02:42", which is incorrect (2022-07-27 is today's date, i.e. the day the code was run). This is explained further in the docs: https://github.com/gaborcsardi/parsedate

For this solution to work, you would need to handle the 'integer' dates first, then run parse_date() on the remaining dates. I think @TarJae's solution is the way to go.
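
The order of operations described here (integers first, everything else afterwards) could be sketched roughly as follows. Note that this sketch swaps in base R for the second step rather than calling parse_date(), so that each string is matched to exactly one format by its shape. The function name parse_mixed_dates is hypothetical, and the sketch assumes that integers are Excel-style serial numbers with origin 1899-12-30, that two-digit years mean 20xx, and that month names are English three-letter abbreviations. It only covers the formats shown in the question.

```r
# Hypothetical helper: convert a character vector of mixed date formats to Date.
parse_mixed_dates <- function(x) {
  x <- trimws(as.character(x))
  out <- rep(as.Date(NA), length(x))

  # 1. Pure integers first. Assumption: Excel-style serial day counts.
  serial <- grepl("^[0-9]+$", x)
  out[serial] <- as.Date(as.numeric(x[serial]), origin = "1899-12-30")

  # 2. ISO dates, e.g. 2018-08-05 (read as year-month-day)
  iso <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  out[iso] <- as.Date(x[iso], format = "%Y-%m-%d")

  # 3. US-style dates, e.g. 10/16/2018 (read as month/day/year)
  us <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", x)
  out[us] <- as.Date(x[us], format = "%m/%d/%Y")

  # 4. Day-month name-year, e.g. 13-MAR-18 or 11-Mar-2019.
  #    Month names are looked up in month.abb to avoid locale-dependent parsing.
  mon <- grepl("^[0-9]{1,2}-[A-Za-z]{3}-([0-9]{2}|[0-9]{4})$", x)
  if (any(mon)) {
    parts <- strsplit(x[mon], "-", fixed = TRUE)
    day   <- vapply(parts, `[`, character(1), 1)
    month <- match(tolower(vapply(parts, `[`, character(1), 2)), tolower(month.abb))
    year  <- vapply(parts, `[`, character(1), 3)
    # Assumption: a two-digit year belongs to the 2000s.
    year  <- ifelse(nchar(year) == 2, paste0("20", year), year)
    out[mon] <- as.Date(paste(year, month, day, sep = "-"), format = "%Y-%m-%d")
  }

  out  # anything that matched no pattern stays NA
}

dates <- c("13-MAR-18", "2018-08-05", "43423", "11-Mar-2019", "10/16/2018")
new_date <- parse_mixed_dates(dates)
```

With the sample data from the question, new_date should match the expected output column. Entries in any other format are left as NA rather than guessed, which makes unexpected formats easy to spot.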

Original answer:

Here is a potential solution using the parsedate package:

#install.packages("parsedate")
library(parsedate)

df <- read.table(text = "date
13-MAR-18
2018-08-05
43423
11-Mar-2019
10/16/2018", header = TRUE)

df$date_parsed <- parse_date(df$date)
df
#>          date         date_parsed
#> 1   13-MAR-18 2018-03-13 00:00:00
#> 2  2018-08-05 2018-08-05 00:00:00
#> 3       43423 2022-07-27 16:02:42
#> 4 11-Mar-2019 2019-03-11 00:00:00
#> 5  10/16/2018 2018-10-16 00:00:00

Created on 2022-07-27 by the reprex package (v2.0.1)

jared_mamrot answered Dec 16 '25 16:12


