Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conversion with as.date() of csv file false

Tags:

date

r

csv

I have a 1gb csv file with Dates and according values. Now is the Dates are in "undefined Format" - so they are diplayed as numbers in Excel like this:

DATE     FXVol.DKK.EUR,0.75,4  
38719        0.21825                

I cannot open the csv file and change it to the date format I like since I would lose data in this way.

If I now import the data to R and convert the Dates:

as.Date( workingfilereturns[,1], format = "%Y-%m-%d")

It always yields dates that are 70 years + so 2076 instead of 2006. I really have no idea what goes wrong or how to fix this issue.

like image 622
Methamortix Avatar asked Jan 25 '26 06:01

Methamortix


1 Answers

(Note: I have added a note about some quirks in R when dealing with Excel data. You may want to skip directly to that at the bottom; what follows first is the original answer.)

Going by your sample data, 38719 appears to be the number of days which have elapsed since January 1, 1900. So you can just add this number of days to January 1, 1900 to arrive at the correct Date object which you want:

as.Date("1900-01-01") + workingfilereturns[,1]

or

as.Date("1900-01-01") + workingfilereturns$DATE

Example:

> as.Date("1900-01-01") + 38719
[1] "2006-01-04"

Update:

As @Roland correctly pointed out, you could also use as.Date.numeric while specifying an origin of January 1, 1900:

> as.Date.numeric(38719, origin="1900-01-01")
[1] "2006-01-04"

Bug warning:

As the asker @Methamortix pointed out, my solution, namely using January 1, 1900, as the origin, yields a date which is two days too late in R. There are two reasons for this:

  • In R, the origin is indexed with 0, meaning that as.Date.numeric(0, origin="1900-01-01") is January 1, 1900, in R, but Excel starts counting at 1, meaning that formatting the number 1 in Excel as a Date yields January 1, 1900. This explains why R is one day ahead of Excel.
  • (Hold your breath) It appears that Excel has a bug in the year 1900, specifically Excel thinks that February 29, 1900 actually happened, even though 1900 was not a leap year (http://www.miniwebtool.com/leap-years-list/?start_year=1850&end_year=2020). As a result, when dealing with dates greater than February 28, 1900, R is a second day ahead of Excel.


As evidence of this, consider the following code:

> as.Date.numeric(57, origin="1900-01-01")
[1] "1900-02-27"
> as.Date.numeric(58, origin="1900-01-01")
[1] "1900-02-28"
> as.Date.numeric(59, origin="1900-01-01")
[1] "1900-03-01"

In other words, R's as.Date() correctly skipped over February 29th. But type the number 60 into a cell in Excel, format as date, and it will come back as February 29, 1900. My guess is that this has been reported somewhere, possibly on Stack Overflow or elsewhere, but let this serve as another reference point.

So, going back to the original question, the origin needs to be offset by 2 days when dealing with Excel dates in R, where the date is greater than February 28, 1900 (which is the case of the original problem). So he should use his date data frame in the following way:

as.Date.numeric(workingfilereturns$DATE - 2, origin="1900-01-01")

where the date column has been rolled back by two days to sync up with the values in Excel.

like image 130
Tim Biegeleisen Avatar answered Jan 26 '26 21:01

Tim Biegeleisen



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!