I have a large data set
dim(dt)
[1] 422096 162
where dt is a data.table keyed by tic. I am trying to compute, for each group, a measure of how many entries are missing. The groups are time series, and dt contains a date column, which is an R Date, and a book_lev column, my variable of interest.
This is my code so far:
sumdt <- dt[ ,list(min.date=min(date), max.date=max(date)), by="tic"]
dt <- dt[sumdt]
sublengths <- dt[,list(tslen=length(date)),by=tic, mult="last"]
bt2 <- dt[sublengths, mult="first"]
bt2[, max.year:=extractyear(max.date)]
bt2[, min.year:=extractyear(min.date)]
bt2[, data.fullness:=tslen/(max.year - min.year + 1)]
dt <- dt[bt2]
My idea is to create a data.fullness value that equals 1 when there are no holes in the time series. I realize that I may have some NAs in my book_lev column, so I would like to restrict the measure further to observations where book_lev is not missing. Also, I am new to data.table in general, and I would like to know whether there are better ways to write what I have just written.
A small sample of the data, which you can load using R's load command, is available here: http://econsteve.com/r/dt_sample.Robj
(First, a caveat. I'm not sure I correctly understood what you want your data.fullness variable to summarize. Based on the dataset you've linked to, I'm taking it to be the proportion of years with some data, in the interval from the first measured year to the last measured year.)
Here is the approach I'd take to the problem as I do understand it:
## FIRST, DEFINE A COUPLE OF FUNCTIONS
extractYear <- function(X) {
  as.numeric(format(as.Date(X, format="%m/%d/%Y"), "%Y"))
}
calcFullness <- function(YRS) {
  length(unique(YRS))/(diff(range(YRS))+1)
}
## THEN SET TO WORK ON YOUR DATA.TABLE
setkey(dt, tic)  # setkey() replaces the deprecated key(dt) <- "tic" form
dt[, year:=extractYear(datadate)]
# Extract summaries for each level of tic
ticSumm <-
  dt[, list(min.year = min(year),
            max.year = max(year),
            data.fullness = calcFullness(year)), by=tic]
ticSumm
#       tic min.year max.year data.fullness
# [1,] AMZN     1995     2010             1
# [2,]   GM     1950     2010             1
# [3,]  XOM     1950     2010             1
# Merge summary back into dt
dt <- dt[ticSumm]
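To also account for NA values in book_lev, one option is to count only the years that have a non-missing book_lev. What follows is a minimal sketch on a made-up toy table, not a definitive implementation: the tickers, dates and values are invented for illustration, the helper name calcFullnessNonNA is hypothetical, and datadate is assumed to be stored as "%m/%d/%Y" strings as in the code above.

```r
library(data.table)

# Toy data, invented for illustration: ticker "AAA" has no 2002 row;
# ticker "BBB" has a row for every year, but book_lev is NA in 2001.
toy <- data.table(
  tic      = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  datadate = c("12/31/2000", "12/31/2001", "12/31/2003",
               "12/31/2000", "12/31/2001", "12/31/2002"),
  book_lev = c(0.5, 0.6, 0.7, 0.2, NA, 0.4)
)
setkey(toy, tic)

extractYear <- function(X) {
  as.numeric(format(as.Date(X, format = "%m/%d/%Y"), "%Y"))
}

# Hypothetical helper: share of the years between the first and last
# recorded year that have at least one non-NA value.
calcFullnessNonNA <- function(YRS, VALUE) {
  observed <- unique(YRS[!is.na(VALUE)])
  length(observed) / (diff(range(YRS)) + 1)
}

toy[, year := extractYear(datadate)]
toySumm <- toy[, list(data.fullness = calcFullnessNonNA(year, book_lev)),
               by = tic]
print(toySumm)
```

On this toy table, AAA gets 0.75 (3 observed years out of the 4 spanning 2000 to 2003) and BBB gets 2/3 (the NA year is not counted). Note that the denominator here still spans every recorded year, including years where book_lev is NA; if the interval should instead run from the first to the last non-missing year, filter on !is.na(VALUE) before taking the range.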
If you have a rectangular data frame and would like to restrict it to complete observations, you can use the complete.cases function to create a logical vector that flags the fully observed rows. This assumes your data are clean and that missing values are consistently coded as R's NA.
This logical vector can be used to subset the data directly by row indexing, or via the subset function.
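As a small illustration of both approaches, here is a sketch on a toy data frame (the column names and values are invented for the example and are not taken from the linked sample):

```r
# Toy data frame, invented for illustration.
df <- data.frame(
  tic      = c("AAA", "AAA", "BBB", "BBB"),
  year     = c(2000, 2001, 2000, 2001),
  book_lev = c(0.5, NA, 0.2, 0.4),
  stringsAsFactors = FALSE
)

# TRUE for rows that have no NA in any column.
ok <- complete.cases(df)

# Two equivalent ways to keep only the fully observed rows.
byIndex  <- df[ok, ]
bySubset <- subset(df, complete.cases(df))

# To require completeness only in specific columns, pass just those columns.
okBook <- complete.cases(df[, "book_lev", drop = FALSE])
```

Here the second row is dropped because its book_lev is NA. Passing only the columns of interest to complete.cases is useful when the table is wide and NAs in unrelated columns should not disqualify a row.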
It's not clear to me from your problem description or sample code how the dt object is structured, but you may need to loop over the groups (or otherwise split the data) to get two-dimensional slices to which complete.cases can be applied.
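If dt is a data.table keyed by tic, as the question states, grouping with by may remove the need for explicit loops. The following is only a sketch under that assumption, using an invented toy table; inside a grouped call, .SD refers to the rows of the current group.

```r
library(data.table)

# Toy table, invented for illustration.
toy <- data.table(
  tic      = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  year     = c(2000, 2001, 2002, 2000, 2001),
  book_lev = c(0.5, NA, 0.7, 0.2, 0.4)
)

# Keep only the rows where the variable of interest is observed.
toyComplete <- toy[!is.na(book_lev)]

# Per-group counts of all rows and of fully observed rows.
# .SD holds the current group's rows (all columns except tic).
counts <- toy[, list(n.rows     = .N,
                     n.complete = sum(complete.cases(.SD))),
              by = tic]
print(counts)
```

Each group's slice is already two-dimensional, so complete.cases can be applied to it directly without splitting the table by hand.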