Checking for missing items in a ragged panel stored in a data.table

Question

I have a large data set

 dim(dt)
 [1] 422096    162

where dt is a data.table keyed on tic. I am trying to compute, for each group, a measure of how many entries are missing. The groups are time series; dt contains a date column, which is an R Date, and a book_lev column, my variable of interest.

This is my code so far:

sumdt <- dt[ ,list(min.date=min(date), max.date=max(date)), by="tic"]
dt <- dt[sumdt]

sublengths <- dt[,list(tslen=length(date)),by=tic, mult="last"]
bt2 <- dt[sublengths, mult="first"]
bt2[, max.year:=extractyear(max.date)]
bt2[, min.year:=extractyear(min.date)]
bt2[, data.fullness:=tslen/(max.year - min.year + 1)]

dt <- dt[bt2]

My idea is to create a data.fullness value that should equal 1 if there are no holes in the time series. I realize that I may also have some NAs in my book_lev column, so I would like to restrict the measure further to account for those. Also, I am new to data.table in general, and I would like to know whether there are better ways to write what I have just written.

A small sample of the data, which you can load using R's load command, is available here: http://econsteve.com/r/dt_sample.Robj

Josh O'Brien · Accepted Answer

(First, a caveat. I'm not sure I correctly understood what you want your data.fullness variable to summarize. Based on the dataset you've linked to, I'm taking it to be the proportion of years with some data, in the interval from the first measured year to the last measured year.)

Here is the approach I'd take to the problem as I do understand it:

## FIRST, DEFINE A COUPLE OF FUNCTIONS

extractYear <- function(X) {
    as.numeric(format(as.Date(X, format="%m/%d/%Y"), "%Y"))
}

calcFullness <- function(YRS) {
    length(unique(YRS))/(diff(range(YRS))+1)
}

## THEN SET TO WORK ON YOUR DATA.TABLE

setkey(dt, tic)   # setkey() sets the key by reference
dt[, year:=extractYear(datadate)]

# Extract summaries for each level of tic
ticSumm <- 
    dt[, list(min.year = min(year),
              max.year = max(year),
              data.fullness = calcFullness(year)), by=tic]
ticSumm
#       tic min.year max.year data.fullness
# [1,] AMZN     1995     2010             1
# [2,]   GM     1950     2010             1
# [3,]  XOM     1950     2010             1


# Merge summary back into dt
dt <- dt[ticSumm]
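
To also account for NA values in book_lev, as the question asks, one possible extension is to compute the same ratio a second time using only the years in which book_lev is observed. The following is a minimal sketch under assumptions: the toy table, its values, and the column names fullness.rows and fullness.value are invented for illustration and are not taken from the linked data set.

```r
library(data.table)

# Toy panel (hypothetical values): firm "A" has no gaps; firm "B" has no row
# for 2002 and an NA book_lev in 2003.
toy <- data.table(
  tic      = c("A", "A", "A", "B", "B", "B", "B"),
  year     = c(2000, 2001, 2002, 2000, 2001, 2003, 2004),
  book_lev = c(0.10, 0.20, 0.30, 0.40, 0.50, NA, 0.70)
)

# Same helper as above: distinct years divided by the length of the year span
calcFullness <- function(YRS) {
  length(unique(YRS)) / (diff(range(YRS)) + 1)
}

# fullness.rows counts every row; fullness.value counts only rows
# where book_lev is not NA
toySumm <- toy[, list(
  fullness.rows  = calcFullness(year),
  fullness.value = calcFullness(year[!is.na(book_lev)])
), by = tic]
```

Note that the second ratio measures its span from the first to the last year with a non-NA value, so leading or trailing NAs shorten the span rather than lower the ratio. A group in which book_lev is always NA would make range() warn about an empty input, so such groups may need to be handled separately.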

AdamO · Answer

If you have a rectangular data frame and would like to restrict to complete observations, you can create a vector of booleans indicating fully observed rows of data with the complete.cases function. This is assuming you have cleaned data and consistent formatting of missing values using R's NA.

This boolean vector can be used to subset the data directly, or via the subset function.

It's not clear to me from your problem description or sample code how the dt object is formatted, but you may need to use some combination of loops to successfully get two-dimensional slices of your data where complete.cases can be applied.
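
As a rough illustration of the idea, here is a small sketch that applies complete.cases to selected columns of a data.table. The toy table and its values are assumptions made up for this example. The last step uses data.table's by argument as one possible alternative to explicit loops for per-group counts.

```r
library(data.table)

# Toy panel (hypothetical values): one NA in book_lev for firm "A"
toy <- data.table(
  tic      = c("A", "A", "B", "B"),
  year     = c(2000, 2001, 2000, 2001),
  book_lev = c(0.10, NA, 0.40, 0.50)
)

# TRUE for rows that have no NA in the chosen columns
ok <- complete.cases(toy[, list(year, book_lev)])

# Subset directly with the logical vector
toyComplete <- toy[ok]

# Count missing book_lev values per group without an explicit loop
nMissing <- toy[, list(n.missing = sum(is.na(book_lev))), by = tic]
```

Restricting complete.cases to the columns of interest matters with wide tables such as the 162-column one in the question, where an NA in any unrelated column would otherwise drop the row.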

Tags: r, data.table

stevejb
