I would like to partition panel data and preserve the panel nature of the data:
      library(caret)
      library(mlbench)
      #example panel data where id is the persons identifier over years
      data <- read.table("http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv",
                    header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
      ## Here for instance the dependent variable is working
      inTrain <- createDataPartition(y = data$WORKING, p = .75,list = FALSE)
      # subset into training
      training <- data[ inTrain,]
      # subset into testing
      testing <- data[-inTrain,]
      # Here we see some intersections of identifiers 
      str(training$id[10:20])
      str(testing$id)
However I would like, when partitioning or sampling the data, to avoid that the same person (id) is splitted into two data sets.Is their a way to randomly sample/partition from the data an assign indivuals to the corresponding partitions rather then observations?
I tried to sample:
    mysample <- data[sample(unique(data$id), 1000,replace=FALSE),] 
However, that destroys the panel nature of the data...
I think there's a little bug in the sampling approach using sample(): It is using the id variable like a row number. Instead, the function needs to fetch all rows belonging to an ID:
nID <- length(unique(data$id))
p = 0.75
set.seed(123)
inTrainID <- sample(unique(data$id), round(nID * p), replace=FALSE)
training <- data[data$id %in% inTrainID, ] 
testing <- data[!data$id %in% inTrainID, ] 
head(training[, 1:5], 10)
#    id FEMALE YEAR AGE   HANDDUM
# 1   1      0 1984  54 0.0000000
# 2   1      0 1985  55 0.0000000
# 3   1      0 1986  56 0.0000000
# 8   3      1 1984  58 0.1687193
# 9   3      1 1986  60 1.0000000
# 10  3      1 1987  61 0.0000000
# 11  3      1 1988  62 1.0000000
# 12  4      1 1985  29 0.0000000
# 13  5      0 1987  27 1.0000000
# 14  5      0 1988  28 0.0000000
dim(data)
# [1] 27326    41
dim(training)
# [1] 20566    41
dim(testing)
# [1] 6760   41
20566/27326
### 75.26% were selected for training
Let's check class balances, because createDataPartition would keep the class balance for WORKING equal in all sets.
table(data$WORKING) / nrow(data)
#         0         1 
# 0.3229525 0.6770475 
#
table(training$WORKING) / nrow(training)
#         0         1 
# 0.3226685 0.6773315 
#
table(testing$WORKING) / nrow(testing)
#         0         1 
# 0.3238166 0.6761834 
### virtually equal
I thought I would point out caret's groupKFold function for anyone looking at this, which would be handy for cross validation with this class of data. From the documentation: "To split the data based on groups, groupKFold can be used:
set.seed(3527)
subjects <- sample(1:20, size = 80, replace = TRUE)
folds <- groupKFold(subjects, k = 15) 
The results in folds can be used as inputs into the index argument of the trainControl function."
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With