I have a problem that I encounter regularly, and I need a more efficient way of dealing with. I have a messy solution that is posted below.
First, I'll generate some example data that is similar to my datasets.
a <- c(1, 2, 2, 2, 3, 3)
b <- c("10/12", "10/12", "10/12", "10/13", "10/12", "10/12")
c <- c("c", "c", "pv", "c", "c", "c")
data <- matrix(NA, nrow = 6, ncol = 3)
data[,1] <- a
data[,2] <- b
data[,3] <- c
data
[,1] [,2] [,3]
[1,] 1 10/12 c
[2,] 2 10/12 c
[3,] 2 10/12 pv
[4,] 2 10/13 c
[5,] 3 10/12 c
[6,] 3 10/12 c
# [,1] is a unique identifier, [,2] is a date, and [,3] is a type of occurrance
What I need to do is generate a table that includes only one entry for each ID for each day with a column showing whether that entry corresponds to 'c' only, 'pv' only, 'c & pv', or 'multiple c'. Multiple pvs are not possible in the data
The way I have done this is using a nested for loop:
# I generate an object to post the data to
output.temp <- matrix(NA, nrow = 1, ncol = 4)
# Then I define the outer loop that subsets the data over each ID
ids <- unique(data[,1])
n.ids <- length(ids)
for(i in 1:n.ids){
temp.data <- subset(data, data[,1] == ids[i])
dates <- unique(temp.data[,2])
n.dates <- length(dates)
# Then I define the inner loop that subsets the data for each ID over each date
for(j in 1: n.dates){
date.data <- subset(temp.data, temp.data[,2] == dates[j])
# Then I apply the logic of what to write out
if(nrow(date.data) == 1){
if(date.data[,3] == 'c'){
new.row <- cbind(date.data, "c only")
output.temp <- rbind(output.temp, new.row)
}
if(date.data[,3] == 'pv'){
new.row <- cbind(date.data, "pv only")
output.temp <- rbind(output.temp, new.row)
}
}
if(nrow(date.data) > 1){
if('pv' %in% date.data[,3]){
new.row <- cbind(matrix(date.data[1,], nrow = 1), c("c & pv"))
output.temp <- rbind(output.temp, new.row)
}
else{
new.row <- cbind(matrix(date.data[1,], nrow = 1), " multiple c only")
output.temp <- rbind(output.temp, new.row)
}
}
}
}
# Finally, I drop the unnecessary row and column from the output object
output.final <- output.temp[-1,-3]
This works, but it is terribly inefficient. As my datasets become larger (approaching 1 million rows), it becomes more and more of a problem.
Since I am really new to R and have little experience with programming, any advice on an alternate strategy would be greatly appreciated.
You should be able to use the code below to get to the exact format of output you require.
dataset <- data.table(dataset)
setnames(dataset, c('id','day','occurrence'))
dataset[,list(noofc = table(occurrence)['c'], noofpv = table(occurrence)['pv']), by = c('id','day')]
data.tables are very efficient data frames and should help with your data size problem as well
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With