I am currently experimenting with data.table and looking for the 'optimal' way to do things.
What I want to do in the following example is take a character vector of column names and, for each of those columns, append ".d" for normal deltas and ".dP" for percentage deltas. (Bear in mind that the columns in that vector are only a subset of the full data.table, even though my example is limited to them.)
I want the code to be as compact and fast as possible, making the most of both R and data.table.
The solution that I have currently come up with is this:
# Percentage delta: relative change of newvalue with respect to basevalue
percentDelta <- function(newvalue, basevalue) {
  return((newvalue - basevalue) / basevalue)
}
# Normal delta: absolute difference between newvalue and basevalue
normalDelta <- function(newvalue, basevalue) {
  return(newvalue - basevalue)
}
library(data.table)
DT = data.table(price=c(2,3,4,5,6,7,8), otherMetric=c(3,4,5,1,3,4,5))
deltaColsNames = c("otherMetric")
deltaColsNewNames <- paste0(deltaColsNames,'.d')
percentColsNewNames <- paste0(deltaColsNames,'.dP')
DT[,eval(deltaColsNewNames) := lapply(DT[,eval(deltaColsNames),with=F],normalDelta,price)]
DT[,eval(percentColsNewNames) := lapply(DT[,eval(deltaColsNames),with=F],percentDelta,price)]
I am not quite sure whether this is the correct data.table call for generating multiple columns. Is using lapply with eval the way one would go about this?
EDIT: Should I avoid the use of "with=F", for example by indexing the columns like this instead?
DT[,eval(deltaColsNewNames) := lapply(DT[,which(names(DT) %in% deltaColsNames)],normalDelta,price)]
DT[,eval(percentColsNewNames) := lapply(DT[,which(names(DT) %in% deltaColsNames)],percentDelta,price)]
Issue #495 is now solved with this recent commit, so we can do this just fine:
require(data.table) # v1.9.7+
DT[, (deltaColsNewNames) := lapply(.SD, normalDelta, price), .SDcols=deltaColsNames]
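For completeness, the percentage-delta columns can be created the same way; this is just a sketch mirroring the line above, reusing the percentDelta function and percentColsNewNames vector defined in the question:
DT[, (percentColsNewNames) := lapply(.SD, percentDelta, price), .SDcols=deltaColsNames]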
To answer the question and incorporate the optimisations from the comments, the following scenarios emerged:
require(data.table) # version 1.9.5 from GitHub needed!
normalDelta <- function(newvalue, basevalue) {
  return(newvalue - basevalue)
}
DT = data.table(price=rep(c(3,4,5),each=200000000), otherMetric=sample(c(1,3,6),200000000,T))
deltaColsNames = c("otherMetric")
deltaColsNewNames <- paste0(deltaColsNames,'.d')
Scenario 1, using "eval" and "with=F":
system.time(DT[,(deltaColsNewNames) := lapply(DT[,eval(deltaColsNames),with=F],normalDelta,price)])
#   user  system elapsed
#  2.134   1.747   3.880
Scenario 2, using which(names(DT) %in% ...) to avoid strings as column indexes:
system.time(DT[,(deltaColsNewNames) := lapply(DT[,which(names(DT) %in% deltaColsNames)],normalDelta,price)])
#   user  system elapsed
#  1.652   1.105   2.756
Scenario 3, using ".SD" syntax and eval() in 1.9.5 (in 1.9.4, this was slower):
system.time(DT[,(deltaColsNewNames) := lapply(.SD[, eval(deltaColsNames),with=F], normalDelta, price)])
#   user  system elapsed
#  2.148   1.847   4.764
Scenario 4, using ".SD" syntax and which() in 1.9.5 (in 1.9.4, this was also slower):
system.time(DT[,(deltaColsNewNames) := lapply(.SD[, which(names(DT) %in% deltaColsNames)], normalDelta, price)])
#   user  system elapsed
#  1.701   1.117   2.817
Scenario 5, using mget():
system.time(DT[, (deltaColsNewNames) := lapply(mget(deltaColsNames), normalDelta, price)])
#   user  system elapsed
#  1.426   1.166   2.591
Scenario 6, combining mget() and .SD:
system.time(DT[, (deltaColsNewNames) := lapply(.SD[, mget(deltaColsNames)], normalDelta, price)])
#   user  system elapsed
#  2.149   1.788   4.974
UPDATE: After increasing the size of the dataset, scenarios 2, 4 and 5 come out clearly ahead. However, scenario 5 has a much higher memory footprint than 2 and 4; I ran into memory issues on my laptop when testing it with an even bigger dataset (see the updated results above).
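Putting it together, here is a minimal sketch (assuming data.table 1.9.5+ and the normalDelta/percentDelta functions from the question) that applies the scenario 5 (mget) pattern to both the plain and the percentage deltas in one chained call:
percentColsNewNames <- paste0(deltaColsNames, '.dP')
DT[, (deltaColsNewNames) := lapply(mget(deltaColsNames), normalDelta, price)
  ][, (percentColsNewNames) := lapply(mget(deltaColsNames), percentDelta, price)]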