I've got the following data:
> dput(bla)
structure(list(V1 = structure(c(4L, 4L, 4L, 2L), .Label = c("DDDD",
"EEEE", "NNNN", "PPPP", "ZZZZ"), class = "factor"), V2 = c(100014096L,
100014098L, 100014099L, 100014995L), V3 = c(0.742, 0.779, 0.744,
0.42), V4 = c(1.077, 1.054, 1.049, 0.984), V5 = c(0.662, 0.663,
0.671, 0.487), V6 = c(1.107, 1.14, 1.11, 0.849), V7 = c(0.456,
0.459, 0.459, 1.278)), .Names = c("V1", "V2", "V3", "V4", "V5",
"V6", "V7"), class = "data.frame", row.names = c(NA, 4L))
> bla
V1 V2 V3 V4 V5 V6 V7
1 PPPP 100014096 0.742 1.077 0.662 1.107 0.456
2 PPPP 100014098 0.779 1.054 0.663 1.140 0.459
3 PPPP 100014099 0.744 1.049 0.671 1.110 0.459
4 EEEE 100014995 0.420 0.984 0.487 0.849 1.278
I want to perform a function to achieve the following, currently I'm using aggregate:
> linem<- aggregate(bla[,3:7], list(line=bla$V1),mean, na.rm=T)
> linem
line V3 V4 V5 V6 V7
1 EEEE 0.420 0.984 0.4870000 0.849 1.278
2 PPPP 0.755 1.060 0.6653333 1.119 0.458
To improve the performance of this script I've been trying to get a grasp of data.table to do this. How can I use data.table to get the above output?
I have been trying to do it with data.table, however if there is faster methods like in this question it would be good too.
We convert the 'data.frame' to 'data.table' (setDT(bla)), group by 'V1', specify the columns in the .SDcols, loop over the columns (lapply(.SD,) and get the mean.
library(data.table)
setDT(bla)[, lapply(.SD, mean), by = V1, .SDcols= 3:ncol(bla)]
# V1 V3 V4 V5 V6 V7
#1: PPPP 0.755 1.060 0.6653333 1.119 0.458
#2: EEEE 0.420 0.984 0.4870000 0.849 1.278
A "vectorized" option without lapply but colMeans on .SD:
library(data.table)
setDT(bla)[,as.list(colMeans(.SD[ , 2:6])), V1]
# V1 V3 V4 V5 V6 V7
#1: PPPP 0.755 1.060 0.6653333 1.119 0.458
#2: EEEE 0.420 0.984 0.4870000 0.849 1.278
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With