I would like to calculate some summary statistics and perform different regressions by group within a data table, and have the results in "wide" format (i.e. one row per group with several columns). I can do it in multiple steps, but it seems like it should be possible to do all at once.
Consider this example data:
set.seed=46984
dt <- data.table(ID=c(rep('Frank',5),rep('Tony',5),rep('Ed',5)), y=rnorm(15), x=rnorm(15), z=rnorm(15),key="ID")
dt
#       ID          y          x            z
# 1:    Ed  0.2129400 -0.3024061  0.845335632
# 2:    Ed  0.4850342 -0.5159197 -0.087965415
# 3:    Ed  1.8917489  1.7803220  0.760465271
# 4:    Ed -0.4330460 -2.1720944  0.973812545
# 5:    Ed  0.7685060  0.7947470  1.279761200
# 6: Frank  0.4978475 -0.2906851  0.568101004
# 7: Frank  0.6323386 -0.5596599  1.537133025
# 8: Frank -0.8243218 -0.4354885  0.057818033
# 9: Frank  1.2402488  0.3229422  0.005995249
#10: Frank  0.2436210 -0.2651422  0.349532173
#11:  Tony  0.4179568  0.1418463  0.142380549
#12:  Tony  0.7036613  0.4402572  0.141237901
#13:  Tony -0.1978720 -0.9553784  0.480425820
#14:  Tony -1.7269375 -0.1881292  0.370583351
#15:  Tony  1.1064903  0.4375014 -0.798221750
Let's say I want to get the median by ID, perform linear regression on y ~ x by ID, and perform linear regression on y ~ x + z by ID. Here I get the median:
dt.med <- dt[,list(y.med=median(y)),by=ID]
dt.med
#      ID     y.med
#1:    Ed 0.4850342
#2: Frank 0.4978475
#3:  Tony 0.4179568
And thanks to this answer by @DWin, here I get the two individual sets of regression coefficients as columns by ID:
dt.reg.1 <- dt[,as.list(coef(lm(y ~ x))), by=ID]
dt.reg.1
#      ID (Intercept)         x
#1:    Ed  0.63057884 0.5482373
#2: Frank  0.69720351 1.3813007
#3:  Tony  0.08588421 1.0179131
dt.reg.2 <- dt[,as.list(coef(lm(y ~ x + z))), by=ID]
dt.reg.2
#      ID (Intercept)         x          z
#1:    Ed   0.8262577 0.5587170 -0.2582699
#2: Frank   0.4317538 2.7221024  1.1807442
#3:  Tony   0.1494439 0.3166547 -1.2029693
Now I have to join the three result sets, and rename the columns:
dt.ans <- dt.med[dt.reg.1][dt.reg.2]
setnames(dt.ans,c("ID","y.med","reg.1.c0","reg.1.c1","reg.2.c0","reg.2.c1","reg.2.c2"))
Finally, here is the desired output for the example:
dt.ans
#      ID     y.med   reg.1.c0  reg.1.c1  reg.2.c0  reg.2.c1   reg.2.c2
#1:    Ed 0.4850342 0.63057884 0.5482373 0.8262577 0.5587170 -0.2582699
#2: Frank 0.4978475 0.69720351 1.3813007 0.4317538 2.7221024  1.1807442
#3:  Tony 0.4179568 0.08588421 1.0179131 0.1494439 0.3166547 -1.2029693
It seems inefficient to calculate the three results, join them, and then rename the columns. Also, my actual tables are largish so I'd like to make sure I don't use too much system memory. Is it possible to do this all within "one" data.table statement? Or more generally, can this be done more efficiently?
I've tried different things. Here is one failed example that gives the median but ignores the regression coefficients:
dt[,as.list(median(y),coef(lm(y ~ x))), by=ID]
#      ID        V1
#1:    Ed 0.4850342
#2: Frank 0.4978475
#3:  Tony 0.4179568
The summary table is a visualization that summarizes statistical information about data in table form. The information is based on one data table in TIBCO Spotfire. You can, at any time, choose which measures you want to see (such as mean, median, etc.), as well as the columns on which to base these measures.
dt[,c(y.med = median(y),
      reg.1 = as.list(coef(lm(y ~ x))),
      reg.2 = as.list(coef(lm(y ~ x + z)))), by=ID]
#      ID     y.med reg.1.(Intercept)   reg.1.x reg.2.(Intercept)      reg.2.x   reg.2.z
#1:    Ed 0.7280448        0.75977555 0.1132509        0.83322290 -0.484348116 0.7655563
#2: Frank 0.6100339       -0.07830664 0.2700846        0.04720686  0.004027939 0.7168521
#3:  Tony 0.2710623       -0.78319379 0.9166601       -0.35836990  0.622822617 0.4161102
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With