I'm trying to understand the behaviour of eval in a data.table as a "frame".
With following data.table:
set.seed(1)
foo = data.table(var1=sample(1:3,1000,r=T), var2=rnorm(1000), var3=sample(letters[1:5],1000,replace = T))
I'm trying to replicate this instruction
foo[var1==1 , sum(var2) , by=var3]
using a function of eval:
eval1 = function(s) eval( parse(text=s) ,envir=sys.parent() )
As you can see, test 1 and 3 are working, but I don't understand which is the "correct" envir to set in eval for test 2:
var_i="var1"
var_j="var2"
var_by="var3"
# test 1 works
foo[eval1(var_i)==1 , sum(var2) , by=var3 ]
# test 2 doesn't work
foo[var1==1 , sum(eval1(var_j)) , by=var3]
# test 3 works
foo[var1==1 , sum(var2) , by=eval1(var_by)]
The j-exp, checks for it's variables in the environment of .SD, which stands for Subset of Data. .SD is itself a data.table that holds the columns for that group.
When you do:
foo[var1 == 1, sum(eval(parse(text=var_j))), by=var3]
directly, the j-exp gets internally optimised/replaced to sum(var2). But sum(eval1(var_j)) doesn't get optimised, and stays as it is.
Then when it gets evaluated for each group, it'll have to find var2, which doesn't exist in the parent.frame() from where the function is called, but in .SD. As an example, let's do this:
eval1 <- function(s) eval(parse(text=s), envir=parent.frame())
foo[var1 == 1, { var2 = 1L; eval1(var_j) }, by=var3]
# var3 V1
# 1: e 1
# 2: c 1
# 3: a 1
# 4: b 1
# 5: d 1
It find var2 from it's parent frame. That is, we have to point to the right environment to evaluate in, with an additional argument with value = .SD.
eval1 <- function(s, env) eval(parse(text=s), envir = env, enclos = parent.frame())
foo[var1 == 1, sum(eval1(var_j, .SD)), by=var3]
# var3 V1
# 1: e 11.178035
# 2: c -12.236446
# 3: a -8.984715
# 4: b -2.739386
# 5: d -1.159506
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With