When using .SD to apply a function to a subset of dt's columns I can't seem to find the correct way to handle the situation where I have duplicated column names... e.g.
# Make some data
set.seed(123)
dt <- data.table( matrix( sample(6,16,repl=T) , 4 ) )
setnames(dt , rep( letters[1:2] , 2 ) )
# a b a b
#1: 2 6 4 5
#2: 5 1 3 4
#3: 3 4 6 1
#4: 6 6 3 6
# Use .SDcols to multiply both column 'a' specifying them by numeric position
dt[ , lapply( .SD , `*` , 2 ) , .SDcols = which( names(dt) %in% "a" ) ]
# a a
#1: 4 4
#2: 10 10
#3: 6 6
#4: 12 12
I couldn't get it to work with when .SDcols was a character vector of column names so I tried numeric positions (which( names(dt) %in% "a" ) gives a vector [1] 1 3 ) but it also seems to just multiply the first a column only. Am I doing something wrong?
.SDcolsAdvanced. Specifies the columns of x included in .SD. May be character column names or numeric positions.
These also returned the same result as above...
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = which( names(dt) %in% "a" ) ]
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
packageVersion("data.table")
#[1] ‘1.8.11’
How about this
dt[, "a"] * 2
## a a.1
## 1 4 8
## 2 10 6
## 3 6 12
## 4 12 6
For more detailed discussion
https://chat.stackoverflow.com/transcript/message/12783493#12783493
Consistent subset rules on
data.tableswith duplicate columns. In short, if indices are directly provided, 'j', or in.SDcols, then just those columns are either returned (or deleted if you provide -.SDcolsor!j). If instead, column names are given and there are more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #5688 and #5008. Note that usingby=to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.
Basically, if you do:
dt[, lapply(.SD, `*`, 2), .SDcols=c("a", "a")]
# a a
# 1: 4 4
# 2: 10 10
# 3: 6 6
# 4: 12 12
It'll still give the unintended result, as it's hard to tell which "a" you're mentioning each time - so choosing the first always.
But if you clearly specify (as you do in your Q):
dt[, lapply(.SD, `*`, 2), .SDcols=which( names(dt) %in% "a" )]
# a a
# 1: 4 8
# 2: 10 6
# 3: 6 12
# 4: 12 6
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With