Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R replace positive values in logical columns with the column names using data.table syntax

I have a data set that contains some logical columns and would like to replace values that are 'TRUE' with the corresponding column name. I asked a similar question here and was able to identify an appropriate solution with the help of some suggestions from other S/O users. However, the solution does not use data.table syntax and copies the whole dataset instead of replacing by reference, which is time consuming.

What is the most appropriate way to do this using data.table syntax?

I tried this:

# Load library    
library(data.table)

# Create dummy data.table:
mydt <- data.table(id = c(1,2,3,4,5), 
                   ptname = c("jack", "jill", "jo", "frankie", "claire"), 
                   sex = c("m", "f", "f", "m", "f"), apple = c(T,F,F,T,T), 
                   orange = c(F,T,F,T,F), 
                   pear = c(T,T,T,T,F))

# View dummy data:
> mydt
   id  ptname sex apple orange  pear
1:  1    jack   m  TRUE  FALSE  TRUE
2:  2    jill   f FALSE   TRUE  TRUE
3:  3      jo   f FALSE  FALSE  TRUE
4:  4 frankie   m  TRUE   TRUE  TRUE
5:  5  claire   f  TRUE  FALSE FALSE

# Function to recode values in a data.table:
recode.multi <- function(datacol, oldval, newval) {
  trans <- setNames(newval, oldval)
  trans[ match(datacol, names(trans)) ]
}

# Get a list of all the logical columns in the data set:
logicalcols <- names(which(mydt[, sapply(mydt, is.logical)] == TRUE))

# Apply the function to convert 'TRUE' to the relevant column names:
mydt[, (logicalcols) := lapply(.SD, recode.multi, 
                               oldval = c(FALSE, TRUE), 
                               newval = c("FALSE", names(.SD))), .SDcols = logicalcols]

# View the result:
> mydt
   id  ptname sex apple orange  pear
1:  1    jack   m apple  FALSE apple
2:  2    jill   f FALSE  apple apple
3:  3      jo   f FALSE  FALSE apple
4:  4 frankie   m apple  apple apple
5:  5  claire   f apple  FALSE FALSE

This isn't correct as instead of iterating through each column name for the replacement values, it just recycles the first one ("apple" in this case).

Moreover, if I reverse the order of old and new values, the function ignores my character string replacement for the second value and uses the first two column names as replacements in all cases:

# Apply the function with order of old and new values reversed:
mydt[, (logicalcols) := lapply(.SD, recode.multi, 
                               oldval = c(TRUE, FALSE), 
                               newval = c(names(.SD), "FALSE")), .SDcols = logicalcols]

# View the result:
> mydt
   id  ptname sex  apple orange   pear
1:  1    jack   m  apple orange  apple
2:  2    jill   f orange  apple  apple
3:  3      jo   f orange orange  apple
4:  4 frankie   m  apple  apple  apple
5:  5  claire   f  apple orange orange

I'm sure I'm probably missing something simple but does anyone know why the function does not iterate through the column names (and how to edit it to do this)?

My expected output would be as follows:

> mydt
   id  ptname sex apple orange  pear
1:  1    jack   m apple  FALSE  pear
2:  2    jill   f FALSE orange  pear
3:  3      jo   f FALSE  FALSE  pear
4:  4 frankie   m apple orange  pear
5:  5  claire   f apple  FALSE FALSE

Alternatively any other suggestions of concise data.table syntax to achieve this would be much appreciated.

like image 413
Amy M Avatar asked Oct 19 '25 12:10

Amy M


1 Answers

We can use a melt/dcast approach

dcast(melt(mydt, id.var = c("id", "ptname", "sex"))[,
     value1 := as.character(value)][(value), value1 := variable], 
            id + ptname + sex~variable, value.var = "value1")
#   id  ptname sex apple orange  pear
#1:  1    jack   m apple  FALSE  pear
#2:  2    jill   f FALSE orange  pear
#3:  3      jo   f FALSE  FALSE  pear
#4:  4 frankie   m apple orange  pear
#5:  5  claire   f apple  FALSE FALSE

Or another option is with set which would be more efficient

nm1 <- which(unlist(mydt[, lapply(.SD, class)])=="logical")
for(j in nm1){
    i1 <- which(mydt[[j]])
    set(mydt, i=NULL, j=j, value = as.character(mydt[[j]]))
    set(mydt, i = i1, j=j, value = names(mydt)[j])
}

mydt
#   id  ptname sex apple orange  pear
#1:  1    jack   m apple  FALSE  pear
#2:  2    jill   f FALSE orange  pear
#3:  3      jo   f FALSE  FALSE  pear
#4:  4 frankie   m apple orange  pear
#5:  5  claire   f apple  FALSE FALSE

Or another option mentioned in the comments is

mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1]
mydt
#   id  ptname sex apple orange  pear
#1:  1    jack   m apple  FALSE  pear
#2:  2    jill   f FALSE orange  pear
#3:  3      jo   f FALSE  FALSE  pear
#4:  4 frankie   m apple orange  pear
#5:  5  claire   f apple  FALSE FALSE

UPDATE: Comparing options two and three (one is not possible due to the number of non-logical columns) with a data set comprising 18573 rows and 650 columns, of which 252 columns are logical runs with the following timings:

# Option 2:
  nm1 <- which(unlist(mydt[, lapply(.SD, is.logical)])) 
  system.time( 
   for(j in nm1){ 
     i1 <- which(mydt[[j]]) 
     set(mydt, i=NULL, j=j, value = as.character(mydt[[j]])) 
     set(mydt, i = i1, j=j, value = names(mydt)[j]) 
     } 
   ) 
 # user system elapsed 
 #  0.61 0.00 0.61

# Option 3:
system.time( 
  mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1] 

   ) 
#user system elapsed 
#0.65 0.00 0.66

Both are significantly faster than the original approach not using data.table syntax:

# Original approach:
logitrue <- which(mydt == TRUE, arr.ind = T)
 system.time(
   mydt[logitrue, ] <- colnames(mydt)[logitrue[,2]]
 )
  # user  system elapsed 
  # 1.22    0.03    4.22 
like image 112
akrun Avatar answered Oct 21 '25 03:10

akrun



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!