 

Filter strings across multiple columns with data.table

I have a dataset that looks something like this.

df <- tibble::tribble(
  ~name,           ~x,  ~y,              ~z,  
  "N/A",            1,   "testSmith",    -100, 
  "N A",            3,   "NOt available", -99,
  "test Smith",     NA,  "test Smith",    -98,
  "Not Available", -99, "25",             -101,
  "test Smith",    -98, "28",             -1)

I would like to create a new data.table that keeps every row in which at least one character column contains the string "test".

The final dataset should look something like this

  name           x y              z
  <chr>      <dbl> <chr>      <dbl>
1 N/A            1 testSmith   -100
2 test Smith    NA test Smith   -98
3 test Smith   -98 28            -1

I could do this column by column like this

setDT(df)[name%like%"test"|y%like%"test"]

The problem with this approach is that I have hundreds of string variables, so I would like a more compact approach. I tried the following, but none of these attempts work:

chvar <- keep(df, is.character) %>% names()  # purrr::keep; names of the character columns
setDT(df)[chvar%like%"test"]#error
setDT(df)[(chvar)%like%"test"]#error
setDT(df)[.(chvar)%like%"test"]#empty dt

Does someone know how I could do it in a quick and efficient way?

Thanks a lot for your help

Alex asked Mar 20 '26 13:03



1 Answer

In data.table you can do:

library(data.table)

cols <- c('name', 'y')
setDT(df)

df[df[, Reduce(`|`, lapply(.SD, `%like%`, "test")), .SDcols = cols]]

#         name   x          y    z
#1:        N/A   1  testSmith -100
#2: test Smith  NA test Smith  -98
#3: test Smith -98         28   -1
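
Since the question mentions hundreds of string columns, `cols` does not have to be typed by hand. The following is a minimal sketch, assuming every character column should be searched; the names `chr_cols`, `keep_row` and `result` are illustrative and not part of the original answer.

```r
library(data.table)

# Same example data as in the question, built directly as a data.table
df <- data.table(
  name = c("N/A", "N A", "test Smith", "Not Available", "test Smith"),
  x    = c(1, 3, NA, -99, -98),
  y    = c("testSmith", "NOt available", "test Smith", "25", "28"),
  z    = c(-100, -99, -98, -101, -1)
)

# Names of all character columns (assumption: all of them should be searched)
chr_cols <- names(df)[vapply(df, is.character, logical(1))]

# One logical vector per character column, combined row by row with OR
keep_row <- df[, Reduce(`|`, lapply(.SD, function(col) col %like% "test")),
               .SDcols = chr_cols]

# Keep only the rows with at least one match
result <- df[keep_row]
```

Computing the column names once keeps the filter itself identical to the hard-coded version, so it should scale to many columns without changes.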

In base R (with df still a data.frame or tibble, i.e. not yet converted with setDT):

subset(df, Reduce(`|`, lapply(df[cols], function(x) grepl('test', x))))

dplyr :

library(dplyr)
df %>% filter(Reduce(`|`, across(all_of(cols), ~grepl('test', .x))))

lapply/across returns one logical vector per column, with TRUE where 'test' is present in that column and FALSE where it is not. Combining these vectors with Reduce and | gives, for each row, TRUE if there is at least one TRUE value in that row and FALSE if all the values in the row are FALSE. We then keep only the rows that have at least one TRUE value.
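
To make the intermediate steps visible, here is a small base R sketch on a made-up two-column data frame; `d`, `hits` and `any_hit` are illustrative names, not part of the answer above.

```r
# Small standalone example with two character columns
d <- data.frame(
  a = c("test1", "foo", "bar"),
  b = c("x", "my test", "y"),
  stringsAsFactors = FALSE
)

# Step 1: one logical vector per column
hits <- lapply(d, function(col) grepl("test", col))
# hits$a is TRUE FALSE FALSE; hits$b is FALSE TRUE FALSE

# Step 2: element-wise OR across the columns
any_hit <- Reduce(`|`, hits)
# any_hit is TRUE TRUE FALSE

# Step 3: keep the rows with at least one match (rows 1 and 2 here)
d[any_hit, ]
```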

Ronak Shah answered Mar 23 '26 06:03
