Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way to add a calculated column

I have a dataframe in which I want to check some condition and need to add a new column based on result of condition.

This is my input data

InputData = data.frame(A = c("", "", "Apple"), B = c("", "", "Orange"), C = c("", "", ""), D = c(0, 1, 1))

This is my desired output

OutputData = InputData %>%
  mutate(R = case_when(A=='' & B=='' & C=='' & D==0 ~ "Yes", TRUE ~ "No"))

I have tried mutate with Case function. It is working fine but it takes longer time when I have more number of rows.

Please help me to do it in faster way.

like image 688
ParthaSarathi Avatar asked Feb 02 '26 06:02

ParthaSarathi


2 Answers

I'm surprised that your code is slow with such small data (only 100k rows). I would do it like this:

InputData$R <- "No"
InputData[InputData$A == '' & InputData$B == '' &
            InputData$C == '' & InputData$D == 0, "R"] <- "Yes"

However, I strongly recommend using logical values instead of "Yes"/"No":

InputData$S <- InputData$A == '' & InputData$B == '' &
  InputData$C == '' & InputData$D == 0
#      A      B C D   R     S
#1                0 Yes  TRUE
#2                1  No FALSE
#3 Apple Orange   1  No FALSE

If that is still too slow, package data.table can help. But it shouldn't be necessary unless the data gets actually large.

like image 197
Roland Avatar answered Feb 03 '26 20:02

Roland


A simple benchmark is following. It seems that if you just want to dichotomize a column, if_else is preferable to case_when in dplyr. If you care about the speed, change the workflow to base like @Roland's answer.

InputData = data.frame(A = sample(c('x', ''), 1e5, TRUE),
                       B = sample(c('x', ''), 1e5, TRUE),
                       C = sample(c('x', ''), 1e5, TRUE),
                       D = sample(0:1, 1e5, TRUE))

library(dplyr)

bench::mark(
  "base::ifelse" = InputData %>% mutate(R = ifelse(A == '' & B == '' & C == '' & D == 0, "Yes", "No")),
  "dplyr::case_when" = InputData %>% mutate(R = case_when(A == '' & B == '' & C == '' & D == 0 ~ "Yes", TRUE ~ "No")),
  "dplyr::if_else" = InputData %>% mutate(R = if_else(A == '' & B == '' & C == '' & D == 0, "Yes", "No")),
  "base::repalce" = InputData %>% mutate(R = "No", R = replace(R, A == '' & B == '' & C == '' & D == 0, "Yes")),
  "base::`[<-`.Roland" = local({
    InputData$R <- "No"
    InputData$R[InputData$A == '' & InputData$B == '' & InputData$C == '' & InputData$D == 0] <- "Yes"
    InputData
  }),
  iterations = 100
)

# # A tibble: 5 × 9
#   expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 base::ifelse        24.87ms  25.82ms      38.0    7.63MB     17.1    69    31      1.82s
# 2 dplyr::case_when    15.65ms  16.91ms      57.0     8.4MB     24.4    70    30      1.23s
# 3 dplyr::if_else       6.77ms   7.17ms     133.     6.87MB     39.6    77    23   580.57ms
# 4 base::repalce         5.6ms    5.9ms     166.     5.75MB     36.4    82    18    495.1ms
# 5 base::`[<-`.Roland   3.47ms   3.52ms     269.     3.84MB     33.2    89    11   331.35ms
like image 20
Darren Tsai Avatar answered Feb 03 '26 22:02

Darren Tsai



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!