Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

New column based on values ​from other columns AND respecting pre-established rules

I'm looking for an algorithm to create a new column based on values ​​from other columns AND respecting pre-established rules. Here's an example:

artificial data

df = data.frame(
  col_1 = c('No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'),
  col_2 = c('Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'),
  col_3 = c('Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown')
)

The goal is to create a new_column based on the values ​​of col_1, col_2, and col_3. For that, the rules are:

  • If the value 'Yes' is present in any of the columns, the value of the new_column will be 'Yes';
  • If the value 'Yes' is not present in any of the columns, but the value 'No' is present, then the value of the new_column will be 'No';
  • If the values ​​'Yes' and 'No' are absent, then the value of new_columns will be 'Unknown'.

I managed to operationalize this using case_when() describing all possible combinations; or ifelse sequential. But these solutions are not scalable to N variables.

Current solution:

library(dplyr)
df_1 <-
  df %>%
  mutate(
    new_column = ifelse(
      (col_1 == 'Yes' | col_2 == 'Yes' | col_3 == 'Yes'), 'Yes',
      ifelse(
        (col_1 == 'Unknown' & col_2 == 'Unknown' & col_3 == 'Unknown'), 'Unknown','No'
        )
      )
    )

I'm looking for some algorithm capable of operationalizing this faster and capable of being expanded to N variables.

After searching for StackOverflow, I couldn't find a way to my problem (I know there are several posts about creating a new column based on values ​​obtained from different columns, but none). Perhaps the search strategy was not the best. If anyone finds it, please provide the link.

I used R in the code, but the current solution works in Python using np.where. Solutions in R or Python are welcome.

like image 987
Henrique Avatar asked Oct 17 '25 10:10

Henrique


1 Answers

Try this using dplyr rowwise function

library(dplyr)

df |> rowwise() |> mutate(new_column = case_when(any(c_across() == "Yes") ~ "Yes" ,
any(c_across() == "No") ~ "No" , TRUE ~ "Unknown")) |> ungroup()
  • output
# A tibble: 10 × 4
   col_1   col_2   col_3   new_column
   <chr>   <chr>   <chr>   <chr>     
 1 No      Yes     Unknown Yes       
 2 Yes     Yes     Yes     Yes       
 3 Yes     Unknown Yes     Yes       
 4 Yes     Yes     Unknown Yes       
 5 Yes     Unknown Unknown Yes       
 6 Yes     No      No      Yes       
 7 No      Unknown No      No        
 8 No      No      Unknown No        
 9 No      Unknown Unknown No        
10 Unknown Unknown Unknown Unknown   
  • data
df <- structure(list(col_1 = c("No", "Yes", "Yes", "Yes", "Yes", "Yes", 
"No", "No", "No", "Unknown"), col_2 = c("Yes", "Yes", "Unknown", 
"Yes", "Unknown", "No", "Unknown", "No", "Unknown", "Unknown"
), col_3 = c("Unknown", "Yes", "Yes", "Unknown", "Unknown", "No", 
"No", "Unknown", "Unknown", "Unknown")), class = "data.frame", row.names = c(NA, 
-10L))
like image 186
Mohamed Desouky Avatar answered Oct 19 '25 00:10

Mohamed Desouky