Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to perform conditional mutate on unspecified columns (e.g. matching regex)?

I have a nested data.frame - df_nested, there one of column contains df:

df <- tibble(ID_Value = 1:8,
             xyz001 = c("text4", NA, NA, NA, NA, NA, NA, "text2"), 
             xyz002 = c(NA, NA, NA, "text3", "text1", NA, NA, NA),
             xyz003 = c(NA, "text1", NA, NA, "text2", NA, "text2", NA)) 

I want to find a way, how to mutate this df, on these requirements:

  1. mutate(across(matches("\\d")
  2. there are 4 cases - 4 levels by priority. text4 <- text3 <- text2 <- text1: I need to find and keep column values containing only the highest level text. e.g. if column contains text4, I want to remove text3, text2, text1 and replace them to NA. If it contains multiple highest order text, we should keep all these values (e.g. column xyz003).
  3. how to apply these conditions not specifying column name, because there can be any number in column name.
  4. if column contains all NAs, do nothing.

my attempt:

df_nested <- df_nested %>%
    mutate(df = map(data, ~.x %>%
       mutate(across(matches("\\dd"), function (x) {
                      conditions (ifelse, case_when or other)
                      ...}

Also, should we better use across(), or is vars() still a good way to do it as well? Thank you in advance.

Expected Output

df <- tibble(ID_Value = 1:8,
             xyz001 = c("text4", NA, NA, NA, NA, NA, NA, NA), 
             xyz002 = c(NA, NA, NA, "text3", NA, NA, NA, NA),
             xyz003 = c(NA, NA, NA, NA, "text2", NA, "text2", NA))
like image 739
Don el Fuego Integer Avatar asked Nov 23 '25 22:11

Don el Fuego Integer


1 Answers

  1. Use a factor type to specify the order you want.
  2. Do row- or column-wise operations across the matches.

Consider this function

max_only <- function(x, lvls) {
  fct <- droplevels(factor(x, lvls))
  `[<-`(x, as.integer(fct) != length(levels(fct)), NA_character_)
}

You can then specify any order you want

> max_only(c("apple", "banana", NA_character_), c("banana", "apple"))
[1] "apple" NA      NA     
> max_only(c("apple", "banana", NA_character_), c("apple", "banana"))
[1] NA       "banana" NA   

Case 1: column-wise operations

df %>% 
  mutate(across(matches("\\d"), max_only, c("tier1", "tier2", "tier3", "tier4")))

Output (this one looks more like your expected output)

# A tibble: 8 x 4
  ID_Value xyz001 xyz002 xyz003
     <int> <chr>  <chr>  <chr> 
1        1 tier4  NA     NA    
2        2 NA     NA     NA    
3        3 NA     NA     NA    
4        4 NA     tier3  NA    
5        5 NA     NA     tier2 
6        6 NA     NA     NA    
7        7 NA     NA     tier2 
8        8 NA     NA     NA    

Case 2: row-wise operations

df %>% 
  mutate(as.data.frame(t(apply(
    across(matches("\\d")), 1L, 
    max_only, c("tier1", "tier2", "tier3", "tier4")
  ))))

Output

# A tibble: 8 x 4
  ID_Value xyz001 xyz002 xyz003
     <int> <chr>  <chr>  <chr> 
1        1 tier4  NA     NA    
2        2 NA     NA     tier1 
3        3 NA     NA     NA    
4        4 NA     tier3  NA    
5        5 NA     NA     tier2 
6        6 NA     NA     NA    
7        7 NA     NA     tier2 
8        8 tier2  NA     NA    

Explanations

  1. [<- is almost equivalent to x[...] <- y; x. If ... is a logical vector (i.e. TRUE/FALSE), then values in x indexed by TRUE will be replaced by y. For example,

     > x <- c("a", "b" ,"c")
     > `[<-`(x, c(FALSE, TRUE, TRUE), NA_character_)
     [1] "a" NA  NA 
     > x[c(FALSE, TRUE, TRUE)] <- NA_character_; x
     [1] "a" NA  NA 
    
  2. NA_character_ is the NA value of a character type.

  3. as.integer(fct) != length(levels(fct)) returns a logical vector of the same length as fct. TRUE indexes positions where the values of fct are not the highest level, FALSE indexes the opposite, and NA indexes NAs. For example, assume that fct looks like this

     > x <- c("apple", "banana", NA)
     > fct <- droplevels(factor(x, c("apple", "banana", "pear")))
     > fct
     [1] apple  banana <NA>  
     Levels: apple banana
    

    Then, you can see that

     > as.integer(fct) != length(levels(fct))
     [1]  TRUE FALSE    NA 
    
  4. All together, it just means that we assign NA_character_s to values that are not equal to the highest level but keep NAs unchanged.

    [<-(x, as.integer(fct) != length(levels(fct)), NA_character_)
    
like image 82
ekoam Avatar answered Nov 25 '25 11:11

ekoam



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!