Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Concatenate two columns of a dataframe in R if all values between them are NA?

Tags:

dataframe

r

na

I have a data frame that looks like this:

> sample
# A tibble: 6 x 10
  Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8 Level_9 Supplier 
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <lgl>   <chr>    
1       1       2       3       4       8      NA      NA      NA NA      orioles  
2       1       2       3       4       9      13      NA      NA NA      nationals
3       1       2       3       5      10      14      16      18 NA      dodgers  
4       1       2       3       5      10      14      17      19 NA      cardinals
5       1       2       3       6      11      NA      NA      NA NA      giants   
6       1       2       3       7      12      15      NA      NA NA      padres  

What I'd like to do is concatenate the Supplier column with any Level column if all the values between them are NA. Another way I was thinking about this was that if the column to the right of the Level column is NA then to concatenate that column with the supplier column.

I was thinking a for loop but I haven't figured out how to implement the logic. The logic I was thinking is something like:

for (level in levels) {
   if is.na(level n + 1) {
     paste0(level, Supplier)
     }
   else {
    level}
   }

I could also do a bunch of mutate calls like this but it seems super repetitive and unnecessary:

sample %>%
  mutate(
    Level_5 = ifelse(
      is.na(Level_6),
      paste0(Supplier, "<br>", Level_5),
      Level_5)
  )

Here's the dput of the data:

structure(list(Level_1 = c(1, 1, 1, 1, 1, 1), Level_2 = c(2, 
2, 2, 2, 2, 2), Level_3 = c(3, 3, 3, 3, 3, 3), Level_4 = c(4, 
4, 5, 5, 6, 7), Level_5 = c(8, 9, 10, 10, 11, 12), Level_6 = c(NA, 
13, 14, 14, NA, 15), Level_7 = c(NA, NA, 16, 17, NA, NA), Level_8 = c(NA, 
NA, 18, 19, NA, NA), Level_9 = c(NA, NA, NA, NA, NA, NA), Supplier = c("orioles", 
"nationals", "dodgers", "cardinals", "giants", "padres")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))
like image 938
Ben G Avatar asked Dec 13 '22 07:12

Ben G


2 Answers

To be honest, I'm not 100% sure about your desired output. Using dplyr and tidyr:

library(tidyr)
library(dplyr)

sample %>%
  pivot_longer(cols=starts_with("Level_"), names_prefix="Level_", names_to="level") %>%
  drop_na() %>%
  group_by(Supplier) %>%
  mutate(new_val=ifelse(level==max(level), paste0(Supplier, "<br>", value), value)) %>%
  select(-value) %>%
  pivot_wider(names_from=level, names_prefix="Level_", values_from=new_val)

returns

# A tibble: 6 x 9
# Groups:   Supplier [6]
  Supplier  Level_1 Level_2 Level_3 Level_4 Level_5      Level_6         Level_7 Level_8        
  <chr>     <chr>   <chr>   <chr>   <chr>   <chr>        <chr>           <chr>   <chr>          
1 orioles   1       2       3       4       orioles<br>8 NA              NA      NA             
2 nationals 1       2       3       4       9            nationals<br>13 NA      NA             
3 dodgers   1       2       3       5       10           14              16      dodgers<br>18  
4 cardinals 1       2       3       5       10           14              17      cardinals<br>19
5 giants    1       2       3       6       giants<br>11 NA              NA      NA             
6 padres    1       2       3       7       12           padres<br>15    NA      NA  

I lost the Level_9 column since it contained only NA. You can easily add it again.

like image 125
Martin Gal Avatar answered Dec 15 '22 20:12

Martin Gal


Final Update I realized my mistake on trying to find the max value in every row and replace it with desired concatenated string. So I came up with another solution which only replaces the last non-NA value (it can also be not the max values of the row), given all values are not numeric. So here is my final solution:

library(dplyr)
library(stringr)
library(purrr)

df %>%
  pmap_dfr(., ~ {x <- c(...)[-10][!is.na(c(...)[-10])];
  ind <- which(c(...) == x[length(x)]);
  replace(c(...), ind[length(ind)], str_c(..10, x[length(x)], sep = "_"))}
  )

# A tibble: 6 x 10
  Level_1 Level_2 Level_3 Level_4 Level_5   Level_6      Level_7 Level_8      Level_9 Supplier 
  <chr>   <chr>   <chr>   <chr>   <chr>     <chr>        <chr>   <chr>        <chr>   <chr>    
1 1       2       3       4       orioles_8 NA           NA      NA           NA      orioles  
2 1       2       3       4       9         nationals_13 NA      NA           NA      nationals
3 1       2       3       5       10        14           16      dodgers_18   NA      dodgers  
4 1       2       3       5       10        14           17      cardinals_19 NA      cardinals
5 1       2       3       6       giants_11 NA           NA      NA           NA      giants   
6 1       2       3       7       12        padres_15    NA      NA           NA      padres 
like image 44
Anoushiravan R Avatar answered Dec 15 '22 20:12

Anoushiravan R