Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Join/Merge 2 data frames by different columns based on available values in R [duplicate]

Tags:

r

I'm looking to join the persons data frame with the colour_match data frame, so that the person data frame will have an additional column "favourite_colour".

    first_name <- c("Nick", "Emily", "Joe", "Kate", "Kate", "Nick", "Joe")
    last_name <- c("Smith", "Doe", "Lee", "Jackson", "Deng", "Smith", "Lee")
    age <- c(28, 30, 52, 18, 42, 28, 52)
    pet <- c("Cat", "Dog", "Rabbit","Cat", "Mouse", "Snake", "Bird")

    person_pet <- data.frame(first_name, last_name, age, pet)

person_pet

first_name <- c("Emily","Nick", "Kate", "Joe", "Kate")
last_name <- c("Doe", "Smith", NA, "Lee", NA)
age <- c(NA, NA, 42, 52, 29)
favourite_colour <- c("black", "yellow", "red", "green", "white")

colour_match <- data.frame(first_name, last_name, age, favorite_colour)

colour_match

The challenge is that I will need to join by as many columns as possible as long as the column has a value in the colour_match data frame. The available columns in the colour_match table determines the "join columns" between the two tables. For example, for "Nick Smith", join by first_name and last_name; for "Joe Lee", join by all three columns; . In the end, the result should look like below: person_favourite_colour

Appreciate any help!

like image 320
Dorisacat Avatar asked Nov 28 '25 13:11

Dorisacat


1 Answers

I would create a small helper function which checks the availability of possible join columns and then joins by all available ones. This function then gets called for each row separately and the result is concatenated again.

library(dplyr, warn.conflicts = F)
library(tidyr)

first_name <- c("Nick", "Emily", "Joe", "Kate", "Kate")
last_name <- c("Smith", "Doe", "Lee", "Jackson", "Deng")
age <- c(28, 30, 52, 18, 42)

persons <- tibble(first_name, last_name, age)

first_name <- c("Nick", "Emily", "Joe", "Kate", "Kate")
last_name <- c("Smith", NA, NA, "Jackson", NA)
age <- c(NA, NA, 52, 18, 29)
favorite_colour <- c("black", "yellow", "red", "green", "white")

colour_match <- tibble(first_name, last_name, age, favorite_colour)

## ---------------------------------------------------------------------------

# helper function
join_if_available <- function(x, y) {
  
  # define possible set of join cols
  join_cols <- c("first_name", "last_name", "age")
  
  # check which (if any) col is missing
  missing <- which(is.na(y[join_cols]))
  
  if (length(missing) > 0) {
    
    # remove missing index and then only keep the available join cols
    # plus the favorite_colour col 
    join_cols <- join_cols[-missing]
    y <- dplyr::select(y, join_cols, favorite_colour)
    
  }
  
  # perform a simple left join
  out <- dplyr::left_join(x, y, by = join_cols)
  
  return(out)
  
}

# split each data frame by row, iterate over each row separately,
# then rowbind the result
purrr::map2_dfr(.x = split(persons, 1:nrow(persons)),
                .y = split(colour_match, 1:nrow(colour_match)),
                .f = ~ join_if_available(x = .x, y = .y))

#> # A tibble: 5 × 4
#>   first_name last_name   age favorite_colour
#>   <chr>      <chr>     <dbl> <chr>          
#> 1 Nick       Smith        28 black          
#> 2 Emily      Doe          30 yellow         
#> 3 Joe        Lee          52 red            
#> 4 Kate       Jackson      18 green          
#> 5 Kate       Deng         42 <NA>

Created on 2023-09-15 by the reprex package (v2.0.1)

like image 166
henhesu Avatar answered Nov 30 '25 05:11

henhesu