Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to perform left join using multiple columns where one data frame has missingness in one column?

Suppose I have two data frames like the following:

df1 <- data.frame(
    id1  = c(1:4, 4),
    id2  = letters[1:5],
    val1 = c(0, 1, pi, exp(1), 42)
)

df2 <- data.frame(
    id1  = c(1:4, 4),
    id2  = c(NA, letters[2:5]),
    val2 = c("Lorem", "ipsum", "dolor", "sit", "amet")
)

##   df1                   df2
##   id1 id2      val1     id1  id2  val2
## 1   1   a  0.000000       1 <NA> Lorem
## 2   2   b  1.000000       2    b ipsum
## 3   3   c  3.141593       3    c dolor
## 4   4   d  2.718282       4    d   sit
## 5   4   e 42.000000       4    e  amet

This would be my desired result:

desired_result <- data.frame(
    id1  = c(1:4, 4),
    id2  = letters[1:5],
    val1 = c(0, 1, pi, exp(1), 42),
    val2 = c("Lorem", "ipsum", "dolor", "sit", "amet")
)

##   id1 id2      val1  val2
## 1   1   a  0.000000 Lorem
## 2   2   b  1.000000 ipsum
## 3   3   c  3.141593 dolor
## 4   4   d  2.718282   sit
## 5   4   e 42.000000  amet

In my desired result, I'd like to use the information in column id2, when available, to resolve join ambiguities raised by duplicate values in id1. For example, rows 4 and 5 have the same id1, but we can differentiate them by id2. So, if I try to join just on id1, I get too many observations, because I'm not utilizing this information in id2 to match records:

library(dplyr)
left_join(df1, df2, by = "id1")
##   id1 id2.x      val1 id2.y  val2
## 1   1     a  0.000000  <NA> Lorem
## 2   2     b  1.000000     b ipsum
## 3   3     c  3.141593     c dolor
## 4   4     d  2.718282     d   sit
## 5   4     d  2.718282     e  amet
## 6   4     e 42.000000     d   sit
## 7   4     e 42.000000     e  amet

However, if I try to join on both IDs, I lose the information in val2 for row 1, because df2 has NA for id2 on row 1:

left_join(df1, df2, by = c("id1", "id2"))
##  id1 id2      val1  val2
## 1   1   a  0.000000  <NA>
## 2   2   b  1.000000 ipsum
## 3   3   c  3.141593 dolor
## 4   4   d  2.718282   sit
## 5   4   e 42.000000  amet

How can I left_join() (or equivalent) to achieve my desired_result?

like image 753
duckmayr Avatar asked Dec 05 '25 20:12

duckmayr


1 Answers

An option using data.table:

library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on=.(id1, id2), mult="first", val2 := val2]
df1[is.na(val2), val2 :=
    df2[.SD, on=.(id1), mult="first", val2]]

I have taken the liberty of using the first value if there are multiple joins (i.e. the combination of id1 and id2 in df2 are not unique).

like image 153
chinsoon12 Avatar answered Dec 08 '25 12:12

chinsoon12



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!