Merge two dataframes by partial string match

Question

I am trying to merge two fairly large dataframes of different sizes based on partial string matches.

df1$code contains all 12 digit codes, while df2$code contains a mix of codes with 10-12 digits, where some of the shorter codes are substring matches to the 12 digit codes in df1$code.

Therefore, I need to merge all 12 digit matches between the two dataframes, but also those records in df2 that have 10-11 digit codes that are substring matches to the df1.

Example dataframes:

df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
              name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))

df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
              color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))

df3 (merged)

code_1         code_2         name  color
123456789012   123456789012   bob   blue
210987654321   2109876543     joe   red
567890543211   7890543211     sally green
987656789001   98765678900    john  purple
123456654321   12345665432    lucy  orange
678905432156   678905432156   alan  brown

G. Grothendieck · Accepted Answer

Try this SQL join.

library(sqldf)

sqldf("select a.code_1, b.code_2, a.name, b.color 
       from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")

giving:

        code_1       code_2  name  color
1 123456789012 123456789012   bob   blue
2 210987654321   2109876543   joe    red
3 567890543211   7890543211 sally  green
4 987656789001  98765678900  john purple
5 123456654321  12345665432  lucy orange
6 678905432156 678905432156  alan  brown

Update: Updated answer to reflect change in question so that (1) the substring can be anywhere in the target string and (2) names of code columns have changed to code_1 and code_2.

acylam · Answer

We can use grep + sapply to extract indices of matches from df2$code for each df1$code and create a matchID out of it. Next, we merge on matchID to get desired output:

df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))

df_merge = merge(df1, df2, by = "matchID")[-1]

Note that if a df1$code does not match any df2$code, df2$matchID will be blank, and so would not merge with df1$matchID.

Results:

> df2
          code  color matchID
1 123456789012   blue       1
2   2109876543    red       2
3   7890543211  green       3
4  98765678900 purple       4
5  12345665432 orange       5
6 678905432156  brown       6
7  14124124124  black        

> df_merge
        code.x  name       code.y  color
1 123456789012   bob 123456789012   blue
2 210987654321   joe   2109876543    red
3 567890543211 sally   7890543211  green
4 987656789001  john  98765678900 purple
5 123456654321  lucy  12345665432 orange
6 678905432156  alan 678905432156  brown

Data (Added non-match for better demo):

df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
                  name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
                  stringsAsFactors = FALSE)

df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
                  color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
                  stringsAsFactors = FALSE)

Merge two dataframes by partial string match

Tags:

python

r

vethno

2 Answers

G. Grothendieck

acylam

Recent Activity

Donate For Us

Merge two dataframes by partial string match

Tags:

python

r

vethno

2 Answers

G. Grothendieck

acylam

Related questions

Recent Activity

Donate For Us