Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge two dataframes by partial string match

Tags:

python

r

I am trying to merge two fairly large dataframes of different sizes based on partial string matches.

df1$code contains all 12 digit codes, while df2$code contains a mix of codes with 10-12 digits, where some of the shorter codes are substring matches to the 12 digit codes in df1$code.

Therefore, I need to merge all 12 digit matches between the two dataframes, but also those records in df2 that have 10-11 digit codes that are substring matches to the df1.

Example dataframes:

df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
              name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))

df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
              color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))

df3 (merged)

code_1         code_2         name  color
123456789012   123456789012   bob   blue
210987654321   2109876543     joe   red
567890543211   7890543211     sally green
987656789001   98765678900    john  purple
123456654321   12345665432    lucy  orange
678905432156   678905432156   alan  brown
like image 645
vethno Avatar asked Mar 06 '26 03:03

vethno


2 Answers

Try this SQL join.

library(sqldf)

sqldf("select a.code_1, b.code_2, a.name, b.color 
       from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")

giving:

        code_1       code_2  name  color
1 123456789012 123456789012   bob   blue
2 210987654321   2109876543   joe    red
3 567890543211   7890543211 sally  green
4 987656789001  98765678900  john purple
5 123456654321  12345665432  lucy orange
6 678905432156 678905432156  alan  brown

Update: Updated answer to reflect change in question so that (1) the substring can be anywhere in the target string and (2) names of code columns have changed to code_1 and code_2.

like image 76
G. Grothendieck Avatar answered Mar 08 '26 17:03

G. Grothendieck


We can use grep + sapply to extract indices of matches from df2$code for each df1$code and create a matchID out of it. Next, we merge on matchID to get desired output:

df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))

df_merge = merge(df1, df2, by = "matchID")[-1]

Note that if a df1$code does not match any df2$code, df2$matchID will be blank, and so would not merge with df1$matchID.

Results:

> df2
          code  color matchID
1 123456789012   blue       1
2   2109876543    red       2
3   7890543211  green       3
4  98765678900 purple       4
5  12345665432 orange       5
6 678905432156  brown       6
7  14124124124  black        

> df_merge
        code.x  name       code.y  color
1 123456789012   bob 123456789012   blue
2 210987654321   joe   2109876543    red
3 567890543211 sally   7890543211  green
4 987656789001  john  98765678900 purple
5 123456654321  lucy  12345665432 orange
6 678905432156  alan 678905432156  brown

Data (Added non-match for better demo):

df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
                  name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
                  stringsAsFactors = FALSE)

df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
                  color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
                  stringsAsFactors = FALSE)
like image 22
acylam Avatar answered Mar 08 '26 16:03

acylam



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!