Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

One to multiple merge two dataframes if one column string contained in another with Python

I have two dataframes that I would like to merge based on if column value of words from df1 contains column value of keywords from df2. I've been trying to use str.extract. But no luck so far to get the expected result. Example below:

df1:

[{'id': 1, 'words': 'chellomedia', 'languages': nan},
 {'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
 {'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
 {'id': 4, 'words': 'hello world', 'languages': nan},
 {'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
 {'id': 6, 'words': 'hello kitty', 'languages': 'English'},
 {'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
 {'id': 8, 'words': 'hola world', 'languages': nan}]

df2:

[{'code': 1, 'keywords': 'Hello'},
 {'code': 2, 'keywords': 'hola'},
 {'code': 3, 'keywords': 'world'}]

My trial code:

df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()

pat = '|'.join([re.escape(x) for x in df2.keywords])
df1.insert(0, 'keywords', df1['words'].str.extract('(' + pat + ')', expand=False))

pd.merge(df1, df2, on='keywords', how='left')

Out:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4     hola   5  ¡hola mundo!       Spanish   2.0
5    hello   6   hello kitty       English   1.0
6      NaN   7   ciao mondo!       Italian   NaN
7     hola   8    hola world           NaN   2.0

But the desired one should be like this:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4    world   4   hello world           NaN   3.0  ---> should be generated in df
5     hola   5  ¡hola mundo!       Spanish   2.0
6    hello   6   hello kitty       English   1.0
7      NaN   7   ciao mondo!       Italian   NaN
8     hola   8    hola world           NaN   2.0
9    world   8    hola world           NaN   3.0  ---> should be generated in df

How could I generate the expected result? Thanks.

like image 706
ah bon Avatar asked Dec 18 '25 19:12

ah bon


1 Answers

Instead of extract you have to use findall and explode, example:

df1.insert(0, 'keywords', df1['words'].str.findall('(' + pat + ')'))
print(pd.merge(df1.explode('keywords'), df2, on='keywords', how='left')
        .sort_values('id').reset_index(drop=True))

Output:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4    world   4   hello world           NaN   3.0
5     hola   5  ¡hola mundo!       Spanish   2.0
6    hello   6   hello kitty       English   1.0
7      NaN   7   ciao mondo!       Italian   NaN
8    world   8    hola world           NaN   3.0
9     hola   8    hola world           NaN   2.0

Exactly the same as what you need :)

like image 184
U12-Forward Avatar answered Dec 20 '25 10:12

U12-Forward



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!