One to multiple merge two dataframes if one column string contained in another with Python

Question

I have two dataframes that I would like to merge based on if column value of words from df1 contains column value of keywords from df2. I've been trying to use str.extract. But no luck so far to get the expected result. Example below:

df1:

[{'id': 1, 'words': 'chellomedia', 'languages': nan},
 {'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
 {'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
 {'id': 4, 'words': 'hello world', 'languages': nan},
 {'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
 {'id': 6, 'words': 'hello kitty', 'languages': 'English'},
 {'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
 {'id': 8, 'words': 'hola world', 'languages': nan}]

df2:

[{'code': 1, 'keywords': 'Hello'},
 {'code': 2, 'keywords': 'hola'},
 {'code': 3, 'keywords': 'world'}]

My trial code:

df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()

pat = '|'.join([re.escape(x) for x in df2.keywords])
df1.insert(0, 'keywords', df1['words'].str.extract('(' + pat + ')', expand=False))

pd.merge(df1, df2, on='keywords', how='left')

Out:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4     hola   5  ¡hola mundo!       Spanish   2.0
5    hello   6   hello kitty       English   1.0
6      NaN   7   ciao mondo!       Italian   NaN
7     hola   8    hola world           NaN   2.0

But the desired one should be like this:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4    world   4   hello world           NaN   3.0  ---> should be generated in df
5     hola   5  ¡hola mundo!       Spanish   2.0
6    hello   6   hello kitty       English   1.0
7      NaN   7   ciao mondo!       Italian   NaN
8     hola   8    hola world           NaN   2.0
9    world   8    hola world           NaN   3.0  ---> should be generated in df

How could I generate the expected result? Thanks.

U12-Forward · Accepted Answer

Instead of extract you have to use findall and explode, example:

df1.insert(0, 'keywords', df1['words'].str.findall('(' + pat + ')'))
print(pd.merge(df1.explode('keywords'), df2, on='keywords', how='left')
        .sort_values('id').reset_index(drop=True))

Output:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4    world   4   hello world           NaN   3.0
5     hola   5  ¡hola mundo!       Spanish   2.0
6    hello   6   hello kitty       English   1.0
7      NaN   7   ciao mondo!       Italian   NaN
8    world   8    hola world           NaN   3.0
9     hola   8    hola world           NaN   2.0

Exactly the same as what you need :)

One to multiple merge two dataframes if one column string contained in another with Python

Tags:

python

regex

python-3.x

pandas

dataframe

ah bon

1 Answers

U12-Forward

Recent Activity

Donate For Us

One to multiple merge two dataframes if one column string contained in another with Python

Tags:

python

regex

python-3.x

pandas

dataframe

ah bon

1 Answers

U12-Forward

Related questions

Recent Activity

Donate For Us