I have two dataframes that I would like to merge whenever the value in the words column of df1 contains a value from the keywords column of df2. I've been trying to use str.extract, but so far I haven't managed to get the expected result. Example below:
df1:
[{'id': 1, 'words': 'chellomedia', 'languages': nan},
{'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
{'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
{'id': 4, 'words': 'hello world', 'languages': nan},
{'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
{'id': 6, 'words': 'hello kitty', 'languages': 'English'},
{'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
{'id': 8, 'words': 'hola world', 'languages': nan}]
df2:
[{'code': 1, 'keywords': 'Hello'},
{'code': 2, 'keywords': 'hola'},
{'code': 3, 'keywords': 'world'}]
My trial code:
import re
import pandas as pd

df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()
pat = '|'.join([re.escape(x) for x in df2.keywords])
df1.insert(0, 'keywords', df1['words'].str.extract('(' + pat + ')', expand=False))
pd.merge(df1, df2, on='keywords', how='left')
Out:
   keywords  id         words     languages  code
0     hello   1   chellomedia           NaN   1.0
1       NaN   2   moien welt!  Luxemburgish   NaN
2       NaN   3   ahoj světe!         Czech   NaN
3     hello   4   hello world           NaN   1.0
4      hola   5  ¡hola mundo!       Spanish   2.0
5     hello   6   hello kitty       English   1.0
6       NaN   7   ciao mondo!       Italian   NaN
7      hola   8    hola world           NaN   2.0
But the desired one should be like this:
   keywords  id         words     languages  code
0     hello   1   chellomedia           NaN   1.0
1       NaN   2   moien welt!  Luxemburgish   NaN
2       NaN   3   ahoj světe!         Czech   NaN
3     hello   4   hello world           NaN   1.0
4     world   4   hello world           NaN   3.0  ---> should be generated in df
5      hola   5  ¡hola mundo!       Spanish   2.0
6     hello   6   hello kitty       English   1.0
7       NaN   7   ciao mondo!       Italian   NaN
8      hola   8    hola world           NaN   2.0
9     world   8    hola world           NaN   3.0  ---> should be generated in df
How could I generate the expected result? Thanks.
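str.extract returns only the first match in each string, so rows that contain more than one keyword lose the later ones. Use str.findall instead to collect every match as a list, then explode so that each match gets its own row. Example: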
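# findall returns a list of all matching keywords for each row
df1.insert(0, 'keywords', df1['words'].str.findall('(' + pat + ')'))
# explode gives each list element its own row; rows with no match become NaN
print(pd.merge(df1.explode('keywords'), df2, on='keywords', how='left')
      .sort_values('id').reset_index(drop=True))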
Output:
   keywords  id         words     languages  code
0     hello   1   chellomedia           NaN   1.0
1       NaN   2   moien welt!  Luxemburgish   NaN
2       NaN   3   ahoj světe!         Czech   NaN
3     hello   4   hello world           NaN   1.0
4     world   4   hello world           NaN   3.0
5      hola   5  ¡hola mundo!       Spanish   2.0
6     hello   6   hello kitty       English   1.0
7       NaN   7   ciao mondo!       Italian   NaN
8     world   8    hola world           NaN   3.0
9      hola   8    hola world           NaN   2.0
This matches what you need, apart from the order of the two rows for id 8 :)
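For anyone who wants to run this end to end, here is a minimal self-contained sketch of the findall and explode approach, built from the sample data in the question. The helper name merge_on_keywords is hypothetical and only there for illustration, and kind='stable' is an addition of mine (not part of the answer's code) meant to keep rows with the same id in their merged order. Note that it works on copies, so it does not hit the error that df1.insert raises when a keywords column already exists.

```python
import re

import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    {'id': 1, 'words': 'chellomedia', 'languages': np.nan},
    {'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
    {'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
    {'id': 4, 'words': 'hello world', 'languages': np.nan},
    {'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
    {'id': 6, 'words': 'hello kitty', 'languages': 'English'},
    {'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
    {'id': 8, 'words': 'hola world', 'languages': np.nan},
])
df2 = pd.DataFrame([
    {'code': 1, 'keywords': 'Hello'},
    {'code': 2, 'keywords': 'hola'},
    {'code': 3, 'keywords': 'world'},
])


def merge_on_keywords(df1, df2):
    """Hypothetical helper: one output row per (df1 row, matched keyword)."""
    left = df1.copy()   # work on copies so the inputs stay untouched
    right = df2.copy()
    left['words'] = left['words'].str.lower()
    right['keywords'] = right['keywords'].str.lower()

    # Alternation of all keywords, escaped so they match literally
    pat = '|'.join(re.escape(k) for k in right['keywords'])

    # findall collects every match per row as a list (empty if none)
    left.insert(0, 'keywords', left['words'].str.findall(pat))

    # explode: one row per list element; empty lists become NaN
    exploded = left.explode('keywords')

    merged = exploded.merge(right, on='keywords', how='left')
    # kind='stable' is my assumption for keeping same-id rows in merged order
    return merged.sort_values('id', kind='stable').reset_index(drop=True)


result = merge_on_keywords(df1, df2)
print(result)
```

One thing to be aware of: a string that contains the same keyword twice (for example 'hello hello') would produce two identical rows, so you may want to call drop_duplicates() on the result if that can occur in your data.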