
How to join dataframes with multiple IDs?

I have two dataframes and a rather tricky join to accomplish.

The first dataframe:

import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]

df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
df1

Output:

RuleSetID   RuleSetName    KeyWordGroupID
    0         Standard1    [100, 101, 102]
    1         Standard2    [100, 102]
    2         Standard3    [103]
   ...         ...          ... 

The second one:

data = [[100, 'verahren', ['word1', 'word2']], 
        [101, 'flaechen', ['word3']], 
        [102, 'nutzung', ['word4', 'word5']],
        [103, 'ort', ['word6', 'word7']]]
 
df2 = pd.DataFrame(data, columns = ['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])
df2

Output:

KeyWordGroupID  KeyWordGroupName    KeyWords
    100               verahren      ['word1', 'word2']
    101               flaechen      ['word3']
    102               nutzung       ['word4', 'word5']
    103               ort           ['word6', 'word7']
    ...               ...            ...

The desired output:

RuleSetID   RuleSetName    KeyWordGroupID
    0         Standard1    [['word1', 'word2'], ['word3'], ['word4', 'word5']]
    1         Standard2    [['word1', 'word2'], ['word4', 'word5']]
    2         Standard3    [['word6', 'word7']]

I tried converting the second dataframe into a dictionary with df.to_dict('records') and passing it to a user-defined function via pandas apply to match on the key values, but that doesn't seem like a clean approach.
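A minimal sketch of what such an attempt could look like, assuming the records are re-indexed by KeyWordGroupID before the lookup. The names `keywords_by_group`, `lookup_keywords`, and `attempt` are hypothetical and only illustrate the idea; this is not the original code.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]],
    columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'],
)
df2 = pd.DataFrame(
    [[100, 'verahren', ['word1', 'word2']],
     [101, 'flaechen', ['word3']],
     [102, 'nutzung', ['word4', 'word5']],
     [103, 'ort', ['word6', 'word7']]],
    columns=['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'],
)

# Turn df2 into a list of row dicts, then index those rows by KeyWordGroupID.
records = df2.to_dict('records')
keywords_by_group = {row['KeyWordGroupID']: row['KeyWords'] for row in records}


def lookup_keywords(group_ids):
    """Hypothetical helper: swap each group ID for its keyword list."""
    return [keywords_by_group[group_id] for group_id in group_ids]


# Apply the helper to every list of IDs and store the result in a new column.
attempt = df1.assign(KeyWords=df1['KeyWordGroupID'].apply(lookup_keywords))
```

This produces the nested lists, but it runs a Python-level function once per row, which is probably why it does not feel clean.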

Does anyone have an approach to solve this? Any ideas are appreciated.

Daniel asked Sep 05 '25 03:09


1 Answer

I think you have a couple of different options:

  1. You can create a dictionary and use map
  2. You can convert the lists to a string and use replace

Option 1

e = df1.explode('KeyWordGroupID')  # explode your frame: one row per KeyWordGroupID
# create a dictionary from KeyWords and map it to the KeyWordGroupID
e['KeyWords'] = e['KeyWordGroupID'].map(df2.set_index('KeyWordGroupID')['KeyWords'].to_dict())
# collect the keyword lists per RuleSetID and merge them back onto df1
new_df = df1.merge(e.groupby('RuleSetID')['KeyWords'].agg(list), right_index=True, left_on='RuleSetID')

   RuleSetID RuleSetName   KeyWordGroupID  \
0          0   Standard1  [100, 101, 102]   
1          1   Standard2       [100, 102]   
2          2   Standard3            [103]   

                                    KeyWords  
0  [[word1, word2], [word3], [word4, word5]]  
1           [[word1, word2], [word4, word5]]  
2                           [[word6, word7]]  
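
Option 2 is listed above but no code is shown for it. The following is only a sketch of how the string-and-replace idea could work, not necessarily what was intended: it assumes every run of digits in the stringified list is a KeyWordGroupID present in df2, and that the keyword lists survive a round trip through `ast.literal_eval`. The names `lookup`, `as_text`, and `replaced` are hypothetical.

```python
import ast

import pandas as pd

df1 = pd.DataFrame(
    [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]],
    columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'],
)
df2 = pd.DataFrame(
    [[100, 'verahren', ['word1', 'word2']],
     [101, 'flaechen', ['word3']],
     [102, 'nutzung', ['word4', 'word5']],
     [103, 'ort', ['word6', 'word7']]],
    columns=['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'],
)

# Assumed lookup: group ID as text -> text form of its keyword list.
lookup = {str(k): str(v) for k, v in zip(df2['KeyWordGroupID'], df2['KeyWords'])}

# Render each ID list as text, e.g. "[100, 101, 102]".
as_text = df1['KeyWordGroupID'].astype(str)

# Swap every run of digits for the matching keyword list in a single regex pass.
replaced = as_text.str.replace(r'\d+', lambda m: lookup[m.group()], regex=True)

# Parse the text back into real nested lists.
df1['KeyWords'] = replaced.map(ast.literal_eval)
```

An ID that is missing from df2 raises a KeyError here, and keywords that are not plain Python literals would not parse back, so Option 1 is likely the more robust choice.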
It_is_Chris answered Sep 08 '25 01:09