
Pandas - Get pattern that matches a url between two dataframes

I have two dataframes like these:

import pandas as pd

d1 = {'Domain': ['amazon.com', 'apple.com', 'amazon.com','xyz.com'], 'Pattern': ['kindle','music','subscribe-and-save',''],'Other Important Info':['a','b','c','d']}
df1 = pd.DataFrame(d1)

d2 = {'Domain': ['google.com','google.com','amazon.com','amazon.com', 'youtube.com', 'amazon.com'], 'Url': ['https://google.com/kindle','https://google.com/','https://amazon.com/subscribe-and-save','https://amazon.com/abc','https://youtube.com/music','https:amazon.com/kindle']}
df2 = pd.DataFrame(d2)

The goal is to merge the two dataframes on 'Domain', keeping only the rows where 'Pattern' occurs in 'Url'.

So the result should be the following dataframe:

{'Domain':['amazon.com','amazon.com'],'Url':['https://amazon.com/subscribe-and-save','https:amazon.com/kindle'],'Other Important Info':['c','a']}

This is how I'm doing it currently:

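def lookup_table(value, df):
    # Return the first pattern in df['Pattern'] that occurs in value, or None.
    out = None
    list_items = df['Pattern'].tolist()
    for item in list_items:
        if item in value:
            out = item
            break
    return out

# Look up the first matching pattern for every URL (empty patterns excluded).
df2['Pattern'] = df2['Url'].apply(lambda x: lookup_table(x, df1[df1['Pattern']!='']))

# Join the URLs that matched a pattern back to df1 on both keys.
merged = pd.merge(df2[df2['Pattern'].notnull()], df1[df1['Pattern']!=''],on=['Domain','Pattern'],how='left')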

However, the lookup_table function takes far too long to run because of the for loop, which runs once for every URL.

How can I do this faster? I'm using Python 2 on Windows.

inquisitiveProgrammer asked Nov 30 '25 02:11


1 Answer

df1

       Domain             Pattern Other Important Info
0  amazon.com              kindle                    a
1   apple.com               music                    b
2  amazon.com  subscribe-and-save                    c
3     xyz.com                                        d

df2

        Domain                                    Url
0   google.com              https://google.com/kindle
1   google.com                    https://google.com/
2   amazon.com  https://amazon.com/subscribe-and-save
3   amazon.com                 https://amazon.com/abc
4  youtube.com              https://youtube.com/music
5   amazon.com                https:amazon.com/kindle

To merge the two dataframes on 'Domain' and keep only the rows where 'Pattern' is in 'Url', merge first and then filter row by row:

df = df1.merge(df2, on='Domain')
df.loc[df.apply(lambda x: x.Pattern in x.Url, axis=1)]

Output

       Domain             Pattern Other Important Info                                    Url
2  amazon.com              kindle                    a                https:amazon.com/kindle
3  amazon.com  subscribe-and-save                    c  https://amazon.com/subscribe-and-save
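
A possible variation on the merge-then-filter idea above, offered as a sketch rather than a benchmarked solution: the row-wise apply can be replaced by a list comprehension over the two columns, which usually avoids much of the per-row overhead of apply(axis=1). Dropping empty patterns before the merge is an assumption added here, since an empty string is contained in every URL and would otherwise match any row of the same domain. The names patterns, merged, mask and result are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Domain': ['amazon.com', 'apple.com', 'amazon.com', 'xyz.com'],
    'Pattern': ['kindle', 'music', 'subscribe-and-save', ''],
    'Other Important Info': ['a', 'b', 'c', 'd'],
})
df2 = pd.DataFrame({
    'Domain': ['google.com', 'google.com', 'amazon.com',
               'amazon.com', 'youtube.com', 'amazon.com'],
    'Url': ['https://google.com/kindle', 'https://google.com/',
            'https://amazon.com/subscribe-and-save', 'https://amazon.com/abc',
            'https://youtube.com/music', 'https:amazon.com/kindle'],
})

# Assumption: an empty pattern should never count as a match.
patterns = df1[df1['Pattern'] != '']

# Inner join on 'Domain': every URL is paired with each pattern of its domain.
merged = df2.merge(patterns, on='Domain')

# Plain substring test per row, without DataFrame.apply(axis=1).
mask = [p in u for p, u in zip(merged['Pattern'], merged['Url'])]

result = merged.loc[mask, ['Domain', 'Url', 'Other Important Info']]
result = result.reset_index(drop=True)
print(result)
```

Because the substring test only runs on pairs that already share a domain, the number of comparisons depends on how many patterns each domain has, not on the total number of patterns in df1.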

iamklaus answered Dec 03 '25 17:12


