It is difficult to describe this in a heading, but given these two DataFrames:
import pandas as pd
import numpy as np
import re
df1 = pd.DataFrame({
'url': [
'http://google.com/car',
'http://google.com/moto',
'http://google.com/moto-bike'
], 'value': [3, 4, 6]})
url value
http://google.com/car 3
http://google.com/moto 4
http://google.com/moto-bike 6
df2 = pd.DataFrame({'name': ['car','moto','bus']})
name
0 car
1 moto
2 bus
I want to count how many times each name in df2 appears in the url column of df1, and I have sort of managed it with:
df2['instances'] = pd.Series([df1.url.str.contains(fr'\D{w}\D', regex=True) \
.sum() for w in df2.name.tolist()])
For some reason car has zero instances, even though it appears once.
name instances
0 car 0
1 moto 2
2 bus 0
What I would like to do is add another column that sums the value column over all matching rows of df1, so it looks like this:
name instances value_total
0 car 1 3
1 moto 2 10
2 bus 0 0
Any help in the right direction would be greatly appreciated, thanks!
Try str.extract, then merge and groupby with named aggregation (new in pandas 0.25+):
pat = '|'.join(df2['name'])  # 'car|moto|bus'
m = df2.merge(
    df1.assign(name=df1['url'].str.extract('(' + pat + ')', expand=False)),
    on='name', how='left')
m = (m.groupby('name', sort=False)
      .agg(instances=('value', 'count'), value_total=('value', 'sum'))
      .reset_index())
print(m)
name instances value_total
0 car 1 3.0
1 moto 2 10.0
2 bus 0 0.0
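As for the zero reported for car in the question: the pattern \D{w}\D requires a non-digit character on both sides of the name, and in http://google.com/car nothing follows car, so the trailing \D has nothing to match. Below is a minimal sketch of the loop-based approach with the trailing \D relaxed to "a non-digit or the end of the string". The names mask, rows and result are made up for illustration, and re.escape is an extra precaution that the question did not use.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'url': [
        'http://google.com/car',
        'http://google.com/moto',
        'http://google.com/moto-bike',
    ],
    'value': [3, 4, 6],
})
df2 = pd.DataFrame({'name': ['car', 'moto', 'bus']})

# The original pattern cannot match a name that ends the URL.
car_with_original = int(df1['url'].str.contains(r'\Dcar\D').sum())

rows = []
for w in df2['name']:
    # Accept either a non-digit character or the end of the string after the name.
    mask = df1['url'].str.contains(fr'\D{re.escape(w)}(?:\D|$)')
    rows.append({
        'name': w,
        'instances': int(mask.sum()),
        'value_total': int(df1.loc[mask, 'value'].sum()),
    })

result = pd.DataFrame(rows)
print(result)
```

This keeps everything as integers because no NaN is ever introduced, at the cost of one pass over df1 per name.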
Here's a similar version of anky's answer using .loc, groupby & merge:
pat = '|'.join(df2['name'])
# str.contains needs no capture group (pandas warns about match groups there); str.extract does
df1.loc[df1['url'].str.contains(pat), 'name'] = df1['url'].str.extract(f'({pat})')[0]
vals = (
    df1.groupby("name")
    .agg({"name": "count", "value": "sum"})
    .rename(columns={"name": "instance"})
    .reset_index()
)
new_df = pd.merge(df2,vals,on='name',how='left').fillna(0)
print(new_df)
name instance value
0 car 1.0 3.0
1 moto 2.0 10.0
2 bus 0.0 0.0
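The float columns in the output above come from the left merge: bus has no match, so its cells are NaN, which forces float64, and fillna(0) keeps that dtype. If integer columns are wanted, one option is to cast after filling. A minimal sketch, where vals is built by hand as a stand-in for the groupby result shown earlier:

```python
import pandas as pd

df2 = pd.DataFrame({'name': ['car', 'moto', 'bus']})

# Hand-built stand-in for the aggregated frame (assumption for illustration).
vals = pd.DataFrame({'name': ['car', 'moto'], 'instance': [1, 2], 'value': [3, 10]})

new_df = df2.merge(vals, on='name', how='left').fillna(0)
float_dtypes = new_df[['instance', 'value']].dtypes  # float64 because of the NaN for bus

# Cast back to integers once the NaN values have been replaced.
new_df = new_df.astype({'instance': int, 'value': int})
print(new_df)
```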
Edit: if you need an exact match of car, then we can add word boundaries:
pat = r'|'.join(np.where(df2['name'].str.contains('car'),
r'\b' + df2['name'] + r'\b', df2['name']))
print(df1)
url value
0 http://google.com/car 3
1 http://google.com/motor 4
2 http://google.com/carousel 6
3 http://google.com/bus 8
df1.loc[df1['url'].str.contains(f'{pat}'),'name'] = df1['url'].str.extract(f'({pat})')[0]
print(df1)
url value name
0 http://google.com/car 3 car
1 http://google.com/motor 4 moto
2 http://google.com/carousel 6 NaN
3 http://google.com/bus 8 bus
If you want exact matches for all names, just add word boundaries to the pattern:
pat = '|'.join(r'\b' + df2['name'] + r'\b')
#'\\bcar\\b|\\bmoto\\b|\\bbus\\b'
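One thing worth knowing about \b: a hyphen is not a word character, so \bmoto\b still matches inside moto-bike, while motor and carousel are rejected. That agrees with the expected count of 2 for moto in the question. A small sketch to check this, using a made-up df1 that mixes URLs from both examples; re.escape is an added precaution in case a name ever contains regex metacharacters:

```python
import re
import pandas as pd

# Illustrative data (assumption): URLs combined from the examples above.
df1 = pd.DataFrame({
    'url': [
        'http://google.com/car',
        'http://google.com/moto-bike',
        'http://google.com/motor',
        'http://google.com/carousel',
    ],
    'value': [3, 6, 4, 5],
})
df2 = pd.DataFrame({'name': ['car', 'moto', 'bus']})

# Wrap every name in word boundaries, escaping it first.
pat = '|'.join(rf'\b{re.escape(n)}\b' for n in df2['name'])

# Rows with no whole-word match get NaN in the new column.
df1['name'] = df1['url'].str.extract(f'({pat})', expand=False)
print(df1)
```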