Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Return count of matching values between two DataFrame variables

Tags:

python

pandas

This is a simple thing but I don't think it's been covered on SO or in the Pandas documentation.

The data

Two DataFrames containing data about cities. The DataFrame UK contains a variable for UK city names. It has been grouped so there are no duplicate city names. This is a truncated example of this city variable, (not the full DataFrame):

city
Hamilton
Edinburgh
Bury
...

The DataFrame US contains a corresponding variable for US city names. It does have duplicate city names, but these are not actually duplicate values as there are numerous cities with the same name. Again, a truncated example:

city
Hamilton
Hamilton
Edinburgh
Edinburgh
Edinburgh
Bury
Bury 
...

(Note: The variables aren't the same length in my data.)

The aim

Create a new variable in the UK DataFrame which is an integer count for each time a city name in the US city variable matches a city name in the UK city variable.

Below shows an example of the output I'm aiming for. Say for one UK city called Hamilton there are 2 cities called Hamilton in the US. For one UK city called Edinburgh there are 3 matches in the US. And 2 matches for Bury.

city       count
Hamilton       2
Edinburgh      3
Bury           2

Where I'm stuck

If I do this...

uk['count'] = uk['city'].isin(us['city']).astype(int)

the new variable is a binary 1 or 0 indicating that there is a match, which is halfway there. However I'm struggling with the Pandas syntax to return the count of matches. I've tried appending value_counts and variations of unique but these didn't work. Maybe I'm approaching this the wrong way.

like image 923
RDJ Avatar asked Sep 15 '25 14:09

RDJ


1 Answers

Use Series.value_counts to count the number of occurrences for each city in US['city'], and then use Series.map to apply those counts to corresponding values in UK['city']:

In [40]: US['city'].value_counts()
Out[40]: 
Edinburgh    3
Bury         2
Hamilton     2
Name: city, dtype: int64

In [41]: UK['count'] = UK['city'].map(US['city'].value_counts())

In [42]: UK
Out[42]: 
        city  count
0   Hamilton      2
1  Edinburgh      3
2       Bury      2

If there is a city in UK['city'] which is not in US['city'] then the above will assign a NaN value to the count column for that city.

like image 161
unutbu Avatar answered Sep 17 '25 05:09

unutbu