 

Pandas: Keep Column, Count, Drop Duplicates

Tags:

python

pandas

I'm currently trying to drop duplicates according to two columns, but count the duplicates before they are dropped. I've managed to do this via

df_interactions = df_interactions.groupby(['user_id','item_tag_ids']).size().reset_index() \
    .rename(columns={0:'interactions'})

but this leaves me with

   user_id  item_tag_ids  interactions
0      170            71             1
1      170           325             1
2      170           387             1
3      170           474             1
4      170           526             2

It does what I want with respect to counting, adding a column, and dropping the duplicates, but how would I do this while retaining the original structure (plus a new column)? Adding more columns to groupby changes its behaviour.

Here is the original structure; I only want to group by the IDs:

   user_id  item_tag_ids  item_timestamp
0   406225          7271      1483229353
1   406225          1183      1483229350
2   406225          5930      1483229350
3   406225          7162      1483229350
4   406225          7271      1483229350

I would like to have the new item_timestamp field in the smaller dataframe to contain the first occurring timestamp for that combination.

Alexander Hepburn asked Oct 19 '25 05:10



1 Answer

Use transform, as in the following code, to keep your original data's shape.

To get a list of all the item_timestamp values for each group, use groupby in combination with agg(list).

# First we create count column with transform
df['count'] = df.groupby(['user_id', 'item_tag_ids']).user_id.transform('size')

# After that, merge the groupby result (timestamps aggregated with list) back onto the original dataframe
df = df.merge(df.groupby(['user_id', 'item_tag_ids']).item_timestamp.agg(list).reset_index(),
              on=['user_id', 'item_tag_ids'],
              how='left',
              suffixes=['_1', '']).drop('item_timestamp_1', axis=1)

print(df)
   user_id  item_tag_ids  count            item_timestamp
0   406225          7271      2  [1483229353, 1483229350]
1   406225          1183      1              [1483229350]
2   406225          5930      1              [1483229350]
3   406225          7162      1              [1483229350]
4   406225          7271      2  [1483229353, 1483229350]
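
As a rough check of why transform keeps the original shape, the sketch below rebuilds the five sample rows from the question (an assumption: the real dataframe is larger) and compares size(), which collapses to one row per group, with transform('size'), which broadcasts the group size back onto every row. The variable names keys, collapsed and per_row are only for illustration.

```python
import pandas as pd

# Sample rows copied from the question; the real data is assumed to be larger
df = pd.DataFrame({
    "user_id": [406225] * 5,
    "item_tag_ids": [7271, 1183, 5930, 7162, 7271],
    "item_timestamp": [1483229353, 1483229350, 1483229350,
                       1483229350, 1483229350],
})

keys = ["user_id", "item_tag_ids"]

# size() returns one value per group, so the result is shorter than df
collapsed = df.groupby(keys).size()

# transform('size') returns one value per original row, aligned to df's index
per_row = df.groupby(keys)["item_timestamp"].transform("size")

print(len(collapsed))    # 4 groups
print(per_row.tolist())  # [2, 1, 1, 1, 2]
```

Because per_row shares df's index, it can be assigned directly as a new column without a merge.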

Explanation of .agg(list): it aggregates the values of each group into a list, like the following:

df.groupby(['user_id', 'item_tag_ids']).item_timestamp.agg(list).reset_index()
Out[39]: 
   user_id  item_tag_ids            item_timestamp
0   406225          1183              [1483229350]
1   406225          5930              [1483229350]
2   406225          7162              [1483229350]
3   406225          7271  [1483229353, 1483229350]
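
If the goal is the smaller dataframe described in the question (one row per user_id / item_tag_ids pair, with a count and the first occurring timestamp), one possible approach is to count with transform and then drop duplicates. This is a minimal sketch using the question's sample rows, not the method from the answer above; the column name interactions is borrowed from the question, and "first" is assumed to mean first in row order.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "user_id": [406225] * 5,
    "item_tag_ids": [7271, 1183, 5930, 7162, 7271],
    "item_timestamp": [1483229353, 1483229350, 1483229350,
                       1483229350, 1483229350],
})

keys = ["user_id", "item_tag_ids"]

# Count how often each (user_id, item_tag_ids) pair occurs, per row
df["interactions"] = df.groupby(keys)["item_timestamp"].transform("size")

# Keep only the first row of each pair, which also keeps its timestamp
result = df.drop_duplicates(subset=keys, keep="first").reset_index(drop=True)

print(result)
```

If "first" should instead mean the earliest timestamp, sorting with df.sort_values('item_timestamp') before drop_duplicates would be one way to get that.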
Erfan answered Oct 21 '25 20:10
