Pandas: Keep Column, Count, Drop Duplicates

Question

I'm currently trying to drop duplicates according to two columns, but count the duplicates before they are dropped. I've managed to do this via

df_interactions = df_interactions.groupby(['user_id','item_tag_ids']).size().reset_index() \
    .rename(columns={0:'interactions'})

but this leaves me with

   user_id  item_tag_ids  interactions
0      170            71             1
1      170           325             1
2      170           387             1
3      170           474             1
4      170           526             2

It does what I want with respect to counting, adding the count as a column, and dropping the duplicates, but how would I do this while retaining the original structure (plus the new column)? Adding more columns to the groupby changes its behaviour.

Here is the original structure; I only want to group by the IDs:

   user_id  item_tag_ids  item_timestamp
0   406225          7271      1483229353
1   406225          1183      1483229350
2   406225          5930      1483229350
3   406225          7162      1483229350
4   406225          7271      1483229350

I would like the item_timestamp field in the smaller dataframe to contain the first occurring timestamp for that combination.

Erfan · Accepted Answer

You can use transform, as in the following, to keep your original data's shape.

To get a list of all the item_timestamp values for each group, you can use groupby in combination with agg(list).

# First we create the count column with transform
df['count'] = df.groupby(['user_id', 'item_tag_ids']).user_id.transform('size')

# After that we merge the per-group lists of timestamps back into our original dataframe
df = df.merge(df.groupby(['user_id', 'item_tag_ids']).item_timestamp.agg(list).reset_index(),
              on=['user_id', 'item_tag_ids'],
              how='left',
              suffixes=['_1', '']).drop('item_timestamp_1', axis=1)

print(df)
   user_id  item_tag_ids  count            item_timestamp
0   406225          7271      2  [1483229353, 1483229350]
1   406225          1183      1              [1483229350]
2   406225          5930      1              [1483229350]
3   406225          7162      1              [1483229350]
4   406225          7271      2  [1483229353, 1483229350]
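
If the goal is the smaller frame described in the question (one row per ID pair, with a count and the first timestamp), one possible variation is to follow the transform step with drop_duplicates. This is only a sketch built on the five sample rows from the question; the column name interactions is taken from the question, and "first occurring" is assumed to mean first in row order.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "user_id": [406225] * 5,
    "item_tag_ids": [7271, 1183, 5930, 7162, 7271],
    "item_timestamp": [1483229353, 1483229350, 1483229350,
                       1483229350, 1483229350],
})

keys = ["user_id", "item_tag_ids"]

# Count rows per (user_id, item_tag_ids) pair; transform keeps the
# result aligned with the original index, so every row gets its count.
df["interactions"] = df.groupby(keys)["item_timestamp"].transform("size")

# Keep only the first row of each pair (assumption: "first occurring"
# means first in row order, not the smallest timestamp).
result = df.drop_duplicates(subset=keys, keep="first").reset_index(drop=True)
print(result)
```

Because drop_duplicates keeps whole rows, any other columns in the original frame come along unchanged, which may be closer to "retaining the original structure" than a plain groupby.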

Explanation of .agg(list): it aggregates the values of each group into a list, like the following:

df.groupby(['user_id', 'item_tag_ids']).item_timestamp.agg(list).reset_index()
Out[39]: 
   user_id  item_tag_ids            item_timestamp
0   406225          1183              [1483229350]
1   406225          5930              [1483229350]
2   406225          7162              [1483229350]
3   406225          7271  [1483229353, 1483229350]
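
The same agg call accepts other aggregations besides list. As a hedged sketch (not part of the original answer), named aggregation could produce the count and a single timestamp per pair in one step. The sample data is copied from the question; the output name interactions and the choice of "first" over "min" are assumptions about what the asker wants.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "user_id": [406225] * 5,
    "item_tag_ids": [7271, 1183, 5930, 7162, 7271],
    "item_timestamp": [1483229353, 1483229350, 1483229350,
                       1483229350, 1483229350],
})

summary = (
    df.groupby(["user_id", "item_tag_ids"])
      .agg(
          # Number of rows in each group.
          interactions=("item_timestamp", "size"),
          # "first" takes the first value in row order; use "min"
          # instead if the earliest timestamp is what is needed.
          item_timestamp=("item_timestamp", "first"),
      )
      .reset_index()
)
print(summary)
```

Note that groupby sorts the result by the group keys by default, so the row order differs from the original frame.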

Tags: python, pandas

Alexander Hepburn
