Pandas rank based on several columns

Question

I have the following DataFrame:

event_id  occurred_at  user_id
   19148   2015-10-01        1
   19693   2015-10-05        2
   20589   2015-10-12        1
   20996   2015-10-15        1
   20998   2015-10-15        1
   23301   2015-10-23        2
   23630   2015-10-26        1
   25172   2015-11-03        1
   31699   2015-12-11        1
   32186   2015-12-14        2
   43426   2016-01-13        1
   68300   2016-04-04        2
   71926   2016-04-19        1

I would like to rank the events by chronological order (1 to n), for each user.

I can achieve this with:

df.groupby('user_id')['occurred_at'].rank(method='dense')

However, the following two rows, which occurred on the same date for the same user, end up with the same rank:

   20996   2015-10-15        1
   20998   2015-10-15        1

When the event date is the same, I would like to compare event_id and give the lower rank to the event with the lower event_id. How can I achieve this easily?

I could post-process the ranks to make sure each rank is used only once, but that seems clumsy.

Edit: how to reproduce:

Copy and paste the data into a file named data.csv, then run:

import pandas as pd
df = pd.read_csv('data.csv', delim_whitespace=True)
df['rank'] = df.groupby('user_id')['occurred_at'].rank(method='dense')
>>> df[df['user_id'] == 1]
    event_id occurred_at  user_id  rank
0      19148  2015-10-01        1   1.0
2      20589  2015-10-12        1   2.0
3      20996  2015-10-15        1   3.0 <--
4      20998  2015-10-15        1   3.0 <--
6      23630  2015-10-26        1   4.0
7      25172  2015-11-03        1   5.0
8      31699  2015-12-11        1   6.0
10     43426  2016-01-13        1   7.0
12     71926  2016-04-19        1   8.0

I am using Python 3 and pandas 0.18.1.

piRSquared · Accepted Answer

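Sort by event_id before grouping (sort_values('event_id')), then pass method='first' to rank. With method='first', tied values are ranked in the order in which they appear, so after sorting, the event with the lower event_id gets the lower rank.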
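To illustrate the difference between the two ranking methods, here is a minimal sketch on a made-up Series (the values are hypothetical and not taken from the question's data):

```python
import pandas as pd

# Hypothetical values with one tie (the two 3s).
s = pd.Series([3, 1, 3, 2])

# 'dense' gives tied values the same rank.
dense = s.rank(method='dense').tolist()

# 'first' breaks ties by order of appearance, so every rank is unique.
first = s.rank(method='first').tolist()

print(dense)  # [3.0, 1.0, 3.0, 2.0]
print(first)  # [3.0, 1.0, 4.0, 2.0]
```

Because ties are broken by position, the row order at the time rank is called decides which tied row gets the lower rank.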

Also note that if occurred_at is not already a datetime column, you should convert it with pd.to_datetime.

# unnecessary if already datetime, but doesn't hurt to do it anyway
df.occurred_at = pd.to_datetime(df.occurred_at) 
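As a rough illustration of why the conversion can matter: the dates in the question are in ISO format, which happens to sort correctly even as text, but other formats may not. The sketch below assumes a hypothetical month/day/year format that does not appear in the question.

```python
import pandas as pd

# Hypothetical non-ISO date strings (not from the question's data).
raw = pd.Series(["9/30/2015", "10/2/2015"])

# As text, "10/2/2015" sorts before "9/30/2015" because '1' < '9'.
as_text = raw.sort_values().tolist()

# As datetimes, September 30 correctly comes before October 2.
parsed = pd.to_datetime(raw, format="%m/%d/%Y")
as_dates = parsed.sort_values().dt.strftime("%Y-%m-%d").tolist()

print(as_text)   # ['10/2/2015', '9/30/2015']
print(as_dates)  # ['2015-09-30', '2015-10-02']
```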

df['rank'] = df.sort_values('event_id') \
                 .groupby('user_id').occurred_at \
                 .rank(method='first')
df


Complete, verifiable code for reference:

from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
import pandas as pd

text = """event_id  occurred_at  user_id
   19148   2015-10-01        1
   19693   2015-10-05        2
   20589   2015-10-12        1
   20996   2015-10-15        1
   20998   2015-10-15        1
   23301   2015-10-23        2
   23630   2015-10-26        1
   25172   2015-11-03        1
   31699   2015-12-11        1
   32186   2015-12-14        2
   43426   2016-01-13        1
   68300   2016-04-04        2
   71926   2016-04-19        1"""

# sep=r'\s+' splits on runs of whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv(StringIO(text), sep=r'\s+')

df.occurred_at = pd.to_datetime(df.occurred_at) 

df['rank'] = df.sort_values('event_id').groupby('user_id').occurred_at.rank(method='first')

df
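For comparison, here is an alternative sketch that is not the method used in the answer above: sort by the full ordering key and number the rows within each user with groupby(...).cumcount(). It assumes the same sample data as the question and produces integer ranks rather than floats.

```python
from io import StringIO
import pandas as pd

text = """event_id  occurred_at  user_id
   19148   2015-10-01        1
   19693   2015-10-05        2
   20589   2015-10-12        1
   20996   2015-10-15        1
   20998   2015-10-15        1
   23301   2015-10-23        2
   23630   2015-10-26        1
   25172   2015-11-03        1
   31699   2015-12-11        1
   32186   2015-12-14        2
   43426   2016-01-13        1
   68300   2016-04-04        2
   71926   2016-04-19        1"""

df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=['occurred_at'])

# Order rows by user, then date, then event_id as the tie-breaker.
ordered = df.sort_values(['user_id', 'occurred_at', 'event_id'])

# cumcount numbers rows 0..n-1 within each user; assignment aligns on the index.
df['rank'] = ordered.groupby('user_id').cumcount() + 1

print(df[df['user_id'] == 1])
```

With this approach the two events on 2015-10-15 get ranks 3 and 4, and the result does not depend on the original row order of the frame.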

Tags: python, python-3.x, pandas

3kt
