I have a report_date column that I want to aggregate by its year-week and count the number of unique users.
import pandas as pd
from io import StringIO
datastring = StringIO("""\
report_date user_id
2015-12-01 1
2015-12-01 2
2015-12-01 2
2015-12-02 2
2015-12-02 3
2016-01-01 1
""")
df = pd.read_table(datastring, sep=r'\s+', engine='python')
df['report_date'] = pd.to_datetime(df['report_date'])
Output I want:
2015-48 3
2016-00 1
I have come up with a solution (posted below) but it's relatively slow when using a much larger dataset (> 1MM rows). Curious if there's a better solution to this problem.
Since your column is already a datetime there is no need to convert it to a string and group on that; we can group on the year and week components directly and then just call nunique:
In [108]:
df.groupby([df['report_date'].dt.year, df['report_date'].dt.week])['user_id'].nunique()
Out[108]:
report_date report_date
2015 49 3
2016 53 1
Name: user_id, dtype: int64
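Note that in recent pandas versions `Series.dt.week` has been removed in favour of `Series.dt.isocalendar().week`. A sketch of the same grouping on a current version, with the sample data inlined:

```python
import pandas as pd

df = pd.DataFrame({
    'report_date': pd.to_datetime(
        ['2015-12-01', '2015-12-01', '2015-12-01',
         '2015-12-02', '2015-12-02', '2016-01-01']),
    'user_id': [1, 2, 2, 2, 3, 1],
})

# isocalendar() returns a DataFrame of ISO year/week/day; pairing its
# week with the calendar year reproduces the (2015, 49) / (2016, 53) keys
week = df['report_date'].dt.isocalendar().week
result = df.groupby([df['report_date'].dt.year, week])['user_id'].nunique()
print(result)
```

As in the answer above, 2016-01-01 falls in ISO week 53 (of 2015), which is exactly the edge case the follow-up below works around.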
Alternatively, build the year-week string with strftime and group on that:
(df.assign(report_week=lambda x: x.report_date.dt.strftime('%Y-%W'))
.groupby('report_week')
.user_id
.nunique()
)
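For the sample data this string-based grouping produces exactly the `2015-48` / `2016-00` keys the question asks for, because `%W` numbers weeks from the first Monday of the year (days before it fall in week 00). A self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'report_date': pd.to_datetime(
        ['2015-12-01', '2015-12-01', '2015-12-01',
         '2015-12-02', '2015-12-02', '2016-01-01']),
    'user_id': [1, 2, 2, 2, 3, 1],
})

# '%Y-%W' gives '2015-48' for the December rows and '2016-00' for
# 2016-01-01, since Jan 1-3 2016 precede the year's first Monday
result = (df.assign(report_week=lambda x: x.report_date.dt.strftime('%Y-%W'))
            .groupby('report_week')
            .user_id
            .nunique())
print(result)
```

A trade-off worth noting: the string formatting is typically slower than grouping on the numeric year/week components, but the keys sort lexicographically and match the requested output format.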
I ended up modifying @EdChum's suggestion to get rid of cases like '2016-53' when report_date is 2016-01-01, by taking the week number modulo 53:
df.groupby([df.report_date.dt.year, df.report_date.dt.week.mod(53)]).user_id.nunique()
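A runnable sketch of that modulo trick, using `isocalendar().week` in place of the removed `dt.week` (an assumption for current pandas versions): week 53 wraps to 0, so 2016-01-01 lands under (2016, 0) rather than (2016, 53).

```python
import pandas as pd

df = pd.DataFrame({
    'report_date': pd.to_datetime(
        ['2015-12-01', '2015-12-01', '2015-12-01',
         '2015-12-02', '2015-12-02', '2016-01-01']),
    'user_id': [1, 2, 2, 2, 3, 1],
})

# 53 % 53 == 0, so the ISO week-53 spillover into the new calendar
# year is folded into a week-0 bucket for that year
week = df.report_date.dt.isocalendar().week.mod(53)
result = df.groupby([df.report_date.dt.year, week])['user_id'].nunique()
print(result)
```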