I have a pandas dataframe with the following format:
| UserId | Date | BookId |
|---|---|---|
| 1 | 2022-07-15 | 10 |
| 1 | 2022-07-16 | 11 |
| 1 | 2022-07-16 | 12 |
| 1 | 2022-07-17 | 12 |
From this table, I want to obtain the number of new BookId values on each consecutive day for each user. For example, based on the table above, the user read two new books on 2022-07-16 and no new book on 2022-07-17, since they had already read that book on the previous day. Here is the expected outcome:
| UserId | 2022-07-16 | 2022-07-17 |
|---|---|---|
| 1 | 2 | 0 |
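For anyone who wants to reproduce this, the sample table can be rebuilt roughly as follows. This is only a minimal sketch: the variable name `df` is an assumption, and the dates are kept as strings exactly as they appear in the table.

```python
import pandas as pd

# Sample data copied from the table above; `df` is an assumed variable name.
df = pd.DataFrame(
    {
        "UserId": [1, 1, 1, 1],
        "Date": ["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17"],
        "BookId": [10, 11, 12, 12],
    }
)

# Dates stay as strings here; convert with pd.to_datetime(df["Date"])
# wherever day arithmetic is needed.
print(df)
```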
I feel like this task could be done by grouping the data by UserId and Date and then using apply with a lambda function, but I could not get that to work. I ended up with the following code, which uses a for loop. Is there a way to achieve this without a loop, and with shorter code?
def findObjDiff(df):
    # For each pair of adjacent dates, count the objectives seen on the
    # later date that were not seen on the earlier one.
    dataDict = {}
    dates = sorted(df.Date.unique())
    for ix, d in enumerate(dates):
        ixNext = ix + 1
        if ixNext >= len(dates):  # no following date left
            break
        dateNext = dates[ixNext]
        objListPrev = set(df[df.Date == d].ObjectiveId)
        objListNext = set(df[df.Date == dateNext].ObjectiveId)
        dataDict[dateNext] = {'Different': len(objListNext - objListPrev)}
    return dataDict

# Define the function first, then apply it to each student's rows.
studentAnswers.groupby('StudentId').apply(findObjDiff)
Using duplicated and a pivot_table:
(df.assign(count=~df.duplicated(['UserId', 'BookId']))  # first time this user has this book
   .pivot_table(index='UserId', columns='Date', values='count', aggfunc='sum')
   .astype(int).reset_index().rename_axis(columns=None)
)
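To see what the duplicated step contributes before the pivot, here is a small sketch on the sample data. Keying the check on both `UserId` and `BookId` is an assumption made here so that it would also behave with several users; the names `first_seen` and `counts` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 1, 1, 1],
    "Date": ["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17"],
    "BookId": [10, 11, 12, 12],
})

# True the first time a (UserId, BookId) pair appears, False for repeats.
first_seen = ~df.duplicated(["UserId", "BookId"])
print(df.assign(first_seen=first_seen))

# Summing the flags per user and date gives the number of first-time books.
counts = first_seen.groupby([df["UserId"], df["Date"]]).sum()
print(counts)
```

Note that this counts a book only the first time it ever shows up for a user, which is not quite the same as comparing each day with the day before.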
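Considering only consecutive days for the duplicates (the per-date result is assigned back to the rows by position, so this relies on its length matching the number of rows, as it happens to with the single-user sample):
s = df.groupby(pd.to_datetime(df['Date']))['BookId'].agg(set)
(df.assign(count=(s - s.shift(1, freq='D')).str.len().to_numpy())
   .pivot_table(index='UserId', columns='Date', values='count', aggfunc='sum')
   .astype(int).reset_index().rename_axis(columns=None)
)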
Output:
   UserId  2022-07-15  2022-07-16  2022-07-17
0       1           1           2           0
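If the comparison should be strictly "was this book read by the same user on the previous calendar day", one possible alternative is a self-merge against the data shifted forward by one day. This is a sketch under assumptions, not the answer's method: the names `d`, `prev_day`, `merged` and `result` are made up here, and it has only been reasoned through on the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 1, 1, 1],
    "Date": ["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17"],
    "BookId": [10, 11, 12, 12],
})

# Work on real datetimes and drop exact repeats of the same reading.
d = df.assign(Date=pd.to_datetime(df["Date"])).drop_duplicates()

# Move every reading one day forward: a row in prev_day means
# "this user had this book on the previous day".
prev_day = d.assign(Date=d["Date"] + pd.Timedelta(days=1))

# Rows marked "left_only" have no matching reading on the day before.
merged = d.merge(prev_day, on=["UserId", "Date", "BookId"],
                 how="left", indicator=True)
merged["new"] = (merged["_merge"] == "left_only").astype(int)

# One row per user, one column per date.
result = merged.groupby(["UserId", "Date"])["new"].sum().unstack(fill_value=0)
print(result)
```

Every book on a user's first day counts as new here; dropping the first date column would give the layout asked for in the question.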
Here is a solution (it assumes Date has already been converted with pd.to_datetime, since it uses .dt.days on the differences):
(df.loc[df.sort_values('Date')
          .groupby(['UserId', 'BookId'])['Date']
          .transform(lambda x: x.diff().dt.days.ne(1))]
   .groupby(['UserId', 'Date'])['BookId'].nunique()
   .reindex(pd.MultiIndex.from_product([df['UserId'].unique(), df['Date'].unique()],
                                       names=['User', None]),
            fill_value=0)
   .unstack())
Output:
      2022-07-15  2022-07-16  2022-07-17
User
1              1           2           0