Generating a column showing the number of distinct values between consecutive days

Question

I have a pandas dataframe with the following format:

UserId	Date	BookId
1	2022-07-15	10
1	2022-07-16	11
1	2022-07-16	12
1	2022-07-17	12
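
For anyone who wants to reproduce this, the sample table can be built roughly as follows. This is only a sketch: the variable name `df` is an assumption, and `Date` is kept as plain strings exactly as shown in the table.

```python
import pandas as pd

# Sample data copied from the table above; `df` is an assumed variable name.
df = pd.DataFrame({
    "UserId": [1, 1, 1, 1],
    "Date": ["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17"],
    "BookId": [10, 11, 12, 12],
})
print(df)
```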

From this table, I want to obtain the number of new BookId values on each consecutive day for each user. For example, based on the table above, the user read two new books on 2022-07-16 and no new book on 2022-07-17, since the only book read that day had already been read on the previous day. Here is the expected outcome:

UserId	2022-07-16	2022-07-17
1	2	0

I feel this could be done by grouping the data by UserId and Date and then using apply with a lambda function, but I could not get it to work. I ended up with the following code, which uses a for loop. Is there a shorter way to achieve this without a loop?

def findObjDiff(df):
    # For each pair of consecutive dates, count the objectives seen on the
    # later date that were not seen on the earlier one.
    dataDict = {}
    dates = sorted(set(df.Date))
    for ix in range(len(dates) - 1):
        d, dateNext = dates[ix], dates[ix + 1]
        objListPrev = set(df[df.Date == d].ObjectiveId)
        objListNext = set(df[df.Date == dateNext].ObjectiveId)
        dataDict[dateNext] = {'Different': len(objListNext - objListPrev)}
    return dataDict

# studentAnswers uses StudentId/ObjectiveId where the example above
# uses UserId/BookId.
studentAnswers.groupby('StudentId').apply(findObjDiff)

mozway · Accepted Answer

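Using duplicated and a pivot_table, counting a book as new the first time a user reads it (rows are assumed to be in date order):

(df.assign(count=~df.duplicated(['UserId', 'BookId']))
   .pivot_table(index='UserId', columns='Date', values='count', aggfunc='sum', fill_value=0)
   .astype(int).reset_index().rename_axis(columns=None)
)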
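
To see what the duplicated step contributes, it can help to look at the intermediate flag before pivoting. The sketch below rebuilds the sample data (the name `df` is an assumption) and checks duplicates per user via the `subset` argument, which is an assumption about how multi-user data should be handled.

```python
import pandas as pd

# Sample data from the question; `df` is an assumed variable name.
df = pd.DataFrame({
    "UserId": [1, 1, 1, 1],
    "Date": ["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17"],
    "BookId": [10, 11, 12, 12],
})

# True on the first row in which a (UserId, BookId) pair appears,
# False on every later repeat of that pair.
is_new = ~df.duplicated(["UserId", "BookId"])
print(df.assign(count=is_new))

# Summing the flags per date gives the number of first-time books per day.
per_day = is_new.groupby(df["Date"]).sum()
print(per_day)
```

Because True counts as 1 when summed, the pivot_table call only has to add up the flags for each user and date.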

Considering only consecutive days for the duplicates, so that a book counts as new whenever it was not read on the previous day:

s = df.groupby(pd.to_datetime(df['Date']))['BookId'].agg(set)

(df.assign(count=(s-s.shift(1, freq='D')).str.len().to_numpy())
   .pivot_table(index='UserId', columns='Date', values='count', aggfunc='sum')
   .astype(int).reset_index().rename_axis(columns=None)
)

Output:

   UserId  2022-07-15  2022-07-16  2022-07-17
0       1           1           2           0
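
The set-based snippet above groups by Date only, so it treats all users together. A per-user variant could look like the sketch below. It is not the answer author's code: the second user (UserId 2), the merge-based lookup of the previous day, and the choice to count every book as new when there is no previous day are all assumptions made for illustration.

```python
import pandas as pd

# Sample data from the question plus a hypothetical second user (UserId 2).
df = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 2, 2],
    "Date": ["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17",
             "2022-07-16", "2022-07-17"],
    "BookId": [10, 11, 12, 12, 12, 13],
})

d = df.assign(Date=pd.to_datetime(df["Date"]))
# One set of books per user and day.
s = d.groupby(["UserId", "Date"])["BookId"].agg(set).reset_index()
# The same sets moved one day forward: what each user read the day before.
prev = s.assign(Date=s["Date"] + pd.Timedelta(days=1))
m = s.merge(prev, on=["UserId", "Date"], how="left", suffixes=("", "_prev"))
# Books read today that were not read yesterday. With no previous day,
# every book counts as new (an assumption about the desired behaviour).
m["count"] = [
    len(cur - old) if isinstance(old, set) else len(cur)
    for cur, old in zip(m["BookId"], m["BookId_prev"])
]
out = (
    m.assign(Date=m["Date"].dt.strftime("%Y-%m-%d"))
     .pivot_table(index="UserId", columns="Date", values="count",
                  aggfunc="sum", fill_value=0)
     .astype(int)
)
print(out)
```

Merging on both UserId and Date keeps each user's history separate, and `fill_value=0` covers days on which a user has no rows at all.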

rhug123 · Answer

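Another option is to drop the rows where the same user read the same book on the previous day, then count the distinct books left per user and date. This assumes Date is a datetime column, since it relies on .dt: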

(df.loc[df.sort_values('Date')
.groupby(['UserId','BookId'])['Date'].transform(lambda x: x.diff().dt.days.ne(1))]
.groupby(['UserId','Date'])['BookId'].nunique()
.reindex(pd.MultiIndex.from_product([df['UserId'].unique(),df['Date'].unique()],names = ['User',None]),fill_value=0)
.unstack())

Output:

      2022-07-15  2022-07-16  2022-07-17
User                                    
1              1           2           0
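
The filtering step in this answer can be inspected on its own. The sketch below is a simplified illustration rather than the answer's exact code: it uses groupby(...).diff() in place of transform with a lambda, and the names `df`, `gap` and `keep` are assumptions.

```python
import pandas as pd

# Sample data from the question; `df` is an assumed variable name.
df = pd.DataFrame({
    "UserId": [1, 1, 1, 1],
    "Date": ["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17"],
    "BookId": [10, 11, 12, 12],
})
df["Date"] = pd.to_datetime(df["Date"])  # .dt requires a datetime column

# Days since the same user last read the same book (NaN on the first read).
gap = (df.sort_values("Date")
         .groupby(["UserId", "BookId"])["Date"]
         .diff()
         .dt.days)
# Keep a row unless the same book was read exactly one day earlier.
keep = gap.ne(1)
print(df.assign(gap=gap, keep=keep))
```

Only the last row (BookId 12 on 2022-07-17) is dropped, because the same book was read one day earlier; the remaining rows are then counted per user and date.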

Tags: python, pandas, dataframe

renakre
