
Generating a column showing the number of distinct values between consecutive days

I have a pandas dataframe with the following format:

UserId Date BookId
1 2022-07-15 10
1 2022-07-16 11
1 2022-07-16 12
1 2022-07-17 12

From this table, I want to obtain, for each user and each day, the number of BookId values that are new compared with the previous day. For example, based on the table above, the user read two new books on 2022-07-16 and no new books on 2022-07-17, since the only book read that day had already been read the day before. Here is the expected outcome:

UserId 2022-07-16 2022-07-17
1 2 0

I feel this could be done by grouping the data by UserId and Date and then using apply with a lambda function, but I could not get that to work. I ended up with the following code, which uses a for loop. Is there a shorter way to achieve this without a loop?

def findObjDiff(df):
    # Map each date to the number of objectives that were not seen
    # on the previous recorded date for this student.
    dataDict = {}
    dates = sorted(df.Date.unique())  # unique dates, in ascending order
    for d, dateNext in zip(dates, dates[1:]):
        objListPrev = set(df[df.Date == d].ObjectiveId)
        objListNext = set(df[df.Date == dateNext].ObjectiveId)
        # New objectives are those in the later set but not in the earlier one
        dataDict[dateNext] = {'Different': len(objListNext - objListPrev)}
    return dataDict

# The function has to be defined before it is applied to each student's rows
studentAnswers.groupby('StudentId').apply(findObjDiff)
renakre asked Jan 18 '26 00:01


2 Answers

Using duplicated and a pivot_table:

(df.assign(count=~df['BookId'].duplicated())
   .pivot_table(index='UserId', columns='Date', values='count', aggfunc='sum')
   .astype(int).reset_index().rename_axis(columns=None)
)
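
Note that `df['BookId'].duplicated()` flags repeats across the whole frame, so with several users a book already read by someone else would not count as new. Below is a minimal sketch of keying the check on both columns instead. The second user is hypothetical and not part of the question's data:

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row for a second user
df = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 2],
    "Date": ["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17", "2022-07-17"],
    "BookId": [10, 11, 12, 12, 10],
})

# True on the first occurrence of a BookId anywhere in the frame
global_first = ~df["BookId"].duplicated()

# True on the first occurrence of a BookId for that particular user
per_user_first = ~df.duplicated(["UserId", "BookId"])

print(df.assign(global_first=global_first, per_user_first=per_user_first))
```

Only the per-user version treats book 10 as new for user 2. Either mask can be passed to `assign(count=...)` before the pivot.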

Considering only consecutive days for the duplicates:

s = df.groupby(pd.to_datetime(df['Date']))['BookId'].agg(set)

(df.assign(count=(s-s.shift(1, freq='D')).str.len().to_numpy())
   .pivot_table(index='UserId', columns='Date', values='count', aggfunc='sum')
   .astype(int).reset_index().rename_axis(columns=None)
)

Output:

   UserId  2022-07-15  2022-07-16  2022-07-17
0       1           1           2           0
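
For the consecutive-day case, an alternative to the set subtraction that also keeps users separate is a self-merge: shift every read forward by one day and check which reads have no match. This is only a sketch, assuming `Date` is parsed as datetime and that "consecutive" means calendar days; the names `reads`, `prev`, `merged` and `out` are illustrative.

```python
import pandas as pd

# Sample data from the question, with Date parsed as datetime
df = pd.DataFrame({
    "UserId": [1, 1, 1, 1],
    "Date": pd.to_datetime(["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17"]),
    "BookId": [10, 11, 12, 12],
})

# One row per (user, day, book)
reads = df.drop_duplicates(["UserId", "Date", "BookId"])

# Move each read one day forward: a row here means "read the day before"
prev = reads.assign(Date=reads["Date"] + pd.Timedelta(days=1))

# Reads with no match in prev were not read by that user on the previous day
merged = reads.merge(prev, on=["UserId", "Date", "BookId"],
                     how="left", indicator=True)
merged["new"] = (merged["_merge"] == "left_only").astype(int)

# Count new books per user and day, one column per date
out = merged.groupby(["UserId", "Date"])["new"].sum().unstack(fill_value=0)
print(out)
```

On the sample data this gives 1, 2 and 0 for the three dates. Every book counts as new on a user's first day, so that column can be dropped if only day-to-day changes are wanted.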
mozway answered Jan 20 '26 14:01


Here is a solution:

(df.loc[df.sort_values('Date')
.groupby(['UserId','BookId'])['Date'].transform(lambda x: x.diff().dt.days.ne(1))]
.groupby(['UserId','Date'])['BookId'].nunique()
.reindex(pd.MultiIndex.from_product([df['UserId'].unique(),df['Date'].unique()],names = ['User',None]),fill_value=0)
.unstack())

Output:

      2022-07-15  2022-07-16  2022-07-17
User                                    
1              1           2           0
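
To see what the filter in this solution does, it can help to look at the intermediate mask on its own. The sketch below assumes `Date` is a datetime column (the `.dt` accessor requires it); `gap` and `is_new` are illustrative names.

```python
import pandas as pd

# Sample data from the question, with Date parsed as datetime
df = pd.DataFrame({
    "UserId": [1, 1, 1, 1],
    "Date": pd.to_datetime(["2022-07-15", "2022-07-16", "2022-07-16", "2022-07-17"]),
    "BookId": [10, 11, 12, 12],
})

# Days since the same user last read the same book (NaN for a first read)
gap = (df.sort_values("Date")
         .groupby(["UserId", "BookId"])["Date"]
         .diff()
         .dt.days)

# A read counts as new unless the previous read was exactly one day earlier
is_new = gap.ne(1)

print(df.assign(gap_days=gap, is_new=is_new))
```

Only the last row, book 12 on 2022-07-17, has a gap of exactly one day, so it is the only read filtered out before counting.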
rhug123 answered Jan 20 '26 14:01


