
Pandas: How to aggregate hourly counts from start and end times

Tags: python, pandas

I have a dataframe with start and end time for each unique rating ID.

import pandas as pd

d = {'ID': ['01', '02', '03', '04', '05', '06'],
     'Hour Start': [5, 9, 13, 15, 20, 23],
     'Hour End': [6, 9, 15, 19, 0, 2]}
df = pd.DataFrame(data=d)

My goal is to count how many ratings were active in each hour across the whole dataset. For example, ID 01 started at 5 am and ended at 6 am, so 5 am and 6 am should each get 1 count.

For ID 06, the rating started at 11 pm and ended the next day at 2 am, so every hour from 11 pm through 2 am should get 1 count.

I want to output an hourly summary table like the one below.

[Image: hourly summary table]

I have been thinking about a solution for a while.

Any help would be much appreciated! Thank you!

C4TNT asked Sep 06 '25 03:09



2 Answers

You can convert both the hour start and hour end columns to datetime, then compute the time difference between them. Finally, convert that difference to hours by dividing its seconds by 3600:

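# Parse the integer hours as datetimes (the date part defaults to 1900-01-01)
df['Hours_s'] = pd.to_datetime(df['Hour Start'], format='%H')
df['Hours_e'] = pd.to_datetime(df['Hour End'], format='%H')
# Negative for overnight ranges, where the end hour is smaller than the start hour
df['delta'] = df['Hours_e'] - df['Hours_s']
# .seconds is always non-negative, so overnight ranges still get a positive hour count
df['count'] = df['delta'].dt.seconds // 3600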

Output:

   Hour Start  Hour End  count
0           5         6      1
1           9         9      0
2          13        15      2
3          15        19      4
4          20         0      4
5          23         2      3
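
The count of 4 for the 20 to 0 row may look surprising, since the raw difference is -20 hours. It comes from how pandas stores a negative Timedelta: a negative number of days plus a non-negative seconds remainder. A minimal check, using values chosen only for illustration:

```python
import pandas as pd

# 20:00 -> 00:00 gives a raw difference of -20 hours
delta = pd.Timedelta(hours=0) - pd.Timedelta(hours=20)

# pandas normalizes -20 hours to "-1 day + 4 hours"
print(delta.days)             # -1
print(delta.seconds // 3600)  # 4
```

This is also why checking `delta.days != 0` is enough to detect an overnight range here.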

UPDATE:

final_tab = pd.DataFrame({"Hour": range(0,24), "Count": [0]*24})

for i, row in df.iterrows():
    start, end = row["Hour Start"], row["Hour End"]
    if row["delta"].days != 0:
        # Overnight range: count start..23, then 0..end (end hour included)
        final_tab.iloc[start:24, 1] += 1
        final_tab.iloc[0:end + 1, 1] += 1
    else:
        # Same-day range: count start..end (end hour included)
        final_tab.iloc[start:end + 1, 1] += 1

Output:

print(final_tab)
   Hour Count
0   0   2
1   1   1
2   2   1
3   3   0
4   4   0
5   5   1
6   6   1
7   7   0
8   8   0
9   9   1
10  10  0
11  11  0
12  12  0
13  13  1
14  14  1
15  15  2
16  16  1
17  17  1
18  18  1
19  19  1
20  20  1
21  21  1
22  22  1
23  23  2
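
As a cross-check on the table above, the same counts can be reproduced with modular arithmetic instead of datetimes. This is only a sketch, not part of the answer's method: the names `span`, `counts`, and `hourly` are made up for illustration, and it assumes that a row whose start and end hours are equal covers a single hour rather than a full day.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['01', '02', '03', '04', '05', '06'],
                   'Hour Start': [5, 9, 13, 15, 20, 23],
                   'Hour End': [6, 9, 15, 19, 0, 2]})

# Number of hours after the start hour; % 24 wraps overnight ranges (20 -> 0 is 4)
span = (df['Hour End'] - df['Hour Start']) % 24

counts = np.zeros(24, dtype=int)
for start, n in zip(df['Hour Start'], span):
    # Every hour from start to end inclusive, wrapped onto the 0..23 clock
    hours = (start + np.arange(n + 1)) % 24
    counts[hours] += 1

hourly = pd.Series(counts, index=pd.RangeIndex(24, name='Hour'), name='Count')
print(hourly)
```

For the sample data, `hourly` should match the Count column printed above.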
DavideBrex answered Sep 07 '25 19:09



IIUC, you can do it like this using pd.to_datetime and pd.date_range:

import numpy as np
import pandas as pd

#Convert hours to datetime
df['endTime'] = pd.to_datetime(df['Hour End'], format='%H')
df['startTime'] = pd.to_datetime(df['Hour Start'], format='%H')

#If 'Hour End' is less than 'Hour Start', assume the end falls on the next day
df['endTime'] = np.where(df['Hour End'] < df['Hour Start'], 
                         df['endTime']+pd.Timedelta(days=1), 
                         df['endTime'])

#Create a series of hours per defined ranges ('Hour Start' to 'Hour End')
#freq='h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas
df_hourly = df.apply(lambda x: pd.Series(pd.date_range(x['startTime'], 
                                                       x['endTime'], 
                                                       freq='h')), 
                                         axis=1)\
              .stack().dt.hour

#Use value counts to count the hours and reindex to 24-hour day to fill missing hours.
df_hourly.value_counts().reindex(np.arange(0,24)).fillna(0).astype(int)

Output:

0     2
1     1
2     1
3     0
4     0
5     1
6     1
7     0
8     0
9     1
10    0
11    0
12    0
13    1
14    1
15    2
16    1
17    1
18    1
19    1
20    1
21    1
22    1
23    2
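
To see why the one-day shift matters for overnight rows, here is a small check for ID 06 (23 to 2). It is a sketch with hard-coded timestamps; the 1900-01-01 date is the default that `pd.to_datetime` fills in when the format is `'%H'`, and the variable names are illustrative only.

```python
import pandas as pd

start = pd.Timestamp('1900-01-01 23:00')
# End hour 2 is smaller than start hour 23, so move the end to the next day
end = pd.Timestamp('1900-01-01 02:00') + pd.Timedelta(days=1)

# Both endpoints are included in the generated range
hours = list(pd.date_range(start, end, freq='h').hour)
print(hours)  # [23, 0, 1, 2]
```

Without the shift, the end would fall before the start and `pd.date_range` would return an empty range, so the rating would contribute no counts.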

Alternatively, using explode and value_counts:

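hours = df.apply(lambda x: pd.date_range(x['startTime'], 
                                         x['endTime'], 
                                         freq='h'), axis=1)\
          .explode()

#explode() can return object dtype, so convert back to datetime before using .dt
pd.to_datetime(hours).dt.hour.value_counts()\
  .reindex(np.arange(0,24), fill_value=0)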
Scott Boston answered Sep 07 '25 21:09
