
Pandas group by timestamp and id and count

I have a dataframe in the following format:

import pandas as pd

d1 = {
    'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C'],
    'Time': ['1/18/2016', '2/17/2016', '2/16/2016', '1/15/2016', '2/14/2016',
             '2/13/2016', '1/12/2016', '2/9/2016', '1/11/2016'],
    'Product_ID': ['2', '1', '1', '1', '1', '2', '1', '2', '2'],
    'Var_1': [0.11, 0.22, 0.09, 0.07, 0.4, 0.51, 0.36, 0.54, 0.19],
    'Var_2': [1, 0, 1, 0, 1, 0, 1, 0, 1],
    'Var_3': ['1', '1', '1', '1', '0', '1', '1', '0', '0'],
}
df1 = pd.DataFrame(d1)

Where df1 is of the form:

ID  Time        Product_ID  Var_1   Var_2   Var_3
A   1/18/2016   2           0.11    1       1
A   2/17/2016   1           0.22    0       1
A   2/16/2016   1           0.09    1       1
B   1/15/2016   1           0.07    0       1
B   2/14/2016   1           0.4     1       0
B   2/13/2016   2           0.51    0       1
B   1/12/2016   1           0.36    1       1
B   2/9/2016    2           0.54    0       0
C   1/11/2016   2           0.19    1       0

where Time is in 'M/D/YYYY' format (month first, no leading zeros).
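As a minimal sketch of how such strings can be turned into month labels, assuming the month/day/year reading described above: the names `times`, `parsed` and `month_labels` are only illustrative and are not part of the original question.

```python
import pandas as pd

# A few Time values copied from the sample data
times = pd.Series(['1/18/2016', '2/17/2016', '2/9/2016'])

# %m and %d also accept values without leading zeros when parsing
parsed = pd.to_datetime(times, format='%m/%d/%Y')

# Convert each timestamp to a monthly period and render it as text
month_labels = parsed.dt.to_period('M').astype(str).tolist()
print(month_labels)  # ['2016-01', '2016-02', '2016-02']
```

Passing an explicit `format` avoids any ambiguity between month-first and day-first dates such as 2/9/2016.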

This is what I have to do:

1) I would like to group the rows by ID and Product_ID for each month of Time.

2) I then want to carry out the following column operations:
a) find the sum of each of the columns Var_2 and Var_3, and
b) find the mean of the column Var_1.

3) Then, I would like to add a Count column with the number of rows for each ID and Product_ID in each month.

4) Finally, I would like to add rows for the ID and Product_ID combinations that have no entries in a given month.

For example, for ID = A and Product ID = 1 in Time = 2016-1 (January 2016), there are no observations and thus all variables take the value of 0.

Again, for ID = A and Product ID = 1 in Time = 2016-2 (February 2016),
Var_1 = (.22+.09)/2 = 0.155
Var_2 = 1,
Var_3 = 1+1=2
and finally Count = 2.
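The arithmetic in this example can be checked with a small sketch. The two rows are copied from the sample data for ID = A, Product_ID = 1 in February 2016; the names `rows` and `summary` are hypothetical and only used for illustration.

```python
import pandas as pd

# The two February 2016 rows for ID = A, Product_ID = 1
rows = pd.DataFrame({
    'Var_1': [0.22, 0.09],
    'Var_2': [0, 1],
    'Var_3': [1, 1],
})

summary = {
    'Var_1': rows['Var_1'].mean(),      # mean of Var_1
    'Var_2': int(rows['Var_2'].sum()),  # sum of Var_2
    'Var_3': int(rows['Var_3'].sum()),  # sum of Var_3
    'Count': len(rows),                 # number of rows in the group
}
print(summary)
```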

This is the output that I would like.

ID  Product_ID  Time    Var_1   Var_2   Var_3   Count
A   1           2016-1  0       0       0       0
A   1           2016-2  0.155   1       2       2
B   1           2016-1  0.215   1       2       2
B   1           2016-2  0.4     1       0       1
C   1           2016-1  0       0       0       0
C   1           2016-2  0       0       0       0
A   2           2016-1  0.11    1       1       1
A   2           2016-2  0       0       0       0
B   2           2016-1  0       0       0       0
B   2           2016-2  0.525   0       1       2
C   2           2016-1  0.19    1       0       1
C   2           2016-2  0       0       0       0

This is a little beyond my programming capabilities (I know the groupby function exists, but I could not figure out how to incorporate the rest of the changes). Please let me know if you have questions.

Any help will be appreciated. Thanks.

asked Sep 15 '25 15:09 by Prometheus


1 Answer

I'll break this down into steps.

# Parse Time and collapse it to a YYYYMM integer month key
df1['Time'] = pd.to_datetime(df1['Time'], format='%m/%d/%Y')
df1['Time'] = df1['Time'].dt.year * 100 + df1['Time'].dt.month
df1['Var_3'] = df1['Var_3'].astype(int)  # Var_3 is stored as strings

# Aggregate per (ID, Product_ID, month); Count is the number of rows in each group
output = df1.groupby(['ID', 'Product_ID', 'Time']).agg(
    Var_1=('Var_1', 'mean'), Var_2=('Var_2', 'sum'),
    Var_3=('Var_3', 'sum'), Count=('Var_1', 'size'))

# Add every missing ID x Product_ID x month combination, filled with 0
full = pd.MultiIndex.from_product(output.index.levels, names=output.index.names)
output = output.reindex(full, fill_value=0)

output.reset_index().sort_values(['Product_ID', 'ID'])


Out[1032]: 
   ID Product_ID    Time  Var_1  Var_2  Var_3  Count
0   A          1  201601  0.000      0      0      0
1   A          1  201602  0.155      1      2      2
4   B          1  201601  0.215      1      2      2
5   B          1  201602  0.400      1      0      1
8   C          1  201601  0.000      0      0      0
9   C          1  201602  0.000      0      0      0
2   A          2  201601  0.110      1      1      1
3   A          2  201602  0.000      0      0      0
6   B          2  201601  0.000      0      0      0
7   B          2  201602  0.525      0      1      2
10  C          2  201601  0.190      1      0      1
11  C          2  201602  0.000      0      0      0
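
The Time column in the output above is an integer key such as 201601, while the question asks for labels like 2016-1. A minimal sketch of one way to convert between the two, assuming the key is always year * 100 + month; the names `keys` and `labels` are only illustrative.

```python
import pandas as pd

# Integer month keys in YYYYMM form (the last one is an extra example value)
keys = pd.Series([201601, 201602, 201512])

# Integer division recovers the year, the remainder recovers the month
labels = (keys // 100).astype(str) + '-' + (keys % 100).astype(str)
print(labels.tolist())  # ['2016-1', '2016-2', '2015-12']
```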
answered Sep 18 '25 04:09 by BENY