 

How to groupby two fields in pandas?

Given the following input, the goal is to group values by hour for each Date, with Avg and Sum aggregations. A solution for grouping by hour alone is available, but it does not take new days into account. (A runnable reproduction of the sample is sketched after the expected output below.)

Date        Time    F1  F2  F3
21-01-16    8:11    5   2   4
21-01-16    9:25    9   8   2
21-01-16    9:39    7   3   2
21-01-16    9:53    6   5   1
21-01-16    10:07   4   6   7
21-01-16    10:21   7   3   1
21-01-16    10:35   5   6   7
21-01-16    11:49   1   2   1
21-01-16    12:03   3   3   1
22-01-16    9:45    6   5   1
22-01-16    9:20    4   6   7
22-01-16    12:10   7   3   1

Expected output:

Date,Time,SUM F1,SUM F2,SUM F3,AVG F1,AVG F2,AVG F3
21-01-16,8:00,5,2,4,5,2,4
21-01-16,9:00,22,16,5,7.3,5.3,1.6
21-01-16,10:00,16,15,15,5.3,5,5
21-01-16,11:00,1,2,1,1,2,1
21-01-16,12:00,3,3,1,3,3,1
22-01-16,9:00,10,11,8,5,5.5,4
22-01-16,12:00,7,3,1,7,3,1
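
For anyone who wants to run the answers below locally, here is a minimal sketch that loads the sample above from an in-memory string. The file name f123_dates.csv used in the answers is an assumption about how this data is stored on disk; the in-memory route below sidesteps that.

# Hedged sketch: build the sample DataFrame from a string instead of a file.
import io
import pandas as pd

sample = """\
Date Time F1 F2 F3
21-01-16 8:11 5 2 4
21-01-16 9:25 9 8 2
21-01-16 9:39 7 3 2
21-01-16 9:53 6 5 1
21-01-16 10:07 4 6 7
21-01-16 10:21 7 3 1
21-01-16 10:35 5 6 7
21-01-16 11:49 1 2 1
21-01-16 12:03 3 3 1
22-01-16 9:45 6 5 1
22-01-16 9:20 4 6 7
22-01-16 12:10 7 3 1
"""

df = pd.read_csv(io.StringIO(sample), sep=r'\s+')
print(df.head())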
asked by user52028778

2 Answers

You can parse the dates while reading the CSV file:

from __future__ import print_function  # make print() work with Python 2 and 3
import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
print(df.groupby([df.index, df.Time.dt.hour]).agg(['mean', 'sum']))

Output:

                       F1            F2            F3    
                     mean sum      mean sum      mean sum
Date       Time                                          
2016-01-21 8     5.000000   5  2.000000   2  4.000000   4
           9     7.333333  22  5.333333  16  1.666667   5
           10    5.333333  16  5.000000  15  5.000000  15
           11    1.000000   1  2.000000   2  1.000000   1
           12    3.000000   3  3.000000   3  1.000000   1
2016-01-22 9     5.000000  10  5.500000  11  4.000000   8
           12    7.000000   7  3.000000   3  1.000000   1

Going all the way to CSV:

from __future__ import print_function
import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
df2 = df.groupby([df.index, df.Time.dt.hour]).agg(['mean', 'sum'])
df3 = df2.reset_index()
# Flatten the two-level column index into single names like "F1 mean".
df3.columns = [' '.join(col).strip() for col in df3.columns.values]
print(df3.to_csv(columns=df3.columns, index=False))

Output:

Date,Time,F1 mean,F1 sum,F2 mean,F2 sum,F3 mean,F3 sum
2016-01-21,8,5.0,5,2.0,2,4.0,4
2016-01-21,9,7.333333333333333,22,5.333333333333333,16,1.6666666666666667,5
2016-01-21,10,5.333333333333333,16,5.0,15,5.0,15
2016-01-21,11,1.0,1,2.0,2,1.0,1
2016-01-21,12,3.0,3,3.0,3,1.0,1
2016-01-22,9,5.0,10,5.5,11,4.0,8
2016-01-22,12,7.0,7,3.0,3,1.0,1 
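
If the exact header layout from the question is wanted (SUM/AVG prefixes, sums before averages, and the hour printed as 8:00), the flattened frame can be reshaped a little further. A minimal sketch, assuming df3 from the snippet above; the helper names out and cols are just for illustration:

# Rename "F1 mean" -> "AVG F1" and "F1 sum" -> "SUM F1",
# and print the hour as e.g. "8:00" to mirror the expected output.
out = df3.copy()
out.columns = ['AVG ' + c[:-5] if c.endswith(' mean')
               else 'SUM ' + c[:-4] if c.endswith(' sum')
               else c
               for c in out.columns]
out['Time'] = out['Time'].astype(str) + ':00'
# Put the sums before the averages, as in the question.
cols = (['Date', 'Time']
        + [c for c in out.columns if c.startswith('SUM ')]
        + [c for c in out.columns if c.startswith('AVG ')])
print(out[cols].to_csv(index=False))

The averages are still printed with full float precision; the question's 7.3 etc. look rounded, so apply .round(1) to the AVG columns if that shorter form is preferred.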
answered by Mike Müller

You can convert Time to datetime with to_datetime and then use groupby with agg:

print(df)
         Date   Time  F1  F2  F3
0  2016-01-21   8:11   5   2   4
1  2016-01-21   9:25   9   8   2
2  2016-01-21   9:39   7   3   2
3  2016-01-21   9:53   6   5   1
4  2016-01-21  10:07   4   6   7
5  2016-01-21  10:21   7   3   1
6  2016-01-21  10:35   5   6   7
7  2016-01-21  11:49   1   2   1
8  2016-01-21  12:03   3   3   1
9  2016-01-22   9:45   6   5   1
10 2016-01-22   9:20   4   6   7
11 2016-01-22  12:10   7   3   1

df['Time'] = pd.to_datetime(df['Time'], format="%H:%M")
print(df)
         Date                Time  F1  F2  F3
0  2016-01-21 1900-01-01 08:11:00   5   2   4
1  2016-01-21 1900-01-01 09:25:00   9   8   2
2  2016-01-21 1900-01-01 09:39:00   7   3   2
3  2016-01-21 1900-01-01 09:53:00   6   5   1
4  2016-01-21 1900-01-01 10:07:00   4   6   7
5  2016-01-21 1900-01-01 10:21:00   7   3   1
6  2016-01-21 1900-01-01 10:35:00   5   6   7
7  2016-01-21 1900-01-01 11:49:00   1   2   1
8  2016-01-21 1900-01-01 12:03:00   3   3   1
9  2016-01-22 1900-01-01 09:45:00   6   5   1
10 2016-01-22 1900-01-01 09:20:00   4   6   7
11 2016-01-22 1900-01-01 12:10:00   7   3   1
df = df.groupby([df['Date'], df['Time'].dt.hour]).agg(['mean','sum']).reset_index()
print(df)
        Date Time        F1            F2            F3    
                       mean sum      mean sum      mean sum
0 2016-01-21    8  5.000000   5  2.000000   2  4.000000   4
1 2016-01-21    9  7.333333  22  5.333333  16  1.666667   5
2 2016-01-21   10  5.333333  16  5.000000  15  5.000000  15
3 2016-01-21   11  1.000000   1  2.000000   2  1.000000   1
4 2016-01-21   12  3.000000   3  3.000000   3  1.000000   1
5 2016-01-22    9  5.000000  10  5.500000  11  4.000000   8
6 2016-01-22   12  7.000000   7  3.000000   3  1.000000   1

Then you can flatten the column names with a list comprehension:

# Join the two column levels into names like "F1 mean"; strip() removes
# the trailing space left on single-level names such as "Date".
df.columns = [' '.join(col).strip() for col in df.columns.values]
print(df)

        Date   Time   F1 mean  F1 sum   F2 mean  F2 sum   F3 mean  F3 sum
0 2016-01-21      8  5.000000       5  2.000000       2  4.000000       4
1 2016-01-21      9  7.333333      22  5.333333      16  1.666667       5
2 2016-01-21     10  5.333333      16  5.000000      15  5.000000      15
3 2016-01-21     11  1.000000       1  2.000000       2  1.000000       1
4 2016-01-21     12  3.000000       3  3.000000       3  1.000000       1
5 2016-01-22      9  5.000000      10  5.500000      11  4.000000       8
6 2016-01-22     12  7.000000       7  3.000000       3  1.000000       1
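
As a hedged aside, not part of the answer above: the two columns can also be merged into a single timestamp and floored to the hour, which avoids both the placeholder 1900-01-01 date and the two-level group key. The file name f123_dates.csv is taken from the first answer, and DateTime is a hypothetical helper column:

import pandas as pd

# Read the raw strings and build one real timestamp per row.
raw = pd.read_csv('f123_dates.csv', sep=r'\s+')
raw['DateTime'] = pd.to_datetime(raw['Date'] + ' ' + raw['Time'], dayfirst=True)

# Floor each timestamp to the start of its hour and aggregate;
# only hours that actually occur in the data become groups.
hourly = (raw.groupby(raw['DateTime'].dt.floor('H'))[['F1', 'F2', 'F3']]
             .agg(['mean', 'sum']))
print(hourly)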
answered by jezrael

