Python grouping and getting the Average, Minimum and Maximum values

Question

I have a csv data set that looks like this:

Class,  Code,   Vendor, State,  NumberOfDays
3,      123,    Name1,  NE,     12.58402778
1,      876,    Name2,  TX,     12.51041667
3,      123,    Name1,  NE,     2.354166667
1,      876,    Name2,  TX,     12.21111111
3,      456,    Name2,  NY,     6.346527778
2,      876,    Name1,  NY,     5.513194444
3,      123,    Name1,  NE,     5.38125
1,      876,    Name2,  TX,     5.409722222

I have code the following code:

df = pd.read_csv(r'C:\Python36\Data	esting\LowHighMean.csv')
df2 = df.groupby(['Class','Code','Vendor','State'])['NumberOfDays'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv(r'C:\Python36\Data	esting\output.csv')

Which works great in getting me the average 'NumberOfDays' by grouping the other fields:

1,876,Name2,TX,10.04
2,876,Name1,NY,5.51
3,123,Name1,NE,6.77
3,456,Name2,NY,6.35

I cannot seem to carry over the headers but it is not a big deal I just put the headers in on another step. The issue I am trying to solve for is to add columns which would provide the lowest min() and highest max() value. I am looking to create this:

Class,  Code,   Vendor, State,  AverageDays, LowestNumberOfDays,    HighestNumberOfDays
1,      876,    Name2,  TX,     10.04,       5.41                   12.51             
2,      876,    Name1,  NY,     5.51,        5.51                   5.51
3,      123,    Name1,  NE,     6.77,        2.35                   12.58
3,      456,    Name2,  NY,     6.35,        6.35                   6.35

The size of the starting data files are over 3 gig and over 30 million records. The file size becomes much smaller after the transformation. Due to the starting file size I am trying to figure out a way where I can avoid doing this in four different steps. 3 separate steps/runs to get mean(), max() and min() and then a forth run to combine them. Since I am a noob I don't even know how to do this with out setting up 4 sets of code and running the file 4 individual times.

jezrael · Accepted Answer

Use aggregation by agg, then is necessary rename columns:

d = {'mean':'AverageDays','min':'LowestNumberOfDays','max':'HighestNumberOfDays'}
df = (df.groupby(['Class','Code','Vendor','State'])['NumberOfDays']
        .agg(['mean','min','max'])
        .rename(columns=d)
        .reset_index())
print (df)
   Class  Code Vendor State  AverageDays  LowestNumberOfDays  \
0      1   876  Name2    TX    10.043750            5.409722   
1      2   876  Name1    NY     5.513194            5.513194   
2      3   123  Name1    NE     6.773148            2.354167   
3      3   456  Name2    NY     6.346528            6.346528   

   HighestNumberOfDays  
0            12.510417  
1             5.513194  
2            12.584028  
3             6.346528

Thanks for alternative solution, Bharath shetty:

df = df.pivot_table(index=['Class','Code','Vendor','State'],
                    value‌s='NumberOfDays',
                    agg‌func=('min','mean','‌max'))
        .rename(column‌s=d)
        .reset_index()

Python grouping and getting the Average, Minimum and Maximum values

Tags:

python

pandas

csv

John Minze

1 Answers

jezrael

Recent Activity

Donate For Us

Python grouping and getting the Average, Minimum and Maximum values

Tags:

python

pandas

csv

John Minze

1 Answers

jezrael

Related questions

Recent Activity

Donate For Us