I have a csv data set that looks like this:
Class, Code, Vendor, State, NumberOfDays
3, 123, Name1, NE, 12.58402778
1, 876, Name2, TX, 12.51041667
3, 123, Name1, NE, 2.354166667
1, 876, Name2, TX, 12.21111111
3, 456, Name2, NY, 6.346527778
2, 876, Name1, NY, 5.513194444
3, 123, Name1, NE, 5.38125
1, 876, Name2, TX, 5.409722222
I have code the following code:
df = pd.read_csv(r'C:\Python36\Data\testing\LowHighMean.csv')
df2 = df.groupby(['Class','Code','Vendor','State'])['NumberOfDays'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv(r'C:\Python36\Data\testing\output.csv')
Which works great in getting me the average 'NumberOfDays' by grouping the other fields:
1,876,Name2,TX,10.04
2,876,Name1,NY,5.51
3,123,Name1,NE,6.77
3,456,Name2,NY,6.35
I cannot seem to carry over the headers but it is not a big deal I just put the headers in on another step. The issue I am trying to solve for is to add columns which would provide the lowest min() and highest max() value. I am looking to create this:
Class, Code, Vendor, State, AverageDays, LowestNumberOfDays, HighestNumberOfDays
1, 876, Name2, TX, 10.04, 5.41 12.51
2, 876, Name1, NY, 5.51, 5.51 5.51
3, 123, Name1, NE, 6.77, 2.35 12.58
3, 456, Name2, NY, 6.35, 6.35 6.35
The size of the starting data files are over 3 gig and over 30 million records. The file size becomes much smaller after the transformation. Due to the starting file size I am trying to figure out a way where I can avoid doing this in four different steps. 3 separate steps/runs to get mean(), max() and min() and then a forth run to combine them. Since I am a noob I don't even know how to do this with out setting up 4 sets of code and running the file 4 individual times.
Use aggregation by agg, then is necessary rename columns:
d = {'mean':'AverageDays','min':'LowestNumberOfDays','max':'HighestNumberOfDays'}
df = (df.groupby(['Class','Code','Vendor','State'])['NumberOfDays']
.agg(['mean','min','max'])
.rename(columns=d)
.reset_index())
print (df)
Class Code Vendor State AverageDays LowestNumberOfDays \
0 1 876 Name2 TX 10.043750 5.409722
1 2 876 Name1 NY 5.513194 5.513194
2 3 123 Name1 NE 6.773148 2.354167
3 3 456 Name2 NY 6.346528 6.346528
HighestNumberOfDays
0 12.510417
1 5.513194
2 12.584028
3 6.346528
Thanks for alternative solution, Bharath shetty:
df = df.pivot_table(index=['Class','Code','Vendor','State'],
values='NumberOfDays',
aggfunc=('min','mean','max'))
.rename(columns=d)
.reset_index()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With