I am seeing what looks to me like very strange behaviour with some data loaded into pandas from a CSV file. To protect the innocent, let's say the DataFrame is in the variable homes and has, among others, the columns below:
In [143]: homes[['zipcode', 'sqft', 'price']].dtypes
Out[143]:
zipcode     int64
sqft        int64
price       int64
dtype: object
To get the average price in each zipcode, I tried:
In [146]: homes.groupby('zipcode')[['price']].mean().head(n=5)
Out[146]:
           price
zipcode
28001     280804
28002     234284
28003     294111
28004    1355927
28005     810164
Strangely enough, the price mean is an int64 as shown by:
In [147]: homes.groupby('zipcode')[['price']].mean().dtypes
Out[147]:
price    int64
dtype: object
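One way to confirm that the int64 result really has lost its fractional part is to compute the mean by hand as sum / count and compare. This is only a sketch: homes_demo and its values are made up for illustration and are not the real homes data.

```python
import pandas as pd

# Hypothetical stand-in for the real `homes` data; the values are invented.
homes_demo = pd.DataFrame({
    "zipcode": [28001, 28001, 28002, 28002, 28002],
    "price": [280000, 281609, 234000, 234500, 234352],
})

grouped = homes_demo.groupby("zipcode")["price"]

# True division of two integer Series yields float64,
# so this is the mean with its fractional part intact.
manual_mean = grouped.sum() / grouped.count()

# The built-in aggregation, for comparison of values and dtype.
builtin_mean = grouped.mean()

print(manual_mean.dtype, builtin_mean.dtype)
print(manual_mean - builtin_mean)
```

If the differences printed at the end are not all zero, the built-in mean has been truncated.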
I cannot think of any technical reason why the mean of some ints would not be promoted to float. What is more, just adding another column to the selection makes price come out as float64, which is what I expected all along:
In [148]: homes.groupby('zipcode')[['price', 'sqft']].mean().dtypes
Out[148]:
price       float64
sqft        float64
dtype: object
The first rows of that two-column result keep their fractional part:
                  price          sqft
zipcode
28001     280804.690608  14937.450276
28002     234284.035176   7517.633166
28003     294111.278571  10603.096429
28004    1355927.097792  13104.220820
28005     810164.880952  19928.785714
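A possible workaround, which I have not verified on pandas 0.16.2, is to cast the column to float64 before grouping, so that there is no integer dtype for the result to be cast back to. Again, homes_demo is an invented stand-in for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the real `homes` data; the values are invented.
homes_demo = pd.DataFrame({
    "zipcode": [28001, 28001, 28002, 28002, 28002],
    "price": [280000, 281609, 234000, 234500, 234352],
})

# Work on a copy so the original integer column is left untouched.
as_float = homes_demo.copy()
as_float["price"] = as_float["price"].astype("float64")

# Same single-column groupby as in the question, now on a float column.
means = as_float.groupby("zipcode")[["price"]].mean()

print(means.dtypes)
print(means)
```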
To make sure I was not missing something very obvious, I created another, very simple DataFrame (df), but this one does not show the behaviour:
In [161]: df[['J','K']].dtypes
Out[161]:
J    int64
K    int64
dtype: object
In [164]: df[['J','K']].head(n=10)
Out[164]:
   J   K
0  0  -9
1  0 -14
2  0   8
3  0 -11
4  0  -7
5 -1   7
6  0   2
7  0   0
8  0   5
9  0   3
In [165]: df.groupby('J')[['K']].mean()
Out[165]:
           K
J
-2 -2.333333
-1  0.466667
 0 -1.030303
 1 -1.750000
 2 -3.000000
Please note that here, with a single int64 column K grouped by J, another int64 column, the mean comes out directly as a float. The homes DataFrame was read from a supplied CSV file; the df one was created in pandas, written to a CSV file and then read back.
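For anyone who wants to try this, the round trip used for df can be sketched as below. The data here is made up (df_demo is a hypothetical name), and an in-memory buffer stands in for the CSV file on disk.

```python
import io

import pandas as pd

# Small made-up frame with two integer columns, like J and K above.
df_demo = pd.DataFrame({"J": [0, 0, 0, -1, -1], "K": [-9, -14, 8, 7, 2]})

# Write to CSV and read it back; a StringIO buffer replaces a real file.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Both columns should come back as int64 after the round trip.
print(reloaded.dtypes)

# Single-column groupby mean, the same shape of call as In [165].
result = reloaded.groupby("J")[["K"]].mean()
print(result.dtypes)
print(result)
```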
Last but not least, I am using pandas 0.16.2.
As suggested by some of you in the comments, this is a bug in pandas. I have just reported it here.
As of now, it has been accepted by the pandas team.
Thanks