I'm trying to find the mean, variance and standard deviation using pandas. However, my manual calculation differs from the pandas output. Is there anything I'm missing when using pandas? I've attached an Excel screenshot for reference.
import pandas as pd
dg_df = pd.DataFrame(
data=[600,470,170,430,300],
index=['a','b','c','d','e'])
print(dg_df.mean(axis=0)) # 394.0, matches the manual calculation
print(dg_df.var()) # 27130.0, but the manual calculation gives 21704
print(dg_df.std(axis=0)) # 164.71187, but the manual calculation gives 147.32
There is more than one definition of standard deviation. Your manual calculation is the equivalent of Excel's STDEV.P, which has the description: "Calculates standard deviation based on the entire population...". If you need the sample standard deviation in Excel, use STDEV.S.
pd.DataFrame.std uses ddof=1 (delta degrees of freedom) by default, so it divides by N - 1. This is the sample standard deviation.
numpy.std uses ddof=0 by default, so it divides by N. This is the population standard deviation.
See Bessel's correction to understand the difference between sample and population.
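The two definitions can be checked by hand. The sketch below uses only the Python standard library and the five values from the question; the variable names are illustrative and do not come from the original post. The only difference between the two results is the divisor: N for the population, N - 1 for the sample.

```python
import math

values = [600, 470, 170, 430, 300]
n = len(values)

mean = sum(values) / n
# Sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in values)

pop_var = ss / n            # population variance, equivalent to ddof=0
sample_var = ss / (n - 1)   # sample variance (Bessel's correction), ddof=1

pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

print(mean, pop_var, sample_var)              # 394.0 21704.0 27130.0
print(round(pop_sd, 2), round(sample_sd, 2))  # 147.32 164.71
```

The population figures (21704 and 147.32) are the manual results from the question, and the sample figures (27130 and 164.71) are the pandas defaults.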
You can also specify ddof=0 with Pandas std / var methods:
dg_df.std(ddof=0)
dg_df.var(ddof=0)
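As a quick check, the following sketch compares the pandas and NumPy defaults side by side. It assumes pandas and NumPy are installed, and it uses a Series named s instead of the question's DataFrame purely to keep the output scalar; that name is an illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

s = pd.Series([600, 470, 170, 430, 300], index=["a", "b", "c", "d", "e"])

pandas_default = s.std()          # ddof=1 by default: sample SD
pandas_population = s.std(ddof=0) # population SD, like Excel STDEV.P

numpy_default = np.std(s.to_numpy())         # ddof=0 by default: population SD
numpy_sample = np.std(s.to_numpy(), ddof=1)  # sample SD, like Excel STDEV.S

print(round(pandas_default, 2), round(numpy_sample, 2))     # 164.71 164.71
print(round(pandas_population, 2), round(numpy_default, 2)) # 147.32 147.32
print(s.var(ddof=0))                                        # 21704.0
```

Passing ddof explicitly in both libraries avoids relying on defaults that differ between them.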