For data X = [0,0,1,1,0]
and Y = [1,1,0,1,1]
>> np.corrcoef(X,Y)
returns
array([[ 1. , -0.61237244],
[-0.61237244, 1. ]])
However, I cannot reproduce this result using np.var and np.cov, given the equation shown in http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html:
>> np.cov([0,0,1,1,0],[1,1,0,1,1])/np.sqrt(np.var([0,0,1,1,0])*np.var([1,1,0,1,1]))
array([[ 1.53093109, -0.76546554],
[-0.76546554, 1.02062073]])
What's going on here?
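This is because np.var uses a default delta degrees of freedom (ddof) of 0, not 1, while np.cov normalizes by N-1 by default. Passing ddof=1 to np.var makes the two consistent: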
In [57]:
X = [0,0,1,1,0]
Y = [1,1,0,1,1]
np.corrcoef(X,Y)
Out[57]:
array([[ 1. , -0.61237244],
[-0.61237244, 1. ]])
In [58]:
V = np.sqrt(np.array([np.var(X, ddof=1), np.var(Y, ddof=1)])).reshape(1,-1)
np.matrix(np.cov(X,Y))
Out[58]:
matrix([[ 0.3 , -0.15],
[-0.15, 0.2 ]])
In [59]:
np.matrix(np.cov(X,Y))/(V*V.T)
Out[59]:
matrix([[ 1. , -0.61237244],
[-0.61237244, 1. ]])
Or, looking at it the other way, take the variances from the diagonal of the covariance matrix itself:
In [70]:
V=np.diag(np.cov(X,Y)).reshape(1,-1) #the diagonal elements
In [71]:
np.matrix(np.cov(X,Y))/np.sqrt(V*V.T)
Out[71]:
matrix([[ 1. , -0.61237244],
[-0.61237244, 1. ]])
What is really going on: for np.cov(m, y=None, rowvar=1, bias=0, ddof=None), when neither bias nor ddof is provided, the default normalization is by N-1, N being the number of observations. That is equivalent to a delta degrees of freedom of 1. By contrast, np.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False) defaults to a delta degrees of freedom of 0, so it divides by N.
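To make the mismatch concrete, here is a small sketch using the same X and Y as above. It only compares the two defaults; the variable names (var_pop, var_sample) are mine and purely illustrative.

```python
import numpy as np

X = [0, 0, 1, 1, 0]
Y = [1, 1, 0, 1, 1]

cov = np.cov(X, Y)               # default normalization: divide by N-1
var_pop = np.var(X)              # default ddof=0: divide by N
var_sample = np.var(X, ddof=1)   # ddof=1: divide by N-1, same as np.cov

# The diagonal of the covariance matrix matches the ddof=1 variance,
# not the default (ddof=0) variance.
print(np.isclose(cov[0, 0], var_sample))  # True
print(np.isclose(cov[0, 0], var_pop))     # False
```

With N = 5, the two variances differ by a factor of N/(N-1) = 1.25, which is the same factor by which the result in the question is off from np.corrcoef.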
Whenever unsure, the safest way is to grab the diagonal elements of the covariance matrix rather than calculate var separately, to ensure consistent behavior.
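As a rough sketch of that advice, the diagonal-based normalization can be wrapped in a small function. The name corr_from_cov is hypothetical (it is not a NumPy function), and np.outer is used here in place of the reshape-and-broadcast trick shown earlier; both build the same 2x2 matrix of standard deviation products.

```python
import numpy as np

def corr_from_cov(x, y):
    """Hypothetical helper: correlation matrix built only from np.cov."""
    c = np.cov(x, y)
    # Standard deviations come from the covariance matrix's own diagonal,
    # so the normalization (N-1 by default) is consistent by construction.
    d = np.sqrt(np.diag(c))
    return c / np.outer(d, d)

X = [0, 0, 1, 1, 0]
Y = [1, 1, 0, 1, 1]

result = corr_from_cov(X, Y)
print(np.allclose(result, np.corrcoef(X, Y)))  # True
```

Because the normalization factor cancels in the ratio, this gives the same answer whichever ddof np.cov uses, as long as the numerator and denominator come from the same call.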