 

Missing values masked array correlation (numpy.ma)

I am trying to use numpy.ma.corrcoef to calculate correlations in the presence of missing data.

According to the documentation: "Except for the handling of missing data this function does the same as numpy.corrcoef. For more details and examples, see numpy.corrcoef."

Here is a bivariate dataset, for which only the first and second points have data for both variables.

array([[ 0.00494576, -0.01331578],
       [-0.00146498, -0.01349548],
       [ 0.00430321,         nan],
       [-0.00937105,         nan],
       [        nan, -0.01356873],
       [        nan, -0.01375538],
       [        nan, -0.00277393],
       [        nan,  0.0082988 ],
       [        nan,  0.        ],
       [        nan,  0.00275103],
       [        nan,  0.00547947],
       [        nan, -0.01375538],
       [        nan,  0.0110194 ],
       [        nan, -0.00549452],
       [        nan,  0.01910017],
       [        nan, -0.02462505],
       [        nan, -0.01676017],
       [        nan,  0.0112046 ],
       [        nan,  0.01108045],
       [        nan,  0.01639381],
       [        nan,  0.01078178],
       [        nan, -0.01078178]])

When I convert this to a masked array with np.ma.masked_array(t, np.isnan(t)), where t is the array above, and run np.ma.corrcoef on it with rowvar=False, the correlation between the two variables is reported as -86.52 (the raw coefficient, not a percentage!). Running np.corrcoef on the first two points alone gives a correlation of 1 (again the raw coefficient). According to the documentation, this latter value is what I would expect from the first operation.
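For reference, here is a minimal sketch of how the expected value can be computed. It uses a shortened stand-in for t (the two complete rows plus a few of the incomplete ones, not the full dataset), and the name complete_rows is only illustrative. It drops every row that contains a NaN and then calls np.corrcoef on what remains.

```python
import numpy as np

# Shortened stand-in for t: the two complete rows plus a few incomplete ones.
t = np.array([
    [ 0.00494576, -0.01331578],
    [-0.00146498, -0.01349548],
    [ 0.00430321,      np.nan],
    [-0.00937105,      np.nan],
    [     np.nan, -0.01356873],
    [     np.nan, -0.01375538],
])

# Keep only the rows where both variables are present.
complete_rows = t[~np.isnan(t).any(axis=1)]

# With rowvar=False each column is a variable, so entry [0, 1] is the
# correlation between the first and second variable.
r = np.corrcoef(complete_rows, rowvar=False)[0, 1]
print(complete_rows.shape, r)
```

With only two complete points, the coefficient can only be +1 or -1, which is why a value such as -86.52 looks wrong.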

My Python version information is below (Enthought 64-bit PyLab on Mac OS X 10.6.8), and I am using NumPy version 1.6.1.

Python 2.7.3 |EPD 7.3-1 (64-bit)| (default, Apr 12 2012, 11:14:05) Type "copyright", "credits" or "license" for more information.

Please advise on what I am missing here! Thanks in advance.

asked Jan 26 '26 12:01 by Psmith


1 Answer

I think it is probably a bug in numpy.ma.corrcoef. To be more exact, it may be in np.ma.extras._covhelper, which I think does not propagate the mask correctly from one column to the other when a single array is given as input (but maybe I was looking in the wrong place).

In the meantime, use np.ma.corrcoef(b[:,0], b[:,1]), where b is the masked array, and create a bug report. The two-argument call gives the expected result, so it's a simple workaround until the bug is fixed.
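A minimal sketch of that workaround follows. It uses a shortened stand-in for the question's data rather than the full array, and it assumes that b is the masked array built with np.ma.masked_array as in the question.

```python
import numpy as np

# Shortened stand-in for the question's array (two complete rows only).
t = np.array([
    [ 0.00494576, -0.01331578],
    [-0.00146498, -0.01349548],
    [ 0.00430321,      np.nan],
    [-0.00937105,      np.nan],
    [     np.nan, -0.01356873],
    [     np.nan, -0.01375538],
])

# Mask every NaN entry, as in the question.
b = np.ma.masked_array(t, np.isnan(t))

# Pass the two columns as separate arguments instead of one 2-D array.
# The result is a 2x2 matrix; entry [0, 1] is the correlation between
# the two variables.
r = np.ma.corrcoef(b[:, 0], b[:, 1])
print(r[0, 1])
```

Passing the columns separately seems to make a point count only when it is unmasked in both variables, so the result should match np.corrcoef on the two complete rows.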

answered Jan 28 '26 02:01 by seberg


