I have a table (X, Y) where X is a matrix and Y is a vector of classes. Here is an example:

X = 0 0 1 0 1    Y = 1
    0 1 0 0 0        1
    1 1 1 0 1        0
I want to use the Mann-Whitney U test to compute feature importance (feature selection):
import numpy as np
from scipy.stats import mannwhitneyu

results = np.zeros((X.shape[1], 2))
for i in range(X.shape[1]):
    u, prob = mannwhitneyu(X[:, i], Y)
    results[i, :] = u, prob
I'm not sure whether this is correct or not. I obtained large values for a large table, e.g. u = 990 for some columns.
I don't think that the Mann-Whitney U test is a good way to do feature selection, at least not this way. Mann-Whitney tests whether the distributions of two variables are the same; it tells you nothing about how correlated the variables are. For example:
>>> from scipy.stats import mannwhitneyu
>>> a = np.arange(100)
>>> b = np.arange(100)
>>> np.random.shuffle(b)
>>> np.corrcoef(a,b)
array([[ 1.        , -0.07155116],
       [-0.07155116,  1.        ]])
>>> mannwhitneyu(a, b)
(5000.0, 0.49951259627554112)  # result for nearly uncorrelated inputs
>>> mannwhitneyu(a, a)
(5000.0, 0.49951259627554112)  # identical result for perfectly correlated inputs
Because a and b have the same distribution, we fail to reject the null hypothesis that the distributions are identical, no matter how correlated (or uncorrelated) the two variables are. And since in feature selection you are trying to find the features that best explain Y, the Mann-Whitney U test applied this way does not help you with that.
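For completeness, the way the Mann-Whitney U test is usually applied to feature screening with a binary target is different from the code in the question: instead of comparing a feature column against the label vector, you split the feature's values by class and test whether the two class-conditional distributions differ. Below is a minimal sketch of that idea on synthetic data (the variable names and the synthetic example are my own, not from the original post):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)                    # binary class labels
informative = y + rng.normal(scale=0.5, size=200)  # shifts with the class
noise = rng.normal(size=200)                       # independent of the class

for name, feature in [("informative", informative), ("noise", noise)]:
    # Compare the feature's values in class 0 vs class 1.
    # A small p-value suggests the feature's distribution differs between classes.
    u, p = mannwhitneyu(feature[y == 0], feature[y == 1])
    print(name, "U =", u, "p =", p)
```

With this split-by-class usage the informative feature gets a tiny p-value while the noise feature does not, so the test can at least flag features whose distribution depends on the class, even though it still says nothing about correlation strength.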