
Python sklearn.mixture.GMM not robust to scale?

I'm using sklearn.mixture.GMM in Python, and the results seem to depend on data scaling. In the following code example, I change the overall scaling but I do NOT change the relative scaling of the dimensions. Yet under the three different scaling settings I get completely different results:

from sklearn.mixture import GMM
from numpy import array, shape
from numpy.random import randn
from random import choice

# centroids will be normally-distributed around zero:
truelumps = randn(20, 5) * 10

# data randomly sampled from the centroids:
data = array([choice(truelumps) + randn(5) for _ in xrange(1000)])

for scaler in [0.01, 1, 100]:
    scdata = data * scaler
    thegmm = GMM(n_components=10)
    thegmm.fit(scdata, n_iter=1000)
    ll = thegmm.score(scdata)
    print sum(ll)

Here's the output I get:

GMM(cvtype='diag', n_components=10)
7094.87886779
GMM(cvtype='diag', n_components=10)
-14681.566456
GMM(cvtype='diag', n_components=10)
-37576.4496656

In principle, I don't think the overall data scaling should matter, and the total log-likelihoods should come out similar each time. But maybe there's an implementation issue I'm overlooking?

Asked by Dan Stowell

1 Answer

I've had an answer via the scikit-learn mailing list: in my code example, the log-likelihood should indeed vary with scale (because we're evaluating point likelihoods, not integrals), by an additive offset proportional to log(scale). So I think my code example in fact shows GMM giving correct results.
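
To make that concrete (a small illustrative sketch of my own, not from the mailing-list reply): scaling d-dimensional data by a factor s divides every density value by s^d, so each point's log-likelihood shifts by -d*log(s) and the total over N points by -N*d*log(s). With d = 5, N = 1000 and a factor of 100 between runs, that predicts a shift of roughly 1000 * 5 * ln(100) ≈ 23000, which is about the gap between the totals above (the rest comes from each fit converging slightly differently).

import numpy as np
from scipy.stats import multivariate_normal

# Evaluate the same points under a density whose scale has been rescaled
# along with the data: the total log-likelihood drops by exactly N*d*log(s).
d, N, s = 5, 1000, 100.0
x = np.random.randn(N, d)

ll_orig = multivariate_normal(mean=np.zeros(d), cov=np.eye(d)).logpdf(x).sum()
ll_scaled = multivariate_normal(mean=np.zeros(d), cov=(s ** 2) * np.eye(d)).logpdf(s * x).sum()

print(ll_scaled - ll_orig)   # equals -N * d * log(s)
print(-N * d * np.log(s))    # about -23026 for these values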

Answered by Dan Stowell


