 

Sklearn: How to apply dimensionality reduction on a huge data set?

Problem: An OutOfMemory error is raised when applying PCA to 8 million features.

Here is my code snippet:

from sklearn.decomposition import PCA as sklearnPCA

sklearn_pca = sklearnPCA(n_components=10000)
# .toarray() densifies the sparse tf-idf matrix, which is where memory runs out
pca_tfidf_sklearn = sklearn_pca.fit(traindata_tfidf.toarray())

I want to apply PCA / dimensionality reduction techniques to features extracted from text (using tf-idf). I currently have around 8 million such features; I want to reduce them, and I am using MultinomialNB to classify the documents.

I am stuck because of the OutOfMemory error.

asked Sep 05 '25 by Aman Tandon

2 Answers

I have had a similar problem. Using a Restricted Boltzmann Machine (RBM) instead of PCA fixed it for me. Mathematically, this is because PCA only looks at the eigenvalues and eigenvectors of your feature matrix, whereas an RBM works as a neural network and can capture multiplicative interactions between the features in your data. The RBM therefore has a much richer set of candidates to consider when deciding which features matter most, and it reduces the feature set to a much smaller number of more informative features than PCA can. However, be sure to feature-scale and normalize the data before applying an RBM to it.
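
Here is a minimal sketch of that workflow, assuming scikit-learn's BernoulliRBM as the RBM and a dense feature matrix small enough to hold in memory; the array X, the component count, and the hyperparameters are illustrative, not settings taken from the answer.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import BernoulliRBM

# Illustrative data; replace with your own feature matrix
X = np.random.rand(100, 500)

# BernoulliRBM expects inputs in the [0, 1] range, so scale first
X_scaled = MinMaxScaler().fit_transform(X)

# Learn a smaller set of latent features with the RBM
rbm = BernoulliRBM(n_components=100, learning_rate=0.05, n_iter=20, random_state=0)
X_reduced = rbm.fit_transform(X_scaled)

print(X_reduced.shape)  # (100, 100)

The hidden-unit activations returned by fit_transform are non-negative, so the reduced features can still be passed to a classifier such as MultinomialNB.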

answered Sep 07 '25 by London Holmes


I suppose traindata_tfidf is actually in sparse form. Try using one of SciPy's sparse formats instead of a dense array. Also take a look at the SparsePCA methods, and if that doesn't help, use MiniBatchSparsePCA.
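
For illustration, here is a minimal sketch that keeps everything sparse, assuming traindata_tfidf is a scipy.sparse CSR matrix (which is what TfidfVectorizer returns); TruncatedSVD stands in as a concrete reducer that accepts sparse input directly, and the matrix shape and component count are made up for the example.

from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in for the real tf-idf matrix; keep it sparse, do NOT call .toarray()
traindata_tfidf = sparse.random(1000, 20000, density=0.01, format='csr', random_state=0)

# TruncatedSVD works directly on scipy sparse matrices, so the huge
# feature matrix is never densified
svd = TruncatedSVD(n_components=100, random_state=0)
traindata_reduced = svd.fit_transform(traindata_tfidf)

print(traindata_reduced.shape)  # (1000, 100)

Note that MultinomialNB accepts sparse input as well, so the raw sparse tf-idf features can be tried as a baseline with no reduction at all; the SVD output, by contrast, can contain negative values, which MultinomialNB will reject.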

answered Sep 07 '25 by cyberj0g