Problem: an OutOfMemory error is raised when applying PCA to 8 million features.
Here is my code snippet:
from sklearn.decomposition import PCA as sklearnPCA
sklearn_pca = sklearnPCA(n_components=10000)
pca_tfidf_sklearn = sklearn_pca.fit(traindata_tfidf.toarray())
I want to apply PCA / dimensionality reduction techniques to features extracted from text (using tf-idf). I currently have around 8 million such features, and I want to reduce them; I am using MultinomialNB to classify the documents.
I am stuck because of the OutOfMemory error.
I have had a similar problem. Using a Restricted Boltzmann Machine (RBM) instead of PCA fixed it. Mathematically, this is because PCA only looks at the eigenvalues and eigenvectors of your feature matrix, whereas an RBM works as a neural network and can capture multiplicative interactions between the features in your data. The RBM therefore has a much richer set of candidates to consider when deciding which features are important, and it can reduce the feature set to a much smaller size while keeping more useful information than PCA does. However, be sure to feature-scale and normalize the data before applying the RBM.
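For illustration, here is a minimal sketch of that approach using scikit-learn's BernoulliRBM, assuming traindata_tfidf is the sparse matrix returned by TfidfVectorizer and train_labels is a hypothetical array of document classes (neither name comes from the question). Keep in mind that the RBM weight matrix has shape (n_components, n_features), so with 8 million columns you may also need to shrink the vocabulary or keep n_components small.

from sklearn.neural_network import BernoulliRBM
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

pipeline = Pipeline([
    # Scale every feature into [0, 1]; MaxAbsScaler accepts sparse input
    # and keeps the tf-idf values non-negative.
    ('scale', MaxAbsScaler()),
    # Learn a compressed representation; the hidden-unit activations
    # become the new, much smaller feature set.
    ('rbm', BernoulliRBM(n_components=100, learning_rate=0.01, n_iter=20, random_state=42)),
    # Classify the documents on the reduced features.
    ('nb', MultinomialNB()),
])

pipeline.fit(traindata_tfidf, train_labels)  # train_labels is a hypothetical label array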
I suppose traindata_tfidf is actually in a sparse form. Try using one of the scipy sparse formats instead of a dense array. Also take a look at the SparsePCA methods, and if that doesn't help, use MiniBatchSparsePCA. A sparse-aware sketch is shown below.
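Here is a minimal sketch of the sparse route, assuming traindata_tfidf comes straight from TfidfVectorizer (so it is already a CSR matrix) and train_labels is a hypothetical label array. Note that SparsePCA / MiniBatchSparsePCA compute sparse components but, as far as I know, still expect a dense input array, so TruncatedSVD is shown here instead because it accepts scipy sparse input directly.

import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Keep the tf-idf matrix sparse; never call .toarray() on it.
X = sp.csr_matrix(traindata_tfidf)

# TruncatedSVD (LSA when applied to tf-idf) factorizes the sparse matrix
# without materializing a dense 8-million-column array.
svd = TruncatedSVD(n_components=300, random_state=42)
X_reduced = svd.fit_transform(X)

# The SVD output can contain negative values, which MultinomialNB rejects,
# so rescale each component into [0, 1] before classifying.
X_nonneg = MinMaxScaler().fit_transform(X_reduced)
clf = MultinomialNB().fit(X_nonneg, train_labels)  # train_labels is a hypothetical label array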