How can I reduce memory usage of Scikit-Learn Vectorizers?

Question

TfidfVectorizer takes a lot of memory: vectorizing 100k documents (470 MB) takes over 6 GB. If we go to 21 million documents, it will not fit in the 60 GB of RAM we have.

So we went for HashingVectorizer, but we still need to know how to distribute the hashing vectorizer. fit and partial_fit do nothing, so how do we work with a huge corpus?

ogrisel · Accepted Answer

I would strongly recommend using the HashingVectorizer when fitting models on large datasets.

The HashingVectorizer is data independent: it stores no vocabulary, so only the parameters returned by vectorizer.get_params() matter. Hence (un)pickling a HashingVectorizer instance should be very fast.
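As a minimal sketch of what this looks like in practice: because the vectorizer holds no fitted state, each batch (or each worker, if the work is distributed) can call transform directly and feed the result to an estimator that supports partial_fit. The toy documents, labels, batch size, and the choice of SGDClassifier below are assumptions made for illustration, not part of the original answer.

```python
import pickle

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless: no vocabulary is learned, so there is no fit step.
# n_features fixes the output width up front (assumed value).
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

# Toy corpus standing in for documents streamed from disk (illustrative).
docs = [
    "free money click now",
    "meeting agenda for monday",
    "win a free prize today",
    "project status and agenda",
]
labels = [1, 0, 1, 0]


def iter_batches(texts, targets, batch_size):
    """Yield consecutive (texts, targets) slices of at most batch_size items."""
    for start in range(0, len(texts), batch_size):
        stop = start + batch_size
        yield texts[start:stop], targets[start:stop]


# Any estimator with partial_fit works; SGDClassifier is one option.
clf = SGDClassifier(random_state=0)
for batch_docs, batch_labels in iter_batches(docs, labels, batch_size=2):
    X = vectorizer.transform(batch_docs)  # only this batch is in memory
    clf.partial_fit(X, batch_labels, classes=[0, 1])

# Pickling carries only the constructor parameters, so a copy sent to
# another process produces the same features as the original.
restored = pickle.loads(pickle.dumps(vectorizer))
```

Since every worker that builds a HashingVectorizer with the same parameters maps a given token to the same column, the batches can be vectorized independently and in any order.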

The vocabulary based vectorizers are better suited for exploratory analysis on small datasets.

Gireesh Ramji · Answer

One way to overcome the inability of HashingVectorizer to account for IDF is to index your data into Elasticsearch or Lucene and retrieve term vectors from there, from which you can calculate TF-IDF.
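If running a search index is not an option, a different technique (not the Elasticsearch/Lucene approach described above) is to make two streaming passes over the hashed features: one to count document frequencies per hashed column, and one to apply the IDF weights. The sketch below assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization; the toy corpus, batch size, and n_features value are illustrative. Because of hash collisions, frequencies are per hashed column rather than per distinct term.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import normalize

N_FEATURES = 2**18  # assumed width; larger values mean fewer collisions

# norm=None and alternate_sign=False keep raw, non-negative term counts.
vectorizer = HashingVectorizer(
    n_features=N_FEATURES, norm=None, alternate_sign=False
)

# Toy corpus standing in for documents streamed from disk (illustrative).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
    "the quick brown fox",
]


def iter_batches(texts, batch_size):
    """Yield consecutive slices of at most batch_size documents."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Pass 1: count, for each hashed column, how many documents touch it.
doc_freq = np.zeros(N_FEATURES, dtype=np.int64)
n_docs = 0
for batch in iter_batches(docs, batch_size=2):
    X = vectorizer.transform(batch)
    # Each stored (row, column) entry is one document containing that column.
    doc_freq += np.bincount(X.indices, minlength=N_FEATURES)
    n_docs += X.shape[0]

# Smoothed inverse document frequency, one weight per hashed column.
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0
idf_diag = sp.diags(idf)

# Pass 2: re-hash each batch, scale columns by IDF, then L2-normalize rows.
weighted_batches = []
for batch in iter_batches(docs, batch_size=2):
    X = vectorizer.transform(batch)
    weighted_batches.append(normalize(X @ idf_diag, norm="l2"))

# Stacked here only so the result can be inspected; on a real corpus each
# weighted batch would go straight to a partial_fit call or to disk.
tfidf = sp.vstack(weighted_batches).tocsr()
```

The only state kept between passes is one integer counter per hashed column, so memory use is bounded by n_features rather than by corpus size.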

Tags:

python

machine-learning

numpy

scipy

scikit-learn

Phyo Arkar Lwin
