Currently I have 1.2 TB of text data to build gensim's Word2Vec model, and training takes almost 15 to 20 days to complete.
I want to build a model for 5 TB of text data, which at this rate might take a few months. I need to minimise this execution time. Is there any way to use multiple large machines to build the model?
Please suggest any approach that can help me reduce the execution time.
FYI, I have all my data in S3 and use the smart_open module to stream it.
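For reference, this is a minimal sketch of the kind of streaming setup described above, assuming one whitespace-tokenized sentence per line; the bucket and key names are placeholders:

from gensim.models import Word2Vec
from smart_open import open as smart_open


class S3Corpus:
    """Iterate over sentences streamed directly from S3 without downloading."""

    def __init__(self, uri):
        self.uri = uri

    def __iter__(self):
        with smart_open(self.uri, "r", encoding="utf-8") as fin:
            for line in fin:
                yield line.split()


# gensim re-iterates the corpus once per epoch, so the S3 stream is re-opened each time.
model = Word2Vec(
    sentences=S3Corpus("s3://your-bucket/corpus.txt"),
    vector_size=300,
    min_count=5,
    workers=16,
)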
You can use Apache Spark's Word2Vec implementation: https://javadoc.io/doc/org.apache.spark/spark-mllib_2.12/latest/org/apache/spark/mllib/feature/Word2Vec.html
Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
Source: apache/spark repository at commit e053c55
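As a rough illustration, here is a minimal PySpark sketch of distributed training with the MLlib Word2Vec linked above, assuming the corpus sits in S3 as plain-text files, the cluster has the S3 connector configured, and the paths and parameter values are placeholders to tune for your data:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="word2vec-training")

# Read the corpus from S3; each line becomes one record, split into tokens.
corpus = sc.textFile("s3a://your-bucket/corpus/*.txt").map(lambda line: line.split())

# Train across the cluster: each partition is trained separately and the
# partial models are merged after every iteration, as described above.
word2vec = (
    Word2Vec()
    .setVectorSize(300)
    .setMinCount(5)
    .setNumPartitions(64)
    .setNumIterations(1)
)
model = word2vec.fit(corpus)

# Persist the trained model back to S3.
model.save(sc, "s3a://your-bucket/models/word2vec")

More partitions speed up training but can reduce accuracy, which is why the documentation notes that multiple iterations may be needed.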