I have two lists. A is a list of words, for example ["hello", "world", ...], and len(A) is 10000. List B contains all the pre-trained vectors corresponding to A, with shape [10000, 512], where 512 is the vector dimension. I want to convert the two lists into the gensim word2vec model format so that I can load the model later, e.g. model = Word2Vec.load("word2vec.model"). How should I do this?
As you only have the words and their vectors, you don't quite have enough info for a full Word2Vec model (which includes other things like the internal neural network's hidden weights, and word frequencies).
But you can create a gensim KeyedVectors object, of the kind held in a gensim Word2Vec model's .wv property. It has many of the helper methods (like most_similar()) you may be interested in using.
Let's assume your A list-of-words is in a more-helpfully named Python list called words_list, and your B list-of-vectors is in a more-helpfully named Python list called vectors_list.
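For example, assuming A and B are exactly as described in the question, the renaming is just:
words_list = A       # 10000 words, e.g. ["hello", "world", ...]
vectors_list = B     # 10000 vectors, each of dimension 512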
Try:
from gensim.models import KeyedVectors
kv = KeyedVectors(512)
kv.add(words_list, vectors_list)  # in gensim 4.x, add_vectors() is the preferred name for this method
kv.save("mywordvecs.kvmodel")
You could then later re-load these via:
kv2 = KeyedVectors.load("mywordvecs.kvmodel")
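As a quick sanity check, you could then query the re-loaded vectors with the helper methods mentioned above (here "hello" is just an assumed example word from your words_list):
print(kv2.most_similar("hello", topn=5))   # nearest neighbours by cosine similarity
print(kv2["hello"].shape)                  # the stored 512-dimensional vector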
(You could also use save_word2vec_format() and load_word2vec_format() instead of gensim's native save()/load(), if you wanted simpler plain-vectors formats that could also be loaded by other tools that use that format. But if you're staying within gensim, the plain save()/load() are just as good – and would be better if saving a more complex trained Word2Vec model, because they'd retain the extra info those objects contain.)
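For example, a rough sketch of that alternative (the filename mywordvecs.txt is just an assumption):
kv.save_word2vec_format("mywordvecs.txt", binary=False)
kv3 = KeyedVectors.load_word2vec_format("mywordvecs.txt", binary=False)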