 

Normalize vectors in gensim model

I have a pre-trained word embedding whose vectors have different norms, and I want to normalize all vectors in the model. I am doing it with a for loop that iterates over each word and normalizes its vector, but the model is huge and this takes too much time. Does gensim include any way to do this faster? I cannot find one.
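For reference, the per-word loop looks roughly like this. This is only a minimal sketch, assuming a gensim 3.x KeyedVectors loaded from a hypothetical embeddings.bin file; the file name is a placeholder.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file name; any word2vec-format file works.
kv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

# Slow: one Python-level norm computation and division per word.
for word in kv.index2word:          # gensim 3.x vocabulary list
    idx = kv.vocab[word].index
    kv.vectors[idx] = kv.vectors[idx] / np.linalg.norm(kv.vectors[idx])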

Thanks!!

asked Sep 05 '25 by Rodrigo Serna Pérez

1 Answer

Gensim instances of KeyedVectors (the common interface for sets of word-vectors) include a method, init_sims(), which internally calculates unit-length-normalized vectors using fast native array operations.

When certain operations that normally work on unit-normalized vectors are attempted for the first time, this init_sims() is called automatically, and the model caches the normalized vectors in a model property (vectors_norm), roughly doubling the RAM consumption.

Once it's been called, you can access normed vectors using the .word_vec() method:

normed_wv = kv_model.word_vec(word, use_norm=True)
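As a sketch of how that fits together, assuming gensim 3.x (where init_sims() and word_vec(use_norm=...) are available); the file name and the word "king" are placeholders:

import numpy as np
from gensim.models import KeyedVectors

kv_model = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

# The first similarity query triggers init_sims() and caches the
# unit-length vectors in kv_model.vectors_norm.
kv_model.most_similar("king", topn=5)

raw_vec = kv_model.word_vec("king")                     # original magnitude
normed_vec = kv_model.word_vec("king", use_norm=True)   # unit length
print(np.linalg.norm(raw_vec), np.linalg.norm(normed_vec))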

If you're sure you won't need the raw, un-normed vectors, you can also call init_sims() yourself with its optional replace parameter. The normed vectors will then clobber the raw vectors in place, saving the extra RAM. For example:

kv_model.init_sims(replace=True)
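To confirm the replacement worked, a quick check might look like this (again assuming gensim 3.x, where the stored array is kv_model.vectors):

import numpy as np

# After init_sims(replace=True), all stored vectors should be unit length.
norms = np.linalg.norm(kv_model.vectors, axis=1)
print(norms.min(), norms.max())  # both should be very close to 1.0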

Note that while things like finding the nearest-neighbors of a word, as in the common most_similar() operation, traditionally use unit-normalized vectors, there are sometimes downstream applications where the raw vectors are useful. (Also, in a full Word2Vec model, if you're going to do additional incremental training, that should happen on raw vectors, not normalized vectors.)

answered Sep 08 '25 by gojomo