word co-occurrence matrix from gensim

Tags: python, nlp, gensim

When building a python gensim word2vec model, is there a way to see a doc-to-word matrix?

With input of sentences = [['first', 'sentence'], ['second', 'sentence']] I'd see something like*:

      first  second  sentence
doc0    1       0        1
doc1    0       1        1

*I've illustrated 'human readable', but I'm looking for a scipy (or other) matrix, indexed to model.wv.index2word.

And can that be transformed into a word-to-word matrix (to see co-occurrences)? Something like:

          first  second  sentence
first       1       0        1
second      0       1        1  
sentence    1       1        2   

I've already implemented something like a word-word co-occurrence matrix using CountVectorizer, and it works well. However, I'm already using gensim in my pipeline, and speed and code simplicity matter for my use case.

DavidR asked Oct 28 '25 09:10

1 Answer

Given a corpus that is a list of lists of words, create a Gensim Dictionary, convert the corpus to bag-of-words, and then build the matrix from that:

from gensim.matutils import corpus2csc
from gensim.corpora import Dictionary

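# somehow create your corpus (a list of tokenized documents)

dct = Dictionary(corpus)  # maps each token to an integer id
bow_corpus = [dct.doc2bow(line) for line in corpus]  # (token_id, count) pairs per document
term_doc_mat = corpus2csc(bow_corpus)  # rows are terms, columns are documents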

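Your term_doc_mat is a SciPy compressed sparse column matrix (scipy.sparse.csc_matrix), with terms as rows and documents as columns. If you want a term-term matrix, you can multiply it by its transpose. Use the sparse matrix product rather than np.dot, which does not perform matrix multiplication on SciPy sparse matrices:

term_term_mat = term_doc_mat @ term_doc_mat.T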

Syncrossus answered Oct 30 '25 00:10