I'm trying to use low-rank approximation for latent semantic indexing. I thought that a low-rank approximation reduces the dimensions of the matrix, but that contradicts the results I get.
Assume my dictionary has 40 000 words and 2000 documents, so my term-by-document matrix is 40 000 x 2000. According to Wikipedia, I have to compute the SVD of the matrix and then keep only the k largest singular values, i.e. approximate it as A_k = U_k S_k V_k^T.
This is the code I use for the SVD and the low-rank approximation (the matrix is sparse):
import scipy.sparse.linalg  # 'import scipy' alone does not expose scipy.sparse.linalg
import numpy as np
u, s, vt = scipy.sparse.linalg.svds(search_matrix, k=20)
search_matrix = u @ np.diag(s) @ vt
print('u: ', u.shape) # (40000, 20)
print('s: ', s.shape) # (20, )
print('vt: ', vt.shape) # (20, 2000)
The resulting matrix is (40 000 x 20) * (20 x 20) * (20 x 2000) = 40 000 x 2000, which is exactly what I started with.
So... how exactly does the low-rank approximation reduce the dimensions of the matrix?
Also, I will be doing queries on this approximated matrix to find the correlation between a user vector and each document (a naive search engine). The user vector starts out as 40 000 x 1 (bag of words). According to the same Wikipedia page, I should map the query into the low-dimensional space as q_k = S_k^-1 U_k^T q.
The code:
user_vec = np.diag((1 / s)) @ u.T @ user_vec
And it produces a 20 x 1 vector, which is what I expected ((20 x 20) * (20 x 40 000) * (40 000 x 1) = 20 x 1). But now its dimensions no longer match the search_matrix I want to multiply it with.
So... what am I doing wrong, and why?
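For completeness, here is a minimal, self-contained sketch of what I am doing. A random sparse matrix stands in for my real term-by-document matrix, scaled down from 40 000 x 2000 to 4 000 x 200 so it runs quickly; k = 20 as above.
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Random stand-ins for the real data (scaled down from 40 000 x 2000).
search_matrix = sparse_random(4000, 200, density=0.01, format='csr', random_state=0)
user_vec = np.random.default_rng(0).random((4000, 1))  # bag-of-words query

u, s, vt = svds(search_matrix, k=20)
approx = u @ np.diag(s) @ vt                   # dense, same shape as search_matrix
projected = np.diag(1 / s) @ (u.T @ user_vec)  # query mapped into the k-dim space

print(approx.shape)     # (4000, 200) -- no smaller than the original
print(projected.shape)  # (20, 1)     -- does not line up with approx for multiplication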
Sources: the Wikipedia article on latent semantic indexing.
About low-rank approximation:
The goal is to have a matrix that you can store with less memory and compute with faster, but that still behaves like the original matrix (in particular, it keeps the same dimensions).
That is why you keep the product of matrices instead of multiplying it out: the factored form has low rank, but the dimensions of the matrix do not change. Storing u, s and vt separately takes 40 000 * 20 + 20 + 20 * 2000 ≈ 840 000 numbers instead of the 80 000 000 in the full matrix; the moment you reconstruct search_matrix = u @ np.diag(s) @ vt, you give that advantage back.
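Here is a sketch of what I mean, reusing the u, s, vt from the question (the two helper names are just for illustration). The ranking part assumes the usual LSI setup: projecting the documents of the rank-k approximation with the same np.diag(1 / s) @ u.T map gives exactly the columns of vt, so the projected 20 x 1 query should be compared with those columns rather than with the reconstructed 40 000 x 2000 matrix.
import numpy as np

def lowrank_matvec(u, s, vt, x):
    # Illustrative helper: compute (u @ np.diag(s) @ vt) @ x without ever
    # forming the full 40 000 x 2000 product; work right-to-left through the factors.
    return u @ (s[:, None] * (vt @ x))        # x: (2000, 1) -> result: (40000, 1)

def rank_documents(u, s, vt, user_vec):
    # Illustrative helper: map the 40 000 x 1 bag-of-words query into the k-dim space...
    q_k = np.diag(1 / s) @ (u.T @ user_vec)   # (20, 1)
    # ...and compare it with the documents in that same space: each column of vt
    # is one document. Cosine similarity, assuming no all-zero columns.
    q_k = q_k / np.linalg.norm(q_k)
    docs = vt / np.linalg.norm(vt, axis=0)
    return (q_k.T @ docs).ravel()             # one similarity score per document, shape (2000,)
Everything after the SVD then happens with arrays of shape (40 000, 20), (20,) and (20, 2000), which is where the reduction actually shows up.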