Finding the most similar documents (nearest neighbours) from a set of documents

Question

I have 80,000 documents that are about a very vast number of topics. What I want to do is for every article, provide links to recommend other articles (something like top 5 related articles) that are similar to the one that a user is currently reading. If I don't have to, I'm not really interested in classifying the documents, just similarity or relatedness, and ideally I would like to output a 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation? similarity?) to other documents in the set.

I'm currently using NLTK to process the contents of the document and get ngrams, but from there I'm not sure what approach I should take to calculate the similarity between documents.

I read about using tf-idf and cosine similarity, however because of the vast number of topics I'm expecting a very high number of unique tokens, so multiplying two very long vectors might be a bad way to go about it. Also 80,000 documents might call for a lot of multiplication between vectors. (Admittedly, it would only have to be done once though, so it's still an option).

Is there a better way to get the distance between documents without creating a huge vector of ngrams? Spearman Correlation? or would a more low-tech approach like taking the top ngrams and finding other documents with the same ngrams in the top k-ngrams be more appropriate? I just feel like surely I must be going about the problem in the most brute force way possible if I need to multiply possibly 10,000 element vectors together 320 million times (sum of the arithmetic series 79,999 + 79,998... to 1).

Any advice for approaches or what to read up on would be greatly appreciated.

AN6U5 · Accepted Answer

So for K=5 you basically want to return the K-Nearest Neighbors to a particular document? In that case you should use the K-Nearest Neighbors algorithm. Scikit-Learn has some good text importing and normalizing routines (tfidf) and then its pretty easy to implement KNN.

The heuristics are basically just creating normalized word count vectors from all of the words in a document and then comparing the distance between the vectors. I would definitely swap out a few different distance metrics: Euclidean vs. Manhattan vs. Cosine Similarity for instance. The vectors aren't really long, they just sit in a high dimensional space. So you can fix the unique words issue you wrote of by just doing some dimensionality reduction through PCA or your favorite algo.

Its probably equally easy to do this in another package, but the documentation of scikit learn is top notch and makes it easy to learn quickly and thoroughly.

Finding the most similar documents (nearest neighbours) from a set of documents

Tags:

python

nltk

scikit-learn

fohx

1 Answers

AN6U5

Recent Activity

Donate For Us

Finding the most similar documents (nearest neighbours) from a set of documents

Tags:

python

nltk

scikit-learn

fohx

1 Answers

AN6U5

Related questions

Recent Activity

Donate For Us