 

How to find outliers in document classification with millions of documents?

I have a million documents which belong to different classes (100 classes). I want to find the outlier documents in each class (documents that don't belong to that class but were wrongly classified) and filter them out. I can compute document similarity using cosine similarity on the tokens of each document, but I am not able to apply this to filter out the wrongly classified documents for a given class. Example: consider 3 classes for simplicity, with the documents under them:

ClassA  ClassB  ClassC ... 
doc1    doc2    doc3 
doc4    doc5    doc6 
doc7    doc8    doc9 

How can I figure out, effectively and efficiently, that doc4 (and other similar docs) is wrongly classified in ClassA, so that my training data does not contain outliers?

asked Oct 20 '25 by Gaurav Chawla

2 Answers

This is a hard problem in unsupervised learning; it is usually called topic modelling. You can start by running the LDA (Latent Dirichlet Allocation) algorithm. I suggest using the gensim package for that. Don't run it on all your data at first: take 20-50 thousand documents to begin with. Once you have an initial classifier, select from your millions of documents only those that were assigned to some class with a probability above a certain threshold, and train LDA again on those. This should give you classes that are better separated. Then reclassify your data.
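A minimal sketch of that first step with gensim, assuming the documents are already tokenized into lists of words (the toy `sample_docs` below stands in for your 20-50 thousand document subset):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy stand-in; in practice this would be your 20-50k tokenized documents.
    sample_docs = [
        ["cheap", "flights", "rome", "hotel"],
        ["stock", "prices", "fell", "market"],
        ["flights", "delayed", "airport", "hotel"],
        ["market", "rally", "stock", "earnings"],
    ]

    dictionary = Dictionary(sample_docs)
    # On a real corpus, prune the vocabulary first, e.g.:
    # dictionary.filter_extremes(no_below=5, no_above=0.5)
    corpus = [dictionary.doc2bow(doc) for doc in sample_docs]

    # num_topics would be 100 to mirror your classes; 2 suffices for the toy data.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)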

LDA classifies documents in a "soft" way, so each document has a certain probability of belonging to each of your 100 classes. Usually, the documents that have a high probability of belonging to many classes at the same time are the badly classified ones.
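One way to exploit that, continuing the snippet above (the 0.7 cutoff is purely illustrative):

    # Keep only documents whose probability mass is concentrated in one topic.
    def max_topic_prob(tokens):
        bow = dictionary.doc2bow(tokens)
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        return max(prob for _, prob in dist)

    confident = [doc for doc in sample_docs if max_topic_prob(doc) >= 0.7]
    # Retrain LDA on `confident` to get better-separated topics, then reclassify.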

You can do all that without involving human labelers.

answered Oct 21 '25 by igrinis


Since you have labels for the 100 classes, this is in principle a fairly standard outlier detection problem: you need to find the documents that do not resemble most of the other documents carrying the same label.

As you suggest, you can use cosine similarity (on word counts, I assume) to score the similarity of pairs of documents. There are many practical issues with cosine similarity, such as the selection of important words, stemming, stop words and so on, and you may also wish to take word similarity into account via soft cosine similarity.
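For illustration, a toy example of cosine similarity on word-count vectors with scikit-learn; stop-word removal is built into the vectorizer, while stemming would need an extra step (e.g. NLTK):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "a cat lay on the mat",
        "stock prices fell sharply today",
    ]
    X = CountVectorizer(stop_words="english").fit_transform(docs)
    print(cosine_similarity(X[0], X[1]))  # high: the documents share most words
    print(cosine_similarity(X[0], X[2]))  # 0.0: no words in common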

It would be impractical to calculate all pairwise cosine similarities for such a large corpus, so you will need to summarise each class somehow. A simple method is to average the word counts of the documents in each class and measure the similarity between this model document and each member of the class; scoring a document then costs only a single cosine similarity. You can then reject some chosen percentile of documents as potentially misclassified, with the threshold comparable to the percentage of misclassified documents you expect. Obviously a higher threshold will eliminate more errors, but also more correctly classified documents.
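A sketch of that centroid approach (`texts` and `labels` below are toy stand-ins for your corpus, and the 25th-percentile cutoff is illustrative; on real data it should match your expected error rate):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy stand-ins for the million documents and their class labels.
    texts = ["cat sat mat", "cat lay mat", "dog chased cat", "stocks fell hard"]
    labels = np.array(["pets", "pets", "pets", "pets"])  # last doc is mislabeled

    X = CountVectorizer().fit_transform(texts)

    suspects = []  # indices of potentially misclassified documents
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        centroid = np.asarray(X[idx].mean(axis=0))          # the class "model document"
        sims = cosine_similarity(X[idx], centroid).ravel()  # one score per member
        cutoff = np.percentile(sims, 25)
        suspects.extend(idx[sims < cutoff].tolist())

    print(suspects)  # [3]: the stocks document scores lowest against the centroid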

A better implementation might apply a fast clustering algorithm separately to each of the 100 classes. The average word counts within the clusters would give you a handful of model documents for each label, and you would use the highest of the similarities to these models as a document's score.
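A sketch of that refinement, reusing `X` and `labels` in the shape of the previous snippet but intended for realistically sized classes (the cluster count of 5 is an assumption to tune):

    from sklearn.cluster import KMeans

    refined_suspects = []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        Xc = X[idx]
        # A handful of centroids per class instead of a single model document.
        km = KMeans(n_clusters=min(5, Xc.shape[0]), n_init=10, random_state=0).fit(Xc)
        sims = cosine_similarity(Xc, km.cluster_centers_)  # doc vs each model document
        best = sims.max(axis=1)                            # similarity to nearest model
        refined_suspects.extend(idx[best < np.percentile(best, 25)].tolist())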

answered Oct 21 '25 by rmac


