I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.
I have around 4.5 million text documents as training data and around 1 million as test data. The labels are around 35K.
I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer which didn't scale at all, now I am using HashVectorizer which is better but not that scalable given the number of documents that I have.
vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))
SKlearn provides a OneVsRestClassifier into which I can feed any estimator. For multi-label I found LinearSVC & SGDClassifier only to be working correctly. Acc to my benchmarks SGD outperforms LinearSVC both in memory & time. So, I have something like this
clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)
But this suffers from some serious issues:
There are two main methods for tackling a multi-label classification problem: problem transformation methods and algorithm adaptation methods. Problem transformation methods transform the multi-label problem into a set of binary classification problems, which can then be handled using single-class classifiers.
Definition. Multi-label learning is an extension of the standard supervised learning setting. In contrast to standard supervised learning where one training example is asso- ciated with a single class label, in multi-label learning one training example is associated with multiple class labels simultaneously.
Multi-Label Text Classification means a classification task with more than two classes; each label is mutually exclusive. The classification makes the assumption that each sample is assigned to one and only one label. On the opposite hand, Multi-label classification assigns to every sample a group of target labels.
I would do the multi-label part by hand. The OneVsRestClassifier treats them as independent problems anyhow. You can just create the n_labels many classifiers and then call partial_fit on them. You can't use a pipeline if you only want to hash once (which I would advise), though. Not sure about speeding up hashing vectorizer. You gotta ask @Larsmans and @ogrisel for that ;)
Having partial_fit on OneVsRestClassifier would be a nice addition, and I don't see a particular problem with it, actually. You could also try to implement that yourself and send a PR.
OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.HashingVectorizer, but I (one of the hashing code's authors) haven't come round to it yet.As for the number of features, it depends on the problem, but for large scale text classification 2^10 = 1024 seems very small. I'd try something around 2^18 - 2^22. If you train a model with L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With