I have been looking for a maximum entropy classification implementation which can deal with an output size of 500 classes and 1000 features. My training data has around 30,000,000 lines. I have tried MegaM, the 64-bit R maxent package, and the maxent tool from the University of Edinburgh, but as expected, none of them can handle this much data. However, the data set doesn't seem unusually large for NLP tasks of this nature. Are there any techniques I should be employing, or any suggestions for a toolkit I could use? I am trying to run this on a 64-bit Windows machine with 8GB of RAM, using Cygwin where required.
Vowpal Wabbit is currently regarded as the fastest large-scale learner, and its one-against-all mode handles the multiclass case. LibLinear is an alternative, but I'm not sure if it can handle a matrix of 3e10 elements (30,000,000 examples × 1,000 features). See the sketch below for getting your data into VW's format.
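Here is a minimal sketch of converting data into VW's input format for one-against-all training. The source file layout (an integer label followed by 1000 feature values per line) and the file names are assumptions; adjust the parsing to your data. VW's `--oaa` mode expects integer labels in 1..k, hence the shift from 0-based labels.

```python
# Assumed input layout per line: "label v1 v2 ... v1000" (label 0-based).
# Train afterwards with something like:
#   vw --oaa 500 -d train.vw --loss_function logistic -f model.vw
# (raising the hash size with -b may help avoid feature collisions).

def to_vw_line(label, values):
    """Format one example as 'label | f0:v0 f1:v1 ...', skipping zeros."""
    feats = " ".join(f"f{i}:{v}" for i, v in enumerate(values) if v != 0.0)
    return f"{label} | {feats}"

with open("train.txt") as src, open("train.vw", "w") as dst:
    for line in src:
        parts = line.split()
        label = int(parts[0]) + 1  # assumption: shift 0-based labels to 1..500
        values = [float(v) for v in parts[1:]]
        dst.write(to_vw_line(label, values) + "\n")
```

Since VW streams examples from disk and hashes features, memory use stays roughly constant regardless of the number of training lines, which is what makes it viable on an 8GB machine.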
Note that the term "MaxEnt" is used almost exclusively by NLP people; machine learning folks call it logistic regression or logit, so if you search for that you might find many more tools than when you search for MaxEnt.
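For instance, searching under "logistic regression" turns up scikit-learn's SGDClassifier, which can fit a logistic model out-of-core via partial_fit, so the 30M lines never have to sit in RAM at once. Below is a minimal sketch under the same assumed file layout as above; the chunk size, file name, and 0-based labels are assumptions, and the loss is spelled "log_loss" in recent scikit-learn versions ("log" in older ones).

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

N_CLASSES = 500  # assumption: labels are integers 0..499

# loss="log_loss" makes SGDClassifier optimize logistic regression
clf = SGDClassifier(loss="log_loss")

def chunks(path, chunk_size=100_000):
    """Yield (X, y) arrays from an assumed 'label v1 ... v1000' text file."""
    rows, labels = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            labels.append(int(parts[0]))
            rows.append([float(v) for v in parts[1:]])
            if len(rows) == chunk_size:
                yield np.array(rows), np.array(labels)
                rows, labels = [], []
    if rows:
        yield np.array(rows), np.array(labels)

all_classes = np.arange(N_CLASSES)
for X, y in chunks("train.txt"):
    # classes= must list every label, since no single chunk sees all 500
    clf.partial_fit(X, y, classes=all_classes)
```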