I am working on 3D point identification using the RandomForest classifier from scikit-learn. One of the issues I keep running into is that certain classes are present far more often than other classes.
This means that in the process of generating predictions from the trained classifier, if the classifier is uncertain of a point class it will more likely assume it belongs to one of the common classes rather than the less common class.
I see that in the scikit-learn documentation for random forests there is a sample_weight parameter in the fit method. From what I can tell, that only weights whole samples (say I have 50 files I am training from; it would weight the first file twice as heavily as everything else) rather than classes.
This doesn't fix the issue because the least common classes are about equally rare in every sample I have. It's just the nature of those particular classes.
I've found some papers on balanced random forests and weighted random forests, but I haven't seen anything about how to use them in scikit-learn. I'm hoping I'm wrong - is there a built-in way to weight classes? Or should I write something separate that artificially evens up the weight of the different classes in my samples?
According to the documentation, sample_weight seems to refer to samples and not to classes. So if I have files A, B and C and classes 1, 2 and 3, let's say:
A = [1 1 1 2]
B = [2 2 1 1]
C = [3 1 1 1]
Looking above we have a situation, very simplified, in which we have very few of class 3 compared to the other classes. My situation has 8 classes and is training on millions of points but the ratio is still incredibly skewed against two particular classes.
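To make the skew in the toy example concrete, here is a minimal sketch that pools the labels from the three files and counts each class. The `files` dictionary is just an illustrative container for the arrays above, not part of any real pipeline.

```python
from collections import Counter

# Labels per file, copied from the toy example above.
files = {
    "A": [1, 1, 1, 2],
    "B": [2, 2, 1, 1],
    "C": [3, 1, 1, 1],
}

# Pool every point from every file and count how often each class occurs.
all_labels = [label for labels in files.values() for label in labels]
counts = Counter(all_labels)
total = len(all_labels)

for cls in sorted(counts):
    share = counts[cls] / total
    print(f"class {cls}: {counts[cls]} of {total} points ({share:.1%})")
```

Class 3 appears once in the 12 pooled points, which is the imbalance described here in miniature.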
Using sample_weight, which takes an array of size m (m being the number of samples), I would be able to control how heavily each of those three files counts. So my understanding is that I can pass sample_weight = [1, 1, 2], which would make sample C count twice as much as the other two samples.
However, this doesn't really help, because my issue is that class 3 is extremely rare (in the actual data it's 1k points out of millions rather than 1 out of 12).
Increasing the weight of any given sample won't increase the weight of particular classes unless I fake some data in which the sample is composed of almost nothing but that particular class.
I found sklearn.preprocessing.balance_weights(y) in the documentation but I can't find anyone using it. In theory it does what I need it to do but I don't see how I can fit the weights array back into my Random Forest.
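One possible way to feed per-class weights back into the forest is sketched below. It rests on the assumption that `sample_weight` in `fit` takes one weight per training row (per point), not per file, so a weight can be looked up from each point's class label. The sketch uses `compute_sample_weight` from `sklearn.utils.class_weight` as a stand-in for `balance_weights`, and the features `X` are random placeholder data, not real point-cloud features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Pooled labels from the toy example: eight 1s, three 2s, one 3.
y = np.array([1] * 8 + [2] * 3 + [3] * 1)

# Placeholder features (assumption): three random values per point.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y), 3))

# One weight per point, inversely proportional to its class frequency:
# n_samples / (n_classes * count_of_that_class).
weights = compute_sample_weight(class_weight="balanced", y=y)

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X, y, sample_weight=weights)
```

With these weights every class contributes the same total weight to training, so the single class 3 point counts as much as all eight class 1 points combined.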
I'm guessing this only applies to newer versions of scikit-learn, but you can now pass class_weight directly to the classifier:
rf = RandomForestClassifier(class_weight="balanced")
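A slightly fuller sketch of the same idea, assuming a scikit-learn version where `RandomForestClassifier` accepts `class_weight`. The data here is synthetic and only illustrative: a two-class problem with a 95/5 split and three made-up features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, heavily imbalanced labels (assumption): 950 of class 0, 50 of class 1.
y = np.array([0] * 950 + [1] * 50)

# Made-up features; the rare class is shifted so it is learnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y), 3))
X[y == 1] += 1.0

# "balanced" reweights classes inversely to their frequency in y.
rf = RandomForestClassifier(
    n_estimators=50,
    class_weight="balanced",
    random_state=0,
)
rf.fit(X, y)

# Predicted class probabilities for the first few points.
print(rf.predict_proba(X[:3]))
```

`class_weight` also accepts an explicit dictionary such as `{0: 1, 1: 10}` if hand-picked weights are preferred, and `"balanced_subsample"` recomputes the weights on each tree's bootstrap sample.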