I am trying to create an ensemble of three classifiers (Random Forest, Support Vector Machine and XGBoost) using the VotingClassifier() in scikit-learn. However, I find that the accuracy of the ensemble actually decreases instead of increasing. I can't figure out why.
Here is the code:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

# Soft-voting ensemble of the three tuned classifiers; XGBoost gets double weight.
eclf = VotingClassifier(estimators=[('rf', rf_optimized), ('svc', svc_optimized), ('xgb', xgb_optimized)],
                        voting='soft', weights=[1, 1, 2])

# Score each base model and the ensemble with 10-fold cross-validation.
for clf, label in zip([rf_optimized, svc_optimized, xgb_optimized, eclf], ['Random Forest', 'Support Vector Machine', 'XGBoost', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    print("Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores.mean(), scores.std(), label))
XGBoost has the highest accuracy, so I even tried giving it more weight, to no avail.
What could I be doing wrong?
A VotingClassifier is not guaranteed to outperform its base models, especially with soft voting when the base models are poorly calibrated.
For a contrived example, say every model is very confident when it is wrong (it gives a probability of .99 to the incorrect class) but only barely confident when it is right (it gives a probability of .51 to the correct class). Furthermore, say 'rf' and 'svc' are always right when 'xgb' is wrong and vice versa, and each classifier has an accuracy of 50% on its own.
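The voting classifier that you implemented would then have an accuracy of 0%, since you are using soft voting. Soft voting averages the predicted probabilities, so on every sample the confident wrong prediction outweighs the barely confident correct ones and the averaged probability of the correct class stays below .5.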
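The arithmetic behind that 0% can be checked with a short sketch. It is only an illustration of the contrived example, not of your real models: it assumes a two-class problem (so .99 on the incorrect class means .01 on the correct one), it reuses the weights [1, 1, 2] from the question, and the names soft_vote, case_a and case_b are made up for this sketch.

```python
# Hypothetical illustration of the contrived example; no real models involved.
# Assumes two classes, so P(correct) = 1 - P(incorrect).
WEIGHTS = {"rf": 1, "svc": 1, "xgb": 2}  # same weights as in the question


def soft_vote(p_correct, weights):
    """Weighted average of each model's probability for the correct class."""
    total = sum(weights.values())
    return sum(weights[name] * p for name, p in p_correct.items()) / total


# Case A: rf and svc are barely right, xgb is confidently wrong.
case_a = {"rf": 0.51, "svc": 0.51, "xgb": 0.01}
# Case B: xgb is barely right, rf and svc are confidently wrong.
case_b = {"rf": 0.01, "svc": 0.01, "xgb": 0.51}

for label, case in [("A", case_a), ("B", case_b)]:
    p = soft_vote(case, WEIGHTS)
    verdict = "right" if p > 0.5 else "wrong"
    print(f"Case {label}: ensemble P(correct) = {p:.2f} -> {verdict}")
```

Both cases come out at 0.26 for the correct class, so the ensemble picks the wrong class every time even though each model is right half the time. Equal weights do not rescue it either: the averages become roughly 0.34 and 0.18.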