I built an scikit-learn model and I want to reuse in a daily python cron job (NB: no other platforms are involved - no R, no Java &c).
I pickled it (actually, I pickled my own object whose one field is a GradientBoostingClassifier), and I un-pickle it in the cron job. So far so good (and has been discussed in Save classifier to disk in scikit-learn and Model persistence in Scikit-Learn?).
However, I upgraded sklearn and now I get these warnings:
.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator DecisionTreeRegressor from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator PriorProbabilityEstimator from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
What do I do now?
I can downgrage to 0.18.1 and stick with it until I am ready to rebuild the model. For various reasons I find this unacceptable.
I can un-pickle the file and re-pickle it again. This worked with 0.18.2, but breaks with 0.19. NFG. joblib looks no better.
I wish I could save the data in a version-independent ASCII format (e.g., JSON or XML). This is, obviously, the optimal solution, but there seems to be NO way to do that (see also Sklearn - model persistence without pkl file).
I could save the model to PMML, but its support is lukewarm at best:
I can use sklearn2pmml to save the model (although not easily), and augustus/lightpmmlpredictor to apply (although not load) the model. However, none of those is available to pip directly, which makes deployment a nightmare. Also, the augustus & lightpmmlpredictor projects seem to be dead. Importing PMML models into Python (Scikit-learn) - nope.
A variant of the above: save PMML using sklearn2pmml, and use openscoring for scoring. Requires interfacing with an external process. Yuk.
Suggestions?
Model persistence across different versions of scikit-learn is generally impossible. The reason is obvious: you pickle Class1 with one definition, and want to unpickle it into Class2 with another definition. 
You can:
Class1 will work also for Class2.  GradientBoostingClassifier and restore it from this serialized form, and hope that it would work better than pickle.I made an example of how you can convert a single DecisionTreeRegressor into a pure list-and-dict format, fully JSON-compatible, and restore it back. 
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_classification
### Code to serialize and deserialize trees
LEAF_ATTRIBUTES = ['children_left', 'children_right', 'threshold', 'value', 'feature', 'impurity', 'weighted_n_node_samples']
TREE_ATTRIBUTES = ['n_classes_', 'n_features_', 'n_outputs_']
def serialize_tree(tree):
    """ Convert a sklearn.tree.DecisionTreeRegressor into a json-compatible format """
    encoded = {
        'nodes': {},
        'tree': {},
        'n_leaves': len(tree.tree_.threshold),
        'params': tree.get_params()
    }
    for attr in LEAF_ATTRIBUTES:
        encoded['nodes'][attr] = getattr(tree.tree_, attr).tolist()
    for attr in TREE_ATTRIBUTES:
        encoded['tree'][attr] = getattr(tree, attr)
    return encoded
def deserialize_tree(encoded):
    """ Restore a sklearn.tree.DecisionTreeRegressor from a json-compatible format """
    x = np.arange(encoded['n_leaves'])
    tree = DecisionTreeRegressor().fit(x.reshape((-1,1)), x)
    tree.set_params(**encoded['params'])
    for attr in LEAF_ATTRIBUTES:
        for i in range(encoded['n_leaves']):
            getattr(tree.tree_, attr)[i] = encoded['nodes'][attr][i]
    for attr in TREE_ATTRIBUTES:
        setattr(tree, attr, encoded['tree'][attr])
    return tree
## test the code
X, y = make_classification(n_classes=3, n_informative=10)
tree = DecisionTreeRegressor().fit(X, y)
encoded = serialize_tree(tree)
decoded = deserialize_tree(encoded)
assert (decoded.predict(X)==tree.predict(X)).all()
Having this, you can go on to serialize and deserialize the whole GradientBoostingClassifier:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble.gradient_boosting import PriorProbabilityEstimator
def serialize_gbc(clf):
    encoded = {
        'classes_': clf.classes_.tolist(),
        'max_features_': clf.max_features_, 
        'n_classes_': clf.n_classes_,
        'n_features_': clf.n_features_,
        'train_score_': clf.train_score_.tolist(),
        'params': clf.get_params(),
        'estimators_shape': list(clf.estimators_.shape),
        'estimators': [],
        'priors':clf.init_.priors.tolist()
    }
    for tree in clf.estimators_.reshape((-1,)):
        encoded['estimators'].append(serialize_tree(tree))
    return encoded
def deserialize_gbc(encoded):
    x = np.array(encoded['classes_'])
    clf = GradientBoostingClassifier(**encoded['params']).fit(x.reshape(-1, 1), x)
    trees = [deserialize_tree(tree) for tree in encoded['estimators']]
    clf.estimators_ = np.array(trees).reshape(encoded['estimators_shape'])
    clf.init_ = PriorProbabilityEstimator()
    clf.init_.priors = np.array(encoded['priors'])
    clf.classes_ = np.array(encoded['classes_'])
    clf.train_score_ = np.array(encoded['train_score_'])
    clf.max_features_ = encoded['max_features_']
    clf.n_classes_ = encoded['n_classes_']
    clf.n_features_ = encoded['n_features_']
    return clf
# test on the same problem
clf = GradientBoostingClassifier()
clf.fit(X, y);
encoded = serialize_gbc(clf)
decoded = deserialize_gbc(encoded)
assert (decoded.predict(X) == clf.predict(X)).all()
This works for scikit-learn v0.19, but don't ask me what will come in the next versions to break this code. I'm neither a prophet nor a developer of sklearn.
If you want to be fully independent of new versions of sklearn, the safest thing is to write a function that traverses a serialized tree and makes the prediction, instead of re-creating an sklearn tree.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With