Python sklearn RandomForestClassifier non-reproducible results

Question

I've been using sklearn's random forest to compare several models, and I noticed that it gives different results even with the same seed. I tried it both ways: calling random.seed(1234), and passing the built-in random_state=1234 parameter. In both cases I get non-repeatable results. What have I missed?

# 1
random.seed(1234)
RandomForestClassifier(max_depth=5, max_features=5, criterion='gini', min_samples_leaf = 10)
# or 2
RandomForestClassifier(max_depth=5, max_features=5, criterion='gini', min_samples_leaf = 10, random_state=1234)

Any ideas? Thanks!!

EDIT: Adding a more complete version of my code

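import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=60, max_features=60,
                        criterion='entropy',
                        min_samples_leaf = 3, random_state=seed)
# As described, I tried random_state in several ways, still different results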
clf = clf.fit(X_train, y_train)

predicted = clf.predict(X_test)
predicted_prob = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(np.array(y_test), predicted_prob)
auc = metrics.auc(fpr,tpr)
print (auc)

EDIT: It's been quite a while, but I think using RandomState might solve the problem. I haven't tested it myself yet, but if you're reading this, it's worth a shot. Also, it is generally preferable to use RandomState instead of random.seed().
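A quick way to check whether a fixed seed is enough is to fit two independent forests with the same integer random_state and compare their predicted probabilities. This is a minimal sketch, not the original setup: the data comes from make_classification as a stand-in for X_train / y_train, and fit_and_score is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the question's training data (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def fit_and_score(seed):
    """Fit a fresh forest with an integer seed; return class-1 probabilities."""
    clf = RandomForestClassifier(n_estimators=20, max_depth=5,
                                 min_samples_leaf=10, random_state=seed)
    clf.fit(X, y)
    return clf.predict_proba(X)[:, 1]


first = fit_and_score(1234)
second = fit_and_score(1234)

# Two separate fits with the same integer seed should agree exactly.
same_seed_matches = np.array_equal(first, second)
print(same_seed_matches)
```

If this check passes on your data but the AUC still changes between runs, the variation is probably coming from somewhere else, for example an unseeded train/test split.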

seralouk · Accepted Answer

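First make sure that you have the latest versions of the needed modules (e.g. scipy, numpy). Note that random.seed(1234) seeds the NumPy generator only when random refers to numpy.random, as it does after from numpy import * in the code below. Python's built-in random module is a separate generator that scikit-learn does not use.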
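Which generator a random.seed call touches depends on what the name random is bound to. A minimal sketch, assuming only NumPy is installed, that contrasts Python's built-in random module with numpy.random:

```python
import random  # Python's built-in generator, separate from NumPy's

import numpy as np

# Seeding the built-in module leaves NumPy's global generator untouched,
# so the two NumPy draws below continue one stream and differ.
random.seed(1234)
a = np.random.rand(3)
random.seed(1234)
b = np.random.rand(3)
stdlib_seed_affects_numpy = np.array_equal(a, b)

# Seeding NumPy's global generator makes its draws repeatable.
np.random.seed(1234)
c = np.random.rand(3)
np.random.seed(1234)
d = np.random.rand(3)
numpy_seed_is_repeatable = np.array_equal(c, d)

print(stdlib_seed_affects_numpy, numpy_seed_is_repeatable)
```

So if random was imported with import random rather than taken from NumPy, the first snippet in the question would not fix the forest's randomness.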

The random_state parameter of RandomForestClassifier accepts an int, a RandomState instance, or None.

From the docs:

If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
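The three cases can be tried out with sklearn.utils.check_random_state, a scikit-learn helper that turns a seed argument into a RandomState. The sketch below assumes estimators resolve random_state in the same way; the variable names are only for illustration.

```python
import numpy as np
from sklearn.utils import check_random_state

# int: a new RandomState seeded with that integer is built on every call,
# so the draws repeat.
draw_a = check_random_state(1234).randint(0, 1000, size=5)
draw_b = check_random_state(1234).randint(0, 1000, size=5)
int_is_repeatable = np.array_equal(draw_a, draw_b)

# RandomState instance: the very same object is handed back, so its state
# keeps advancing across uses instead of restarting.
rng = np.random.RandomState(1234)
instance_is_reused = check_random_state(rng) is rng

# None: the global np.random generator is used, so np.random.seed controls it.
np.random.seed(1234)
draw_global = check_random_state(None).randint(0, 1000, size=5)
none_follows_global_seed = np.array_equal(draw_global, draw_a)

print(int_is_repeatable, instance_is_reused, none_follows_global_seed)
```

One consequence of the second case: sharing a single RandomState instance between two estimators does not give them identical randomness, because the second one starts from wherever the first one left the generator.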

One way to use the same (NumPy) generator in both cases is shown below. Both classifiers give the same, reproducible results.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from numpy import *

X, y = make_classification(n_samples=1000, n_features=4,
                       n_informative=2, n_redundant=0,
                       random_state=0, shuffle=False)

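# After `from numpy import *`, `random` is numpy.random: this seeds NumPy's global generator.
random.seed(1234)
clf = RandomForestClassifier(max_depth=2)  # random_state=None -> uses the global generator
clf.fit(X, y)

# random.seed() returns None, so this is random_state=None again; the call only re-seeds the global generator.
clf2 = RandomForestClassifier(max_depth=2, random_state = random.seed(1234))
clf2.fit(X, y)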

Check if the results are the same:

all(clf.predict(X) == clf2.predict(X))
#True

Check after running the same code 5 times:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from numpy import *

for i in range(5):

    X, y = make_classification(n_samples=1000, n_features=4,
                       n_informative=2, n_redundant=0,
                       random_state=0, shuffle=False)

    random.seed(1234)
    clf = RandomForestClassifier(max_depth=2)
    clf.fit(X, y)

    clf2 = RandomForestClassifier(max_depth=2, random_state = random.seed(1234))
    clf2.fit(X, y)

    print(all(clf.predict(X) == clf2.predict(X)))

Results:

True
True
True
True
True

Tags: python, random, machine-learning, random-forest, reproducible-research

Ruslan
