Scikit-learn: train/test split not reproducible

Question

I'm using scikit-learn's train_test_split functionality and am getting different results when running the same code repeatedly:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)

When I log the number of unique elements in y_train:

logger.info(len(set(y_train)))

I get different values on repeated runs (with no code changes). I would have thought the random_state would ensure a deterministic split.

How can I ensure the same split each time?

Tim · Accepted Answer

The randomness is not caused by train_test_split, as you can see by running this minimal example several times:

from sklearn.model_selection import train_test_split

x = list(range(50))
y = list(range(50))
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=44)

print(x_train)

You probably have another source of randomness elsewhere in your code, for example in how x and y are built or shuffled with numpy or pandas before the split.
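To illustrate how upstream randomness can defeat a fixed random_state, here is a minimal sketch. It assumes the data is shuffled with NumPy before the split; the helper split_after_shuffle and the toy arrays are hypothetical and not taken from the question's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_after_shuffle(rng):
    """Shuffle toy data with `rng`, then split with a fixed random_state."""
    x = np.arange(50)
    y = np.arange(50)
    # Upstream shuffle: this step is independent of train_test_split's seed.
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    return train_test_split(x, y, test_size=0.1, random_state=42)


# Unseeded generator: the input order changes on every call, so the split
# changes too, even though random_state is fixed.
unseeded_a = split_after_shuffle(np.random.default_rng())[0]
unseeded_b = split_after_shuffle(np.random.default_rng())[0]

# Seeded generator: the input order is the same, so the split is the same.
seeded_a = split_after_shuffle(np.random.default_rng(0))[0]
seeded_b = split_after_shuffle(np.random.default_rng(0))[0]

print(np.array_equal(unseeded_a, unseeded_b))  # almost certainly False
print(np.array_equal(seeded_a, seeded_b))      # True
```

If your real pipeline behaves like the unseeded case, seeding the upstream step (or removing it) should make the split repeatable. Another thing worth checking is whether x and y are built by iterating over a set of strings, since that iteration order can vary between Python processes.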

Suleiman · Answer

The particular value you pass as random_state does not matter (42 appears in many scikit-learn examples). What matters is that the value is the same on every run, so that you can validate your code repeatedly.
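A quick way to check this yourself is to call train_test_split twice on the same input and compare the results. This is only a sketch on toy lists; the variable names first, second and other are illustrative.

```python
from sklearn.model_selection import train_test_split

x = list(range(50))
y = list(range(50))

# Two calls with the same random_state on the same input.
first = train_test_split(x, y, test_size=0.1, random_state=42)
second = train_test_split(x, y, test_size=0.1, random_state=42)

# A call with a different random_state, for comparison.
other = train_test_split(x, y, test_size=0.1, random_state=7)

same_seed_matches = first == second
print(same_seed_matches)     # True: same seed and same input give the same split
print(first[0] == other[0])  # a different seed will usually give a different split
```

If the same check passes on your real x and y within one run but the logged counts still differ between runs, the inputs themselves are probably changing before they reach train_test_split.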

There may be some other source of randomness in your code that produces the different results. Could you post your complete code?
