I'm using scikit-learn's train_test_split functionality and am getting different results when running the same code repeatedly:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)
When I log the number of unique elements in y_train:
logger.info(len(set(y_train)))
I get different values on repeated runs (with no code changes). I would have thought the random_state would ensure a deterministic split.
How can I ensure the same split each time?
The randomness is not caused by train_test_split, as you can see by running this minimal code multiple times:
from sklearn.model_selection import train_test_split
# Fixed, deterministic inputs: the integers 0..49
x = list(range(50))
y = list(range(50))
# With a fixed random_state, the split is the same on every run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=44)
print(x_train)
You probably have another source of randomness in your code, for example an unseeded NumPy or pandas operation that runs before the split.
The value you pass as random_state (42 appears in many scikit-learn examples) does not really matter; what matters is that it is the same on every run, so you can reproduce and validate your results.
There might be some other randomness in your code that produces different results. Could you post your complete code?
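One way to check whether the inputs themselves change between runs is to log a fingerprint of x and y just before calling train_test_split. The sketch below is only an illustration: the helper name fingerprint is hypothetical (not part of scikit-learn), and it assumes the data is a flat sequence of values whose repr is stable across runs, such as ints or strings.

```python
import hashlib


def fingerprint(data):
    """Return a short, order-sensitive digest of a sequence of values.

    Hypothetical helper for debugging; not part of scikit-learn.
    """
    # repr() of a list of ints/strings is stable across runs, so equal
    # inputs in the same order always produce the same digest.
    payload = repr(list(data)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Stand-in data; in real code, use the x and y passed to train_test_split.
x = list(range(50))
y = list(range(50))
print("x:", fingerprint(x), "y:", fingerprint(y))
```

If the logged values differ from one run to the next, the inputs are already different before the split, so the place to look is upstream (data loading, shuffling, or sampling) rather than train_test_split. If they are identical and the split still differs, check that the random_state argument really reaches the call.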