
Scikit train_test_split by an index

I have a pandas DataFrame indexed by date. Let's assume it runs from Jan-1 to Jan-30. I want to split this dataset into X_train, X_test, y_train, y_test, but I don't want to mix the dates: the train and test samples should be divided at a certain date (or index). I'm trying:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

But when I check the values, I see the dates are mixed. I want to split my data as:

Jan-1 to Jan-24 to train and Jan-25 to Jan-30 to test (as test_size is 0.2, that makes 24 to train and 6 to test)

How can I do this?

iso_9001_ asked Sep 06 '25 12:09



1 Answer

You should use:

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2, stratify=None)

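With shuffle=False the rows keep their original order, so the first 80% go to train and the last 20% go to test. random_state has no effect in that case and can be dropped. It only matters when shuffling is enabled, where random_state=None falls back to the global numpy.random state.

The scikit-learn documentation for train_test_split states that if shuffle=False, then stratify must be None (which is its default).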
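As a minimal sketch of the above, assuming a DataFrame already sorted by its date index: the column names, the year 2021, and the cutoff variable below are made up for illustration and are not from the question. It also shows an alternative that splits at an explicit date with label-based slicing, which may be clearer when the cutoff date is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 daily rows, Jan-1 to Jan-30 (the year is arbitrary).
dates = pd.date_range("2021-01-01", periods=30, freq="D")
X = pd.DataFrame({"feature": np.arange(30)}, index=dates)
y = pd.Series(np.arange(30) % 2, index=dates, name="target")

# shuffle=False keeps the row order, so the last 20% of rows form the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(X_train.index.min().date(), X_train.index.max().date())  # 2021-01-01 2021-01-24
print(X_test.index.min().date(), X_test.index.max().date())    # 2021-01-25 2021-01-30

# Alternative: split at an explicit date. Label slicing on a DatetimeIndex
# includes both endpoints, so the test slice starts one day after the cutoff.
cutoff = pd.Timestamp("2021-01-24")
X_train_by_date = X.loc[:cutoff]
X_test_by_date = X.loc[cutoff + pd.Timedelta(days=1):]
y_train_by_date = y.loc[:cutoff]
y_test_by_date = y.loc[cutoff + pd.Timedelta(days=1):]
```

Both approaches rely on the data being sorted by date beforehand; if it might not be, calling X.sort_index() (and the same for y) first is a reasonable precaution.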

Nihal answered Sep 09 '25 01:09
