Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In Pytorch, how can i shuffle a DataLoader?

I have a dataset with 10000 samples, where the classes are present in an ordered manner. First I loaded the data into an ImageFolder, then into a DataLoader, and I want to split this dataset into a train-val-test set. I know the DataLoader class has a shuffle parameter, but thats not good for me, because it only shuffles the data when enumeration happens on it. I know about the RandomSampler function, but with it, i can only take n amount of data randomly from the dataset, and i have no control of what is being taken out, so one sample might be present in the train,test and val set at the same time.

Is there a way to shuffle the data in a DataLoader? The only thing i need is the shuffle, after that i can subset the data.

like image 609
NeverSayEver Avatar asked Dec 02 '25 22:12

NeverSayEver


1 Answers

The Subset dataset class takes indices (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset). You can probably exploit that to get this functionality as below. Essentially, you can get away by shuffling the indices and then picking the subset of the dataset.

# suppose dataset is the variable pointing to whole datasets
N = len(dataset)

# generate & shuffle indices
indices = numpy.arange(N)
indices = numpy.random.permutation(indices)
# there are many ways to do the above two operation. (Example, using np.random.choice can be used here too

# select train/test/val, for demo I am using 70,15,15
train_indices = indices [:int(0.7*N)]
val_indices = indices[int(0.7*N):int(0.85*N)]
test_indices = indices[int(0.85*N):]

train_dataset = Subset(dataset, train_indices)
val_dataset = Subset(dataset, val_indices)
test_dataset = Subset(dataset, test_indices)
like image 96
Umang Gupta Avatar answered Dec 05 '25 18:12

Umang Gupta