I have beening using shuffle option for pytorch dataloader for many times. But I was wondering when this shuffle happens and whether it is performed dynamically during iteration. Take the following code as an example:
namesDataset = NamesDataset()
namesTrainLoader = DataLoader(namesDataset, batch_size=16, shuffle=True)
for batch_data in namesTrainLoader:
    print(batch_data)
When we define "namesTrainLoader", does that mean the shuffling is finished and the following iteration will be based on a fixed order of data? Will there be any randomness in the for loop after namesTrainLoader was defined?
I was trying to replace half of "batch_data" with some special value:
for batch_data in namesTrainLoader:
    batch_data[:8] = special_val
    pre = model(batch_data)
Let us say there will be infinite number of epoches, will "model" eventually see all the data in "namesTrainLoader"? Or half of the data of "namesTrainLoader" is actually lost to "model"?
Dataloader shuffles at every epoch.
A PyTorch dataloader will take your raw dataset and automatically slice it up into mini-batches. In addition, if your dataset has a lot of sequential labels that are the same, you can opt to use the shuffle option to have them automatically shuffled as you feed them into the training loop.
pytorch(buffer_size = 2048) . First, the dataloader randomly selects chunks of from the applicable tensors until the shuffle buffer is full. Next, the indices that are available in shuffle buffer are randomly sampled to construct the batches that are returned by the dataloader.
Yes, shuffling the validation/test data will not have any impact on the accuracy, loss etc. Shuffling is done during the training to make sure we aren't exposing our model to the same cycle (order) of data in every epoch.
The shuffling happens when the iterator is created. In the case of the for loop, that happens just before the for loop starts.
You can create the iterator manually with:
# Iterator gets created, the data has been shuffled at this point.
data_iterator = iter(namesTrainLoader)
By default the data loader uses torch.utils.data.RandomSampler if you set shuffle=True (without providing your own sampler). Its implementation is very straight forward and you can see where the data is shuffled when the iterator is created by looking at the RandomSampler.__iter__ method:
def __iter__(self):
    n = len(self.data_source)
    if self.replacement:
        return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
    return iter(torch.randperm(n).tolist())
The return statement is the important part, where the shuffling takes place. It simply creates a random permutation of the indices.
That means you will see your entire dataset every time you fully consume the iterator, just in a different order every time. Therefore there is no data lost (not including cases with drop_last=True) and your model will see all data at every epoch.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With