
Does a `DataLoader` created from `ConcatDataset` create a batch from a different files, or a single file?

Tags:

pytorch

I am working with multiple files, and multiple training samples in each file. I will use ConcatDataset as described here:

https://discuss.pytorch.org/t/dataloaders-multiple-files-and-multiple-rows-per-column-with-lazy-evaluation/11769/7

I need to have negative samples in addition to my true samples, and I need the negative samples to be randomly selected from all the training data files. So I am wondering: would the returned batch samples just be a random consecutive chunk from a single file, or would the batch span multiple random indexes across all the data files?

In case more details are needed about what I am trying to do exactly: I am trying to train on a TPU with PyTorch XLA.

Normally, for negative samples, I would just use a second Dataset and DataLoader. However, I am trying to train on TPUs with PyTorch XLA (the alpha was just released a few days ago: https://github.com/pytorch/xla ), and to do that I need to pass my DataLoader to a torch_xla.distributed.data_parallel.DataParallel object, like model_parallel(train_loop_fn, train_loader), as can be seen in these example notebooks:

https://github.com/pytorch/xla/blob/master/contrib/colab/resnet18-training-xrt-1-15.ipynb

https://github.com/pytorch/xla/blob/master/contrib/colab/mnist-training-xrt-1-15.ipynb

So I am now limited to a single DataLoader, which will need to handle both the true samples and the negative samples that need to be randomly selected from all my files.
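For reference, the training entry point in those notebooks looks roughly like the sketch below (paraphrased from the alpha torch_xla API at the time; MyModel and the loop body are placeholders, and the exact names may have changed since):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.data_parallel as dp

# One DataLoader is handed to DataParallel, which replicates the model
# and splits the loader across the available TPU devices.
devices = xm.get_xla_supported_devices()
model_parallel = dp.DataParallel(MyModel, device_ids=devices)  # MyModel is a placeholder

def train_loop_fn(model, loader, device, context):
    # forward/backward/optimizer step per batch, as in the notebooks
    ...

model_parallel(train_loop_fn, train_loader)  # only one loader fits here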

asked by SantoshGupta7


1 Answer

ConcatDataset is a custom class subclassed from torch.utils.data.Dataset. Let's take a look at an example.

class ConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        # return the i-th example from each of the wrapped datasets
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        # the combined length is limited by the shortest wrapped dataset
        return min(len(d) for d in self.datasets)

train_loader = torch.utils.data.DataLoader(
    ConcatDataset(
        dataset1,
        dataset2
    ),
    batch_size=args.batch_size,
    shuffle=True,
    num_workers=args.workers,
    pin_memory=True)

for i, (input, target) in enumerate(train_loader):
    ...

Here, two datasets, dataset1 (a list of examples) and dataset2, are combined to form a single training dataset. The __getitem__ function returns one example per index, and the DataLoader's sampler/BatchSampler decides which indices end up in each training mini-batch.

Would the returned batch samples just be a random consecutive chunk from a single file, or would the batch span multiple random indexes across all the data files?

Since you have combined all your data files into one dataset, it now depends on which sampler (or BatchSampler) you use to form mini-batches. PyTorch ships several samplers, for example RandomSampler, SequentialSampler, SubsetRandomSampler, and WeightedRandomSampler. See their usage in the documentation.
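For example, with shuffle=True (which the DataLoader above uses) the loader internally builds a RandomSampler, so each mini-batch is drawn from random indices across the whole combined dataset rather than from one file. A small self-contained sketch, with toy tensors standing in for your files:

import torch
from torch.utils.data import DataLoader, TensorDataset, RandomSampler

# toy data standing in for the combined files
data = TensorDataset(torch.arange(100).float().unsqueeze(1))

# passing a RandomSampler explicitly is equivalent to shuffle=True
loader = DataLoader(data, sampler=RandomSampler(data), batch_size=8)

for (batch,) in loader:
    print(batch.squeeze(1))  # indices are mixed from anywhere in the dataset
    break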

You can also define your own custom BatchSampler, as follows.

from torch.utils.data import Sampler

class MyBatchSampler(Sampler):
    def __init__(self, *params):
        # write your code here
        ...

    def __iter__(self):
        # write your code here
        # yield mini-batches (lists of indices)
        ...

    def __len__(self):
        # return the number of mini-batches
        ...

The __iter__ function should return an iterable over mini-batches (for example, by yielding lists of indices). You can implement your logic for forming mini-batches in this function.
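As an illustration, here is a minimal sketch of a batch sampler that draws each mini-batch from random positions across the whole combined dataset (the class name and sizes are placeholders, not part of the original answer):

import random
from torch.utils.data import Sampler

class RandomBatchSampler(Sampler):
    """Sketch: yields lists of indices drawn at random across the dataset."""
    def __init__(self, dataset_len, batch_size):
        self.dataset_len = dataset_len
        self.batch_size = batch_size

    def __iter__(self):
        indices = list(range(self.dataset_len))
        random.shuffle(indices)
        for start in range(0, self.dataset_len, self.batch_size):
            yield indices[start:start + self.batch_size]

    def __len__(self):
        # number of mini-batches per epoch
        return (self.dataset_len + self.batch_size - 1) // self.batch_size

# Usage: pass it as batch_sampler (mutually exclusive with batch_size/shuffle/sampler).
# loader = DataLoader(dataset, batch_sampler=RandomBatchSampler(len(dataset), 32))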

To randomly sample negative examples for training, one alternative is to pick a negative example for each positive example in the __init__ function of the ConcatDataset class.
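A rough sketch of that idea follows; the class name and the pairing scheme are illustrative assumptions (here negatives are simply drawn from random indices of the same combined dataset, chosen once up front in __init__):

import random
import torch

class ConcatDatasetWithNegatives(torch.utils.data.Dataset):
    """Sketch: pairs each positive example with a randomly pre-selected negative."""
    def __init__(self, *datasets):
        self.datasets = datasets
        total = min(len(d) for d in datasets)
        # pick one random negative index per positive, up front
        self.neg_indices = [random.randrange(total) for _ in range(total)]

    def __getitem__(self, i):
        positives = tuple(d[i] for d in self.datasets)
        negatives = tuple(d[self.neg_indices[i]] for d in self.datasets)
        return positives, negatives

    def __len__(self):
        return min(len(d) for d in self.datasets)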

answered by Wasi Ahmad


