I am working with multiple files, and multiple training samples in each file. I will use ConcatDataset as described here:
https://discuss.pytorch.org/t/dataloaders-multiple-files-and-multiple-rows-per-column-with-lazy-evaluation/11769/7
I need to have negative samples in addition to my true samples, and I need the negative samples to be randomly selected from all the training data files. So I am wondering: would the returned batch just be a random consecutive chunk from a single file, or would the batch span multiple random indexes across all the data files?
In case more details are needed about what exactly I am trying to do: normally, for negative samples, I would just use a second Dataset and DataLoader. However, I am trying to train on TPUs with PyTorch XLA (the alpha was released just a few days ago: https://github.com/pytorch/xla), and to do that I need to hand my DataLoader to a torch_xla.distributed.data_parallel.DataParallel object, as in model_parallel(train_loop_fn, train_loader), which can be seen in these example notebooks:
https://github.com/pytorch/xla/blob/master/contrib/colab/resnet18-training-xrt-1-15.ipynb
https://github.com/pytorch/xla/blob/master/contrib/colab/mnist-training-xrt-1-15.ipynb
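The calling pattern in those notebooks is roughly the following (a sketch reconstructed from memory of the alpha API, so exact names and the train_loop_fn signature may differ; MyNetwork is a placeholder model class):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.data_parallel as dp

# Replicate the network over all available TPU cores (alpha-era API).
devices = xm.get_xla_supported_devices()
model_parallel = dp.DataParallel(MyNetwork, device_ids=devices)

def train_loop_fn(model, loader, device, context):
    # Per-core training loop; `loader` is the portion of the single
    # train_loader that this core receives.
    for step, (data, target) in enumerate(loader):
        ...

# The one and only place data enters the training loop:
model_parallel(train_loop_fn, train_loader)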
So I am now limited to a single DataLoader, which will need to handle both the true samples and the negative samples that need to be randomly selected from all my files.
ConcatDataset is a custom class that is subclassed from torch.utils.data.Dataset. Let's take a look at one example.
class ConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        return min(len(d) for d in self.datasets)

train_loader = torch.utils.data.DataLoader(
    ConcatDataset(
        dataset1,
        dataset2
    ),
    batch_size=args.batch_size,
    shuffle=True,
    num_workers=args.workers,
    pin_memory=True)

for i, (input, target) in enumerate(train_loader):
    ...
Here, two datasets, dataset1 (a list of examples) and dataset2, are combined to form a single training dataset; each index returns one example from each of them as a tuple. The __getitem__ function returns one example from the dataset and is used by the BatchSampler to form the training mini-batches.
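Note that this particular ConcatDataset pairs the i-th example of each dataset and its length is the minimum of the individual lengths. If you instead want to concatenate per-file datasets end to end, as in the linked thread, the built-in torch.utils.data.ConcatDataset does exactly that. A minimal sketch, assuming each file can be wrapped in its own Dataset (CsvFileDataset and the data/*.csv path are hypothetical placeholders for your own file format):

import glob
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class CsvFileDataset(Dataset):
    """Hypothetical per-file dataset: parses one file and serves its rows."""
    def __init__(self, path):
        with open(path) as f:
            # Placeholder parsing: one comma-separated row of floats per line.
            self.rows = [torch.tensor([float(x) for x in line.split(',')])
                         for line in f]

    def __getitem__(self, i):
        return self.rows[i]

    def __len__(self):
        return len(self.rows)

# One Dataset per file, concatenated end to end into a single Dataset.
full_dataset = ConcatDataset([CsvFileDataset(p) for p in sorted(glob.glob('data/*.csv'))])
train_loader = DataLoader(full_dataset, batch_size=32, shuffle=True)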
Would the returned batch samples just be a random consecutive chunk from a single file, or would the batch span multiple random indexes across all the data files?
Since you have combined all your data files to form one dataset, it now depends on which sampler you use to draw the mini-batches. With shuffle=True, the DataLoader uses a RandomSampler, which draws indices uniformly from the whole combined dataset, so each batch can span examples from all of your files rather than being a consecutive chunk from a single file. There are several samplers implemented in PyTorch, for example RandomSampler, SequentialSampler, SubsetRandomSampler, and WeightedRandomSampler. See their usage in the documentation.
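A brief sketch of passing a sampler explicitly (combined_dataset and is_positive are placeholder names, not from the post above); note that the sampler and shuffle arguments are mutually exclusive:

from torch.utils.data import DataLoader, RandomSampler, WeightedRandomSampler

# Uniformly random indices over the whole combined dataset
# (this is what shuffle=True uses under the hood).
train_loader = DataLoader(combined_dataset,
                          batch_size=32,
                          sampler=RandomSampler(combined_dataset))

# Or weight the draw, e.g. to control how often negative examples appear;
# `weights` holds one weight per example (values here are made up).
weights = [1.0 if is_positive(i) else 0.5 for i in range(len(combined_dataset))]
train_loader = DataLoader(combined_dataset,
                          batch_size=32,
                          sampler=WeightedRandomSampler(weights, num_samples=len(weights)))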
You can also write your own custom BatchSampler, as follows.

from torch.utils.data import Sampler

class MyBatchSampler(Sampler):
    def __init__(self, *params):
        # write your code here
        ...

    def __iter__(self):
        # write your code here
        # return an iterable of mini-batches (lists of indices)
        ...

    def __len__(self):
        # return the number of mini-batches per epoch
        ...
The __iter__ function should return an iterable of mini-batches. You can implement your logic of forming mini-batches in this function.
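As one concrete way to fill in that skeleton (a sketch under the assumption that the combined dataset is called concat_dataset), the sampler below reshuffles all indices every epoch and yields them in fixed-size batches; it is passed to the DataLoader through the batch_sampler argument, in which case batch_size, shuffle, sampler, and drop_last must not be set:

import torch
from torch.utils.data import DataLoader, Sampler

class MyBatchSampler(Sampler):
    def __init__(self, dataset_len, batch_size):
        self.dataset_len = dataset_len
        self.batch_size = batch_size

    def __iter__(self):
        # A fresh permutation of all indices each epoch, so every
        # mini-batch mixes random positions from all the files.
        perm = torch.randperm(self.dataset_len).tolist()
        for start in range(0, self.dataset_len, self.batch_size):
            yield perm[start:start + self.batch_size]

    def __len__(self):
        # Number of mini-batches per epoch.
        return (self.dataset_len + self.batch_size - 1) // self.batch_size

train_loader = DataLoader(concat_dataset,
                          batch_sampler=MyBatchSampler(len(concat_dataset), 32),
                          num_workers=4,
                          pin_memory=True)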
To randomly sample negative examples for training, one alternative could be to pick negative examples for each positive example in the __init__ function of the ConcatDataset class.
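A rough sketch of that idea (positive_dataset and negative_pool are placeholder names; the negative pool would be the concatenation of all your files): each positive example gets one pre-selected random negative, so a single DataLoader over this dataset yields (positive, negative) pairs and satisfies the one-loader constraint of torch_xla's DataParallel.

import random
from torch.utils.data import Dataset

class PosNegDataset(Dataset):
    """Pairs every positive example with one randomly chosen negative."""
    def __init__(self, positive_dataset, negative_pool, seed=0):
        self.positives = positive_dataset
        self.negative_pool = negative_pool
        rng = random.Random(seed)
        # Pick one negative index per positive, drawn across the whole pool
        # (i.e. across all data files).
        self.neg_idx = [rng.randrange(len(negative_pool))
                        for _ in range(len(positive_dataset))]

    def __getitem__(self, i):
        return self.positives[i], self.negative_pool[self.neg_idx[i]]

    def __len__(self):
        return len(self.positives)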