I'm creating an image classification model with Inception V3 and have two classes. I've split my dataset and labels into two numpy arrays. The data is split with trainX and testX as the images and trainY and testY as the corresponding labels.
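import numpy as np
import keras  # assumption: the original may instead use `from tensorflow import keras`
from sklearn.model_selection import train_test_split

data = np.array(data, dtype="float") / 255.0
labels = np.array(labels, dtype="uint8")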
(trainX, testX, trainY, testY) = train_test_split(
                                data,labels, 
                                test_size=0.2, 
                                random_state=42) 
train_datagen = keras.preprocessing.image.ImageDataGenerator(
          zoom_range = 0.1,
          width_shift_range = 0.2, 
          height_shift_range = 0.2,
          horizontal_flip = True,
          fill_mode ='nearest') 
val_datagen = keras.preprocessing.image.ImageDataGenerator()
train_generator = train_datagen.flow(
        trainX, 
        trainY,
        batch_size=batch_size,
        shuffle=True)
validation_generator = val_datagen.flow(
                testX,
                testY,
                batch_size=batch_size) 
When I shuffle train_generator with ImageDataGenerator, will the images still match the corresponding labels? Also should the validation dataset be shuffled as well?
Keras ImageDataGenerator takes the original data, applies random transformations to it, and returns only the newly transformed data.
The pattern for using the ImageDataGenerator class is as follows: construct and configure an instance of the ImageDataGenerator class; retrieve an iterator by calling flow_from_directory() (or flow() for in-memory numpy arrays); use the iterator in the training or evaluation of a model.
If the generator is configured with options that need dataset statistics (for example featurewise normalization), compute them by calling the fit() function on the data generator and passing it your training dataset. The object returned by flow() or flow_from_directory() is an iterator that returns batches of image samples when requested.
flow_from_directory method: this method identifies classes automatically from the folder names. Its arguments include: directory: the path to the parent directory containing one sub-directory per class/label, each holding that class's images. classes: an optional list of the class sub-directory names for which images should be loaded.
Yes, the images will still match the corresponding labels so you can safely set shuffle to True. Under the hood it works as follows. Calling .flow() on the ImageDataGenerator will return you a NumpyArrayIterator object, which implements the following logic for shuffling the indices:
def _set_index_array(self):
    self.index_array = np.arange(self.n)
    if self.shuffle: # if shuffle==True, shuffle the indices
        self.index_array = np.random.permutation(self.n) 
self.index_array is then used to yield both the images (x) and the labels (y) (code truncated for readability):
def _get_batches_of_transformed_samples(self, index_array):
    batch_x = np.zeros(tuple([len(index_array)] + list(self.x.shape)[1:]),
                       dtype=self.dtype)
    # use index_array to get the x's
    for i, j in enumerate(index_array):
        x = self.x[j]
        ... # data augmentation is done here
        batch_x[i] = x
    ...
    # use the same index_array to fetch the labels
    output += (self.y[index_array],)
    return output
Check out the source code yourself; it might be easier to understand than you think.
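The index-based logic above can be mimicked in plain NumPy to see why images and labels stay paired. This is only a minimal sketch, not the Keras implementation: `shuffled_batches` and the toy data are hypothetical names made up for illustration, and no augmentation is applied.

```python
import numpy as np

def shuffled_batches(x, y, batch_size, seed=None):
    """Yield (x_batch, y_batch) pairs for one epoch, in shuffled order.

    Hypothetical helper: one permutation of the indices is drawn and then
    used to index BOTH arrays, which is what keeps each pair aligned.
    """
    rng = np.random.default_rng(seed)
    index_array = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch_idx = index_array[start:start + batch_size]
        yield x[batch_idx], y[batch_idx]

# Toy data: "image" i is filled with the value i, and its label is i % 2,
# so the label can always be re-derived from the pixel values.
x = np.arange(10, dtype="float32").reshape(10, 1, 1, 1) * np.ones((1, 2, 2, 3), dtype="float32")
y = np.arange(10) % 2

batches = list(shuffled_batches(x, y, batch_size=4, seed=0))
for xb, yb in batches:
    original_index = xb[:, 0, 0, 0].astype(int)
    # every shuffled image still carries its own label
    assert np.array_equal(original_index % 2, yb)
```

Because the same `batch_idx` selects from both `x` and `y`, the order within an epoch changes but the pairing never does.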
Shuffling the validation data shouldn't matter too much. The main point of shuffling is to introduce some extra stochasticity in the training process.
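One caveat is worth a small illustration: an aggregate metric such as accuracy does not depend on the order of the validation samples, but if you collect predictions from a shuffled generator and then compare them against testY in its original order, the two no longer line up. The NumPy sketch below uses made-up labels and a hypothetical "model" (simply the labels with the first ten flipped); it is an assumption-laden toy, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100)

# Hypothetical predictions: correct everywhere except the first 10 samples
preds = labels.copy()
preds[:10] = 1 - preds[:10]

# Accuracy with samples in their original order
acc_in_order = np.mean(preds == labels)

# Shuffle predictions and labels with the SAME permutation: same accuracy
perm = rng.permutation(len(labels))
acc_shuffled_pairs = np.mean(preds[perm] == labels[perm])

# Shuffle only the predictions (what happens when predictions come from a
# shuffled generator but are compared to unshuffled labels): misaligned
acc_misaligned = np.mean(preds[perm] == labels)

print(acc_in_order, acc_shuffled_pairs, acc_misaligned)
```

If you only rely on metrics computed during evaluation, either setting works; if you plan to compare predictions against testY yourself, keeping shuffle=False on the validation generator avoids the misalignment.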
shuffle defaults to True in flow(), so if you want to disable shuffling you must pass shuffle=False explicitly:
train_generator = train_datagen.flow(
        trainX, 
        trainY,
        batch_size=batch_size,
        shuffle=False)