I'm following this guide.
It shows how to download datasets from the new TensorFlow Datasets using the tfds.load() method:
import tensorflow_datasets as tfds

SPLIT_WEIGHTS = (8, 1, 1)
splits = tfds.Split.TRAIN.subsplit(weighted=SPLIT_WEIGHTS)

(raw_train, raw_validation, raw_test), metadata = tfds.load(
    'cats_vs_dogs', split=list(splits),
    with_info=True, as_supervised=True)
The next step shows how to apply a function to each item in the dataset using the map method:
def format_example(image, label):
    image = tf.cast(image, tf.float32)
    image = image / 255.0
    # Resize the image if required
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label

train = raw_train.map(format_example)
validation = raw_validation.map(format_example)
test = raw_test.map(format_example)
Then, to access the elements, we can use:

for features in ds_train.take(1):
    image, label = features["image"], features["label"]

OR

for example in tfds.as_numpy(train_ds):
    numpy_images, numpy_labels = example["image"], example["label"]
However, the guide doesn't mention anything about data augmentation. I want to use real-time data augmentation similar to that of Keras's ImageDataGenerator class. I tried using:

if np.random.rand() > 0.5:
    image = tf.image.flip_left_right(image)

and other similar augmentation functions in format_example(), but how can I verify that it's performing real-time augmentation and not replacing the original image in the dataset?
I could convert the complete dataset to a NumPy array by passing batch_size=-1 to tfds.load() and then use tfds.as_numpy(), but that would load all the images into memory, which isn't necessary. I should be able to use train = train.prefetch(tf.data.experimental.AUTOTUNE) to load just enough data for the next training loop.
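To make it concrete, here is a sketch of the pipeline I have in mind (just my assumption of how it would fit together, with format_example applying the random flip shown above):

train = raw_train.map(format_example)  # augmentation would happen inside format_example
train = train.shuffle(1000).batch(32)
# Keep only a few batches buffered ahead of training instead of the whole dataset
train = train.prefetch(tf.data.experimental.AUTOTUNE)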
tfds.core.DatasetInfo: if with_info is True, then tfds.load will return a tuple (ds, ds_info) containing dataset information (version, features, splits, num_examples, ...). Note that the ds_info object documents the entire dataset, regardless of the split requested.
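For example, a quick sketch of reading that metadata (fields as documented on tfds.core.DatasetInfo):

import tensorflow_datasets as tfds

ds, ds_info = tfds.load("cifar10", with_info=True)
print(ds_info.version)                       # dataset version
print(ds_info.features["image"].shape)       # (32, 32, 3)
print(ds_info.splits["train"].num_examples)  # 50000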
You are approaching the problem from the wrong direction.
First, download the data using tfds.load, cifar10 for example (for simplicity we will use the default TRAIN and TEST splits):
import tensorflow as tf
import tensorflow_datasets as tfds

dataloader = tfds.load("cifar10", as_supervised=True)
train, test = dataloader["train"], dataloader["test"]
(you can use custom tfds.Split objects to create validation datasets and others, see the documentation)
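For instance, a sketch using the split slicing syntax (this assumes a tfds version that supports it; older versions use tfds.Split.TRAIN.subsplit as in the question):

train, validation = tfds.load(
    "cifar10",
    split=["train[:90%]", "train[90%:]"],
    as_supervised=True,
)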
train and test are tf.data.Dataset objects, so you can use map, apply, batch and similar functions on each of them.
Below is an example, where I will (using tf.image mostly):

- convert each image to tf.float32 in the 0-1 range (don't use the stupid snippet from the official docs; this way ensures the correct image format)
- cache() the results, as those can be re-used after each repeat
- randomly flip each image left_to_right
- randomly change the contrast of each image
- shuffle the data and create batches

Here is the code doing the above (you can change the lambdas to functors or functions):
train = (
    train.map(
        # Convert images to tf.float32 in the [0, 1] range
        lambda image, label: (tf.image.convert_image_dtype(image, tf.float32), label)
    )
    .cache()  # Cache converted images; the random maps below still run on every pass
    .map(
        # Random horizontal flip, re-drawn each time the element is requested
        lambda image, label: (tf.image.random_flip_left_right(image), label)
    )
    .map(
        # Random contrast change, also re-drawn on every pass
        lambda image, label: (tf.image.random_contrast(image, lower=0.0, upper=1.0), label)
    )
    .shuffle(100)  # Shuffle with a buffer of 100 elements
    .batch(64)
    .repeat()  # Repeat indefinitely; epoch length is set via steps_per_epoch in fit
)
Such a tf.data.Dataset can be passed directly to Keras's fit, evaluate and predict methods.
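For instance, a minimal sketch (the tiny model below is a placeholder I made up for illustration, not part of the answer; note that repeat() makes the dataset infinite, so steps_per_epoch is required):

import tensorflow as tf

# Placeholder model, just to show the call shape
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# repeat() makes the dataset infinite, so tell Keras how long one epoch is
model.fit(train, epochs=10, steps_per_epoch=50000 // 64)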
I see you are highly suspicious of my explanation, so let's go through an example:
Here is one way to take a single element; admittedly it's unreadable and unintuitive, but you should be fine with it if you have done anything with Tensorflow:

# Horrible API is horrible
element = tfds.load(
    # Take one percent of test and take 1 element from it
    "cifar10",
    as_supervised=True,
    split=tfds.Split.TEST.subsplit(tfds.percent[:1]),
).take(1)
Using TensorFlow 2.0, one can actually do it without stupid workarounds (almost):
element = element.repeat(2)
# You can iterate through tf.data.Dataset now, finally...
images = [image[0] for image in element]
print(f"Are the same: {tf.reduce_all(tf.equal(images[0], images[1]))}")
And it unsurprisingly returns (no random map was applied, so both repeated elements are identical):
Are the same: True
The snippet below repeats a single element 5 times, with a random flip applied, and checks which are equal and which are different.
element = (
    tfds.load(
        # Take one percent of test and take 1 element
        "cifar10",
        as_supervised=True,
        split=tfds.Split.TEST.subsplit(tfds.percent[:1]),
    )
    .take(1)
    .map(lambda image, label: (tf.image.random_flip_left_right(image), label))
    .repeat(5)
)

images = [image[0] for image in element]

for i in range(len(images)):
    for j in range(i, len(images)):
        print(
            f"{i} same as {j}: {tf.reduce_all(tf.equal(images[i], images[j]))}"
        )
Output (in my case, each run will be different; since a horizontal flip has only two possible outcomes, some of the repeats coincide):
0 same as 0: True
0 same as 1: False
0 same as 2: True
0 same as 3: False
0 same as 4: False
1 same as 1: True
1 same as 2: False
1 same as 3: True
1 same as 4: True
2 same as 2: True
2 same as 3: False
2 same as 4: False
3 same as 3: True
3 same as 4: True
4 same as 4: True
You could convert each of those images to numpy as well and see the images for yourself using skimage.io.imshow, matplotlib.pyplot.imshow or other alternatives, for example:
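A quick matplotlib sketch (iterating the element dataset from the snippet above):

import matplotlib.pyplot as plt

for image, label in element:
    plt.imshow(image.numpy())  # convert the tf.Tensor to numpy for plotting
    plt.title(f"label: {label.numpy()}")
    plt.show()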
This answer provides a more comprehensive and readable view on data augmentation using Tensorboard and MNIST; you might want to check that one out (yeah, shameless plug, but useful I guess).