I often see text like this in TensorFlow tutorials:
To do this calculation, you need the column means. You would obviously need to compute these in real life, but for this example we'll just provide them.
For small or medium-sized CSV datasets, computing the mean is as easy as calling a pandas method on a DataFrame or using scikit-learn.
But if we have a large dataset, say a 50 GB CSV file, how do you calculate the mean or other similar statistics? TensorFlow Transform claims that it can calculate global summary statistics, but the documentation doesn't really explain how this works or how to integrate it into a workflow.
Here is the code example from their getting started guide.
import tensorflow as tf
import tensorflow_transform as tft
def preprocessing_fn(inputs):
  x = inputs['x']
  y = inputs['y']
  s = inputs['s']
  x_centered = x - tft.mean(x)
  y_normalized = tft.scale_to_0_1(y)
  s_integerized = tft.compute_and_apply_vocabulary(s)
  x_centered_times_y_normalized = x_centered * y_normalized
  return {
      'x_centered': x_centered,
      'y_normalized': y_normalized,
      'x_centered_times_y_normalized': x_centered_times_y_normalized,
      's_integerized': s_integerized
  }
The documentation says that this code will run tft.mean(x) over the entire dataset, but it is not clear how that will happen, since x is limited to the scope of a single batch. Yet here is the claim in the documentation:
While not obvious in the example above, the user defined preprocessing function is passed tensors representing batches and not individual instances, as happens during training and serving with TensorFlow. On the other hand, analyzers perform a computation over the entire dataset that returns a single value and not a batch of values. x is a Tensor with a shape of (batch_size,), while tft.mean(x) is a Tensor with a shape of ().
So the questions are
Does tft.mean() run over the entire dataset first, and only after computing the global mean does it begin to load batches?
Are there any more detailed or complete examples of using tft.transforms in a workflow? For example, can these transforms be included in a single batch preprocessing function in a tf.data.Dataset.map() call, or how else are they applied?
Suppose I am trying to write some code to calculate the average age of individuals in my TensorFlow dataset. Here is the code I have so far. Is this the best way to do something like this, or is there a better way?
I used the TensorFlow 2.0 make_csv_dataset(), which takes care of stacking the examples from the CSV file into a column structure. Note that I took the make_csv_dataset() code from the new tutorial on the TensorFlow website referenced in the link above.
import numpy as np

dataset = tf.data.experimental.make_csv_dataset(
    file_path,
    batch_size=32,
    label_name=LABEL_COLUMN,
    na_value="?",
    num_epochs=1,
    ignore_errors=True)

list_of_batch_means = []
# In TF 2.x a dataset is directly iterable in eager mode, so no one-shot iterator is needed.
for ex_features, ex_labels in dataset:
    # ex_features is a dict of column tensors, so len(ex_features) would count columns, not rows.
    batch_length = int(ex_features['age'].shape[0])
    batch_sum = tf.reduce_sum(ex_features['age'])
    list_of_batch_means.append(batch_sum.numpy() / batch_length)
average_age = np.mean(list_of_batch_means)
As a caveat, I divided batch_sum by the batch length manually instead of using tf.reduce_mean(), since the final batch will not necessarily be the same size as the other batches. This might be a minor issue if you have a lot of batches, but I wanted to be as accurate as possible.
Any suggestions would be appreciated.
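On the batch loop in the question: averaging per-batch means weights a short final batch as heavily as a full one, so the result is only approximate. A minimal sketch of an exact alternative, in plain Python with made-up ages and a hypothetical batches() helper standing in for the dataset iterator, is to accumulate a running sum and a running count and divide once at the end:

```python
def batches(values, batch_size):
    """Hypothetical stand-in for a batched dataset: yields lists of up to batch_size items."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def mean_of_batch_means(values, batch_size):
    """Approximate: every batch counts equally, even a short final one."""
    means = [sum(b) / len(b) for b in batches(values, batch_size)]
    return sum(means) / len(means)


def streaming_mean(values, batch_size):
    """Exact: keep a running sum and count, then divide once."""
    total, count = 0.0, 0
    for b in batches(values, batch_size):
        total += sum(b)
        count += len(b)
    return total / count


ages = [20, 30, 40, 50, 90]  # made-up data; with batch_size=2 the last batch has one value
approx = mean_of_batch_means(ages, batch_size=2)
exact = streaming_mean(ages, batch_size=2)
```

In the tf.data loop, the same idea would mean adding each batch's tf.reduce_sum result and its row count to two accumulators instead of appending a per-batch mean to a list.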
The most important concept in tf.Transform is the preprocessing function, which is the logical description of a transformation of the dataset. A preprocessing function accepts and returns a dictionary of Tensors. Two kinds of steps are used to carry out a preprocessing function:
Analyze step: It iterates through the whole dataset and creates a graph. For example, to calculate a mean, the full dataset is passed through to compute the average of a particular column (this step requires a full pass over the dataset).
Transform step: It uses the graph created in the analyze step to transform the complete dataset.
So the constants calculated in the analyze step are used in the transform step.
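The two phases can be illustrated without tf.Transform at all. The following is a conceptual sketch in plain Python, not the library's implementation: analyze() and transform() are hypothetical names, and the in-memory list of batches is made-up data standing in for a dataset that tf.Transform would process through Apache Beam. The analyze phase makes one full pass to reduce a column to a single constant, and the transform phase then applies that constant batch by batch.

```python
def analyze(batches, column):
    """Phase 1: full pass over the data, reducing one column to a global mean."""
    total, count = 0.0, 0
    for batch in batches:
        total += sum(batch[column])
        count += len(batch[column])
    return total / count


def transform(batch, column, global_mean):
    """Phase 2: per-batch step; the mean is already a fixed scalar here."""
    return {column + "_centered": [v - global_mean for v in batch[column]]}


# Made-up data split into uneven batches.
raw_batches = [{"x": [1.0, 2.0]}, {"x": [3.0, 4.0]}, {"x": [10.0]}]

# Phase 1 sees the whole dataset and produces one number.
x_mean = analyze(raw_batches, "x")

# Phase 2 sees one batch at a time and reuses that number.
centered = [transform(b, "x", x_mean) for b in raw_batches]
```

This mirrors why, in the documentation quoted in the question, x can have shape (batch_size,) while tft.mean(x) has shape (): by the time individual batches are transformed, the mean has already been computed over the full dataset.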
For better understanding, you can go through this video followed by this presentation which should solidify your understanding of how Tensorflow Transform works internally.