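from keras import backend as K
from keras.layers import BatchNormalization, Convolution2D

def conv2d_bn(x, nb_filter, nb_row, nb_col,
              border_mode='same', subsample=(1, 1),
              name=None):
    '''Utility function to apply conv + BN.
    '''
    # Derive the layer names from the optional base name.
    conv_name = name + '_conv' if name is not None else None
    bn_name = name + '_bn' if name is not None else None
    # Channel axis: 1 for Theano ('th') dim ordering, 3 for TensorFlow ('tf').
    bn_axis = 1 if K.image_dim_ordering() == 'th' else 3
    # The ReLU is applied inside the conv layer, so BN sees post-activation values.
    x = Convolution2D(nb_filter, nb_row, nb_col,
                      subsample=subsample,
                      activation='relu',
                      border_mode=border_mode,
                      name=conv_name)(x)
    x = BatchNormalization(axis=bn_axis, name=bn_name)(x)
    return x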
When I use the official inception_v3 model in Keras, I see that it applies BatchNormalization after the 'relu' nonlinearity, as in the code above.
But in the Batch Normalization paper, the authors say:
"We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b."
I then looked at the Inception implementation in TensorFlow, which adds BN immediately before the nonlinearity, as the paper describes. For more details, see inception ops.py.
I'm confused. Why do people use the style above in Keras rather than the following?
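from keras import backend as K
from keras.layers import Activation, BatchNormalization, Convolution2D

def conv2d_bn(x, nb_filter, nb_row, nb_col,
              border_mode='same', subsample=(1, 1),
              name=None):
    '''Utility function to apply conv + BN.
    '''
    # Derive the layer names from the optional base name.
    conv_name = name + '_conv' if name is not None else None
    bn_name = name + '_bn' if name is not None else None
    # Channel axis: 1 for Theano ('th') dim ordering, 3 for TensorFlow ('tf').
    bn_axis = 1 if K.image_dim_ordering() == 'th' else 3
    # No activation in the conv layer: BN normalizes the linear output first.
    x = Convolution2D(nb_filter, nb_row, nb_col,
                      subsample=subsample,
                      border_mode=border_mode,
                      name=conv_name)(x)
    x = BatchNormalization(axis=bn_axis, name=bn_name)(x)
    x = Activation('relu')(x)
    return x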
In the Dense case:
x = Dense(1024, name='fc')(x)
x = BatchNormalization(axis=-1, name=bn_name)(x)  # Dense output is 2D, so normalize the last (feature) axis
x = Activation('relu')(x)
I also use it before the activation, which is indeed how it was designed, and other libraries do the same, for example Lasagne's batch_norm: http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html#lasagne.layers.batch_norm
However, it seems that in practice placing it after the activation works a bit better:
https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md (this is just one benchmark though)
In addition to the original paper using batch normalization before the activation, the Deep Learning book by Goodfellow, Bengio, and Courville (section 8.7.1) gives some reasoning for why applying batch normalization after the activation (that is, directly to the input of the next layer) may cause some issues:
It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW+b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparameterization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.
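As a side note on the bias remark in the quote above, here is a minimal sketch in plain Python of why the bias becomes redundant. The `batch_norm` helper is a hypothetical stand-in for training-mode batch normalization of a single unit (with gamma = 1 and beta = 0), not any library's implementation, and the sample values and the bias of 3.7 are made up for illustration. Because the batch mean is subtracted, a constant added before normalization cancels out.

```python
import random

def batch_norm(values, eps=1e-5):
    """Standardize a batch of scalars: subtract the batch mean, divide by the batch std."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

random.seed(0)
# Stand-in for one unit's pre-activations XW over a batch of 256 examples.
pre_activations = [random.gauss(0.0, 2.0) for _ in range(256)]
bias = 3.7  # arbitrary bias b, chosen only for this illustration

without_bias = batch_norm(pre_activations)
with_bias = batch_norm([v + bias for v in pre_activations])

# Largest elementwise difference between BN(XW) and BN(XW + b).
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff)  # differs only by floating-point rounding
```

Any constant offset the layer needs can be learned through the β parameter of the batch normalization layer instead.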
The last two sentences of that quote mean that, with a relu activation, all negative values are mapped to zero. The result is a pile of exact zeros plus a tail of positive values: the mean is positive rather than zero, and the distribution is heavily skewed to the right. Batch normalization only shifts and rescales its input, so it can center that data but cannot turn it into a nice bell-shaped curve. For activations outside of the relu family this may not be as big of an issue.
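A quick simulation makes the shape of the post-relu distribution concrete. This is only a sketch under simple assumptions: the pre-activations are drawn from a standard normal distribution (real layer outputs need not be), and `moments` is a hypothetical helper written for this example. It compares mean, variance, and skewness before relu, after relu, and after standardizing the relu output.

```python
import random

def moments(values):
    """Return (mean, variance, skewness) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    skew = sum((v - mean) ** 3 for v in values) / n / var ** 1.5
    return mean, var, skew

random.seed(0)
# Assumed pre-activations: standard normal, symmetric around zero.
pre = [random.gauss(0.0, 1.0) for _ in range(100_000)]
# relu maps every negative value to exactly zero.
post = [max(0.0, v) for v in pre]

pre_mean, pre_var, pre_skew = moments(pre)
post_mean, post_var, post_skew = moments(post)

# Standardizing (what BN does, up to gamma and beta) is only a shift and a rescale.
standardized = [(v - post_mean) / post_var ** 0.5 for v in post]
std_mean, std_var, std_skew = moments(standardized)

print("before relu :", pre_mean, pre_var, pre_skew)
print("after relu  :", post_mean, post_var, post_skew)
print("standardized:", std_mean, std_var, std_skew)
```

The standardized values end up with mean 0 and variance 1, but their skewness is the same as that of the raw relu output, because a shift and a rescale cannot change the shape of a distribution.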
Keep in mind that there have been reports of models getting better results when using batch normalization after the activation, so it is probably worthwhile to test your model using both configurations.