
Batch normalization layer for CNN-LSTM

Tags: tensorflow, keras, lstm, conv-neural-network, batch-normalization

Suppose that I have a model like this (this is a model for time series forecasting):

ipt   = Input((data.shape[1] ,data.shape[2])) # 1
x     = Conv1D(filters = 10, kernel_size = 3, padding = 'causal', activation = 'relu')(ipt) # 2
x     = LSTM(15, return_sequences = False)(x) # 3
x = BatchNormalization()(x) # 4
out   = Dense(1, activation = 'relu')(x) # 5

Now I want to add a batch normalization layer to this network. Given that batch normalization doesn't work with LSTM, can I add it before the Conv1D layer? I think it's reasonable to have a batch normalization layer after the LSTM.

Also, where can I add Dropout in this network? The same places? (after or before batch normalization?)

What about adding AveragePooling1D between Conv1D and LSTM? Is it possible to add batch normalization between Conv1D and AveragePooling1D in this case without any effect on LSTM layer?


asked Dec 11 '19 11:12

Eghbal

1 Answer

Update: the LayerNormalization implementation I was using was inter-layer, not recurrent as in the original paper; results with the latter may prove superior.

BatchNormalization can work with LSTMs - the linked SO gives false advice; in fact, in my application of EEG classification, it dominated LayerNormalization. Now to your case:

- "Can I add it before Conv1D?" Don't. Instead, standardize your data beforehand; otherwise you're employing an inferior variant to do the same thing.
- Try both: BatchNormalization before an activation, and after. Apply this to both Conv1D and LSTM.
- If your model is exactly as you show it, BN after LSTM may be counterproductive because it can introduce noise, which can confuse the classifier layer. This is about BN sitting one layer before the output, not about LSTM.
- If you aren't using stacked LSTM with return_sequences=True preceding return_sequences=False, you can place Dropout anywhere: before LSTM, after, or both.
- Spatial Dropout: drop units / channels instead of random activations (see bottom). It was shown to be more effective at reducing coadaptation in CNNs in a paper by LeCun et al., with ideas applicable to RNNs. It can considerably increase convergence time, but also improve performance.
- recurrent_dropout is still preferable to Dropout for LSTM; however, you can do both. Just do not use it with activation='relu', for which LSTM is unstable per a bug.
- For data of your dimensionality, any sort of Pooling is redundant and may harm performance; scarce data is better transformed via a non-linearity than by simple averaging ops.
- I strongly recommend a SqueezeExcite block after your Conv; it's a form of self-attention (see paper). My 1D implementation is below.
- I also recommend trying activation='selu' with AlphaDropout and 'lecun_normal' initialization, per the paper Self-Normalizing Neural Networks.
- Disclaimer: the above advice may not apply to NLP and embed-like tasks.
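
To make the "standardize your data beforehand" point concrete, here is a minimal NumPy sketch. It assumes the data is shaped (samples, timesteps, features) and that statistics are computed per feature on the training split only; the helper names fit_standardizer and standardize, and the toy data, are hypothetical and not part of the original answer.

```python
import numpy as np

def fit_standardizer(train, eps=1e-8):
    """Per-feature mean/std, pooled over samples and timesteps of the training set."""
    mean = train.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, features)
    std = train.std(axis=(0, 1), keepdims=True)
    return mean, std + eps  # eps guards against constant features

def standardize(data, mean, std):
    """Apply previously fitted statistics to any split."""
    return (data - mean) / std

rng = np.random.default_rng(0)
# Toy data shaped (samples, timesteps, features); offset and scale are arbitrary.
train = rng.normal(loc=5.0, scale=3.0, size=(256, 21, 20))
val = rng.normal(loc=5.0, scale=3.0, size=(64, 21, 20))

mean, std = fit_standardizer(train)
train_std = standardize(train, mean, std)
val_std = standardize(val, mean, std)  # reuse training statistics, do not refit
```

Reusing the training statistics on validation data keeps information from later samples out of the preprocessing step, which matters for forecasting.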

Below is an example template you can use as a starting point; I also recommend these SO posts for further reading: Regularizing RNNs and Visualizing RNN gradients

from keras.layers import Input, Dense, LSTM, Conv1D, Activation
from keras.layers import AlphaDropout, BatchNormalization
from keras.layers import GlobalAveragePooling1D, Reshape, multiply
from keras.models import Model
import keras.backend as K
import numpy as np


def make_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x   = ConvBlock(ipt)
    x   = LSTM(16, return_sequences=False, recurrent_dropout=0.2)(x)
    # x   = BatchNormalization()(x)  # may or may not work well
    out = Dense(1, activation='relu')(x)  # the layer must be called on x to produce a tensor

    model = Model(ipt, out)
    model.compile('nadam', 'mse')
    return model

def make_data(batch_shape):  # toy data
    return (np.random.randn(*batch_shape),
            np.random.uniform(0, 2, (batch_shape[0], 1)))

batch_shape = (32, 21, 20)
model = make_model(batch_shape)
x, y  = make_data(batch_shape)

model.train_on_batch(x, y)

Functions used:

def ConvBlock(_input):  # cleaner code
    x   = Conv1D(filters=10, kernel_size=3, padding='causal', use_bias=False,
                 kernel_initializer='lecun_normal')(_input)
    x   = BatchNormalization(scale=False)(x)
    x   = Activation('selu')(x)
    x   = AlphaDropout(0.1)(x)
    out = SqueezeExcite(x)    
    return out

def SqueezeExcite(_input, r=4):  # r == "reduction factor"; see paper
    filters = K.int_shape(_input)[-1]

    se = GlobalAveragePooling1D()(_input)
    se = Reshape((1, filters))(se)
    se = Dense(filters//r, activation='relu',    use_bias=False,
               kernel_initializer='he_normal')(se)
    se = Dense(filters,    activation='sigmoid', use_bias=False, 
               kernel_initializer='he_normal')(se)
    return multiply([_input, se])
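
For intuition about what the SqueezeExcite function above computes, the following is a rough NumPy emulation of the same arithmetic. It is a sketch only: the weights are random stand-ins for the two bias-free Dense layers, and squeeze_excite_np is a hypothetical name, not a Keras API.

```python
import numpy as np

def squeeze_excite_np(x, w1, w2):
    """x: (batch, timesteps, channels); w1: (channels, channels // r); w2: (channels // r, channels)."""
    squeezed = x.mean(axis=1, keepdims=True)       # (batch, 1, channels): average over time
    hidden = np.maximum(squeezed @ w1, 0.0)        # ReLU bottleneck, (batch, 1, channels // r)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid gates, (batch, 1, channels)
    return x * gates, gates                        # gates broadcast over the time axis

rng = np.random.default_rng(0)
channels, r = 10, 4
x = rng.normal(size=(32, 21, channels))
w1 = rng.normal(size=(channels, channels // r))    # random stand-in weights
w2 = rng.normal(size=(channels // r, channels))

out, gates = squeeze_excite_np(x, w1, w2)
```

Each sample gets one gate per channel, so the block rescales whole channels rather than individual timesteps.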

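Spatial Dropout: pass noise_shape = (batch_size, 1, channels) to Dropout; the mask is then shared across timesteps, so whole channels are dropped, as illustrated at https://i.stack.imgur.com/gqdNB.png. See Git gist for code.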
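
The effect of that noise_shape can be emulated in plain NumPy. This is an illustrative sketch, not Keras's implementation: spatial_dropout_np is a hypothetical helper, and it assumes the usual inverted-dropout scaling of kept values by 1 / (1 - rate) during training.

```python
import numpy as np

def spatial_dropout_np(x, rate, rng):
    """Drop whole channels per sample; x has shape (batch, timesteps, channels)."""
    batch, _, channels = x.shape
    # One keep/drop decision per (sample, channel); the size-1 axis broadcasts over timesteps.
    keep = rng.random((batch, 1, channels)) >= rate
    return x * keep / (1.0 - rate), keep

rng = np.random.default_rng(0)
x = np.ones((4, 6, 8))
out, keep = spatial_dropout_np(x, rate=0.5, rng=rng)
```

Because the mask has a size-1 timestep axis, every timestep of a given channel is either kept (and rescaled) or zeroed together.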


answered Oct 18 '22 02:10

OverLordGoldDragon
