Keras classifier accuracy steadily increases during training then drops to 0.25 (local minimum?)

I have the following neural network, written in Keras with TensorFlow as the backend, which I'm running on Python 3.5 (Anaconda) on Windows 10:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Dense(100, input_dim=283, init='normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(150, init='normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(200, init='normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(200, init='normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(200, init='normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(4, init='normal', activation='sigmoid'))  # 4 output classes
    sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

I'm training on my GPU. During training (10000 epochs), the network's training accuracy steadily increases from 0.25 to somewhere between 0.7 and 0.9, before suddenly dropping to 0.25 and sticking there:

    Epoch 1/10000
    6120/6120 [==============================] - 1s - loss: 1.5329 - acc: 0.2665
    Epoch 2/10000
    6120/6120 [==============================] - 1s - loss: 1.2985 - acc: 0.3784
    Epoch 3/10000
    6120/6120 [==============================] - 1s - loss: 1.2259 - acc: 0.4891
    Epoch 4/10000
    6120/6120 [==============================] - 1s - loss: 1.1867 - acc: 0.5208
    Epoch 5/10000
    6120/6120 [==============================] - 1s - loss: 1.1494 - acc: 0.5199
    Epoch 6/10000
    6120/6120 [==============================] - 1s - loss: 1.1042 - acc: 0.4953
    Epoch 7/10000
    6120/6120 [==============================] - 1s - loss: 1.0491 - acc: 0.4982
    Epoch 8/10000
    6120/6120 [==============================] - 1s - loss: 1.0066 - acc: 0.5065
    Epoch 9/10000
    6120/6120 [==============================] - 1s - loss: 0.9749 - acc: 0.5338
    Epoch 10/10000
    6120/6120 [==============================] - 1s - loss: 0.9456 - acc: 0.5696
    Epoch 11/10000
    6120/6120 [==============================] - 1s - loss: 0.9252 - acc: 0.5995
    Epoch 12/10000
    6120/6120 [==============================] - 1s - loss: 0.9111 - acc: 0.6106
    Epoch 13/10000
    6120/6120 [==============================] - 1s - loss: 0.8772 - acc: 0.6160
    Epoch 14/10000
    6120/6120 [==============================] - 1s - loss: 0.8517 - acc: 0.6245
    Epoch 15/10000
    6120/6120 [==============================] - 1s - loss: 0.8170 - acc: 0.6345
    Epoch 16/10000
    6120/6120 [==============================] - 1s - loss: 0.7850 - acc: 0.6428
    Epoch 17/10000
    6120/6120 [==============================] - 1s - loss: 0.7633 - acc: 0.6580
    Epoch 18/10000
    6120/6120 [==============================] - 4s - loss: 0.7375 - acc: 0.6717
    Epoch 19/10000
    6120/6120 [==============================] - 1s - loss: 0.7058 - acc: 0.6850
    Epoch 20/10000
    6120/6120 [==============================] - 1s - loss: 0.6787 - acc: 0.7018
    Epoch 21/10000
    6120/6120 [==============================] - 1s - loss: 0.6557 - acc: 0.7093
    Epoch 22/10000
    6120/6120 [==============================] - 1s - loss: 0.6304 - acc: 0.7208
    Epoch 23/10000
    6120/6120 [==============================] - 1s - loss: 0.6052 - acc: 0.7270
    Epoch 24/10000
    6120/6120 [==============================] - 1s - loss: 0.5848 - acc: 0.7371
    Epoch 25/10000
    6120/6120 [==============================] - 1s - loss: 0.5564 - acc: 0.7536
    Epoch 26/10000
    6120/6120 [==============================] - 1s - loss: 0.1787 - acc: 0.4163
    Epoch 27/10000
    6120/6120 [==============================] - 1s - loss: 1.1921e-07 - acc: 0.2500
    Epoch 28/10000
    6120/6120 [==============================] - 1s - loss: 1.1921e-07 - acc: 0.2500
    Epoch 29/10000
    6120/6120 [==============================] - 1s - loss: 1.1921e-07 - acc: 0.2500
    Epoch 30/10000
    6120/6120 [==============================] - 2s - loss: 1.1921e-07 - acc: 0.2500
    Epoch 31/10000
    6120/6120 [==============================] - 1s - loss: 1.1921e-07 - acc: 0.2500
    Epoch 32/10000
    6120/6120 [==============================] - 1s - loss: 1.1921e-07 - acc: 0.2500 ...

I'm guessing that this is due to the optimiser falling into a local minimum where it assigns all data to one category. How can I inhibit it from doing this?
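
To check whether it really is assigning everything to one class, one quick sanity check (a minimal sketch; X_train here stands for whatever 283-feature training matrix is passed to model.fit, which isn't shown above) is to look at the distribution of predicted classes after the drop:

    import numpy as np

    # Count how many examples land in each of the 4 predicted classes.
    # If the network has collapsed, nearly all predictions fall into one bin.
    preds = model.predict(X_train)
    print(np.bincount(preds.argmax(axis=1), minlength=4))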

Things I've tried (but didn't seem to stop this from happening):

  1. Using a different optimiser (adam); a minimal sketch of this swap is shown just after this list
  2. Ensuring that the training data included an equal number of examples from each category
  3. Increasing the volume of training data (currently at 6000)
  4. Varying the number of categories between 2 and 5
  5. Increasing the number of hidden layers in the network from 1 to 5
  6. Changing the width of the layers (from 50 to 500)
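
For reference, swapping in adam (point 1 above) was just a one-line change to the compile step. A minimal sketch, assuming the same model as above and Keras' default Adam parameters:

    from keras.optimizers import Adam

    # Same architecture as before, only the optimiser is swapped out.
    adam = Adam(lr=0.001)  # Keras' default learning rate for Adam
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])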

None of these helped. Any other ideas as to why this is happening and/or how to prevent it? Could it be a bug in Keras? Many thanks in advance for any suggestions.

Edit: The problem appears to have been solved by changing the final activation from sigmoid to softmax and adding a maxnorm(3) weight constraint to the final two hidden layers:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.constraints import maxnorm
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Dense(100, input_dim=npoints, init='normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(150, init='normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(200, init='normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(200, init='normal', activation='relu', W_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(200, init='normal', activation='relu', W_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(ncat, init='normal', activation='softmax'))
    sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])

Many thanks for the suggestions.

Asked by SWS
1 Answer

The problem lies in using sigmoid as the activation in the last layer. With sigmoid outputs, the final layer cannot be interpreted as a probability distribution over the classes for a given example (the outputs usually don't even sum to 1), so the optimization may lead to unexpected behaviour. In my opinion the maxnorm constraint is not necessary, but I strongly advise you to use categorical_crossentropy instead of an mse loss, as it is known to work better for this kind of classification problem.
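
As a minimal sketch of that advice (using the same Keras 1.x-style API as the question, with most of the hidden layers omitted for brevity), the output layer and compile step would look like this:

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Dense(100, input_dim=283, init='normal', activation='relu'))
    # softmax turns the 4 outputs into a proper probability distribution over
    # the classes, and categorical_crossentropy is the matching loss (not mse).
    model.add(Dense(4, init='normal', activation='softmax'))
    sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])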

Answered by Marcin Możejko
