I've implemented backpropagation as explained in this video. https://class.coursera.org/ml-005/lecture/51
This seems to have worked successfully, passing gradient checking and allowing me to train on MNIST digits.
However, I've noticed most other explanations of backpropagation calculate the output delta as
d = (a - y) * f'(z) http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm
whilst the video uses.
d = (a - y).
When I multiply my delta by the activation derivative (sigmoid derivative), I no longer end up with the same gradients as gradient checking (at least an order of magnitude in difference).
What allows Andrew Ng (video) to leave out the derivative of the activation for the output delta? And why does it work? Yet when adding the derivative, incorrect gradients are calculated?
EDIT
I have now tested with linear and sigmoid activation functions on the output, gradient checking only passes when I use Ng's delta equation (no sigmoid derivative) for both cases.
Found my answer here. The output delta does require multiplication by the derivative of the activation as in.
d = (a - y) * g'(z)
However, Ng is making use of the cross-entropy cost function which results in a delta that cancels the g'(z) resulting in the d = a - y calculation shown in the video. If a mean squared error cost function is used instead, the derivative of the activation function must be present.
When using Neural Networks it depends on the learning task how you need to design your network. A common approach for regression tasks is to use the tanh() activation functions for the input and all hidden layers and then the output layer uses an linear activation function (img taken from here)

I did' not find the source, but there was an theorem which states that using non-linear together with linear activaion functions allows you to better approximate the target functions. An example of using different activation functions can be found here and here.
The are many different kinds of acitvation function which can be used (img taken from here). If you look at the derivatives you can see that the derivative of the linar function equals to 1 which then will not be mentions anymore. This is also the case for Ng,s explanation, if you look at minute 12 in the video you see that he is talking about the outputlayer.

Concerning the Backpropagation-Algorithm
"When neuron is located in the output layer of the network, it is supplied with a desired response of its own. We may use e(n) = d(n) - y(n) to compute the error signal e(n) associated with this neuron; see Fig. 4.3. Having determined e(n), we find it a straightforward matter to compute the local gradient [...] When neuron is located in a hidden layer of the network, there is no specified desired response for that neuron. Accordingly, the error signal for a hidden neuron would have to be determined recursively and working backwards in terms of the error signals of all the neurons to which that hidden neuron is directly connected"
Haykin, Simon S., et al. Neural networks and learning machines. Vol. 3. Upper Saddle River: Pearson Education, 2009. p 159-164
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With