I'm trying to implement gradient calculation for neural networks using backpropagation. I cannot get it to work with cross entropy error and rectified linear unit (ReLU) as activation.
I managed to get my implementation working for squared error with sigmoid, tanh and ReLU activation functions. Cross entropy (CE) error with sigmoid activation gradient is computed correctly. However, when I change activation to ReLU - it fails. (I'm skipping tanh for CE as it retuls values in (-1,1) range.)
Is it because of the behavior of log function at values close to 0 (which is returned by ReLUs approx. 50% of the time for normalized inputs)? I tried to mitiage that problem with:
log(max(y,eps))
but it only helped to bring error and gradients back to real numbers - they are still different from numerical gradient.
I verify the results using numerical gradient:
num_grad = (f(W+epsilon) - f(W-epsilon)) / (2*epsilon)
The following matlab code presents a simplified and condensed backpropagation implementation used in my experiments:
function [f, df] = backprop(W, X, Y)
% W - weights
% X - input values
% Y - target values
act_type='relu';    % possible values: sigmoid / tanh / relu
error_type = 'CE';  % possible values: SE / CE
N=size(X,1); n_inp=size(X,2); n_hid=100; n_out=size(Y,2);
w1=reshape(W(1:n_hid*(n_inp+1)),n_hid,n_inp+1);
w2=reshape(W(n_hid*(n_inp+1)+1:end),n_out, n_hid+1);
% feedforward
X=[X ones(N,1)];
z2=X*w1'; a2=act(z2,act_type); a2=[a2 ones(N,1)];
z3=a2*w2'; y=act(z3,act_type);
if strcmp(error_type, 'CE')   % cross entropy error - logistic cost function
    f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
else % squared error
    f=0.5*sum(sum((y-Y).^2));
end
% backprop
if strcmp(error_type, 'CE')   % cross entropy error
    d3=y-Y;
else % squared error
    d3=(y-Y).*dact(z3,act_type);
end
df2=d3'*a2;
d2=d3*w2(:,1:end-1).*dact(z2,act_type);
df1=d2'*X;
df=[df1(:);df2(:)];
end
function f=act(z,type) % activation function
switch type
    case 'sigmoid'
        f=1./(1+exp(-z));
    case 'tanh'
        f=tanh(z);
    case 'relu'
        f=max(0,z);
end
end
function df=dact(z,type) % derivative of activation function
switch type
    case 'sigmoid'
        df=act(z,type).*(1-act(z,type));
    case 'tanh'
        df=1-act(z,type).^2;
    case 'relu'
        df=double(z>0);
end
end
Edit
After another round of experiments, I found out that using a softmax for the last layer:
y=bsxfun(@rdivide, exp(z3), sum(exp(z3),2));
and softmax cost function:
f=-sum(sum(Y.*log(y)));
make the implementaion working for all activation functions including ReLU.
This leads me to conclusion that it is the logistic cost function (binary clasifier) that does not work with ReLU:
f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
However, I still cannot figure out where the problem lies.
The ANN is implemented using the cross entropy error function in the training stage. The cross entropy function is proven to accelerate the backpropagation algorithm and to provide good overall network performance with relatively short stagnation periods.
Non-linear activation functions solve the following limitations of linear activation functions: They allow backpropagation because now the derivative function would be related to the input, and it's possible to go back and understand which weights in the input neurons can provide a better prediction.
The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight via the chain rule, computing the gradient layer by layer, and iterating backward from the last layer to avoid redundant computation of intermediate terms in the chain rule.
Cross-entropy loss is calculated by taking the difference between our prediction and actual output. We then multiply that value with `-y * ln(y)`. This means we take a negative number, raise it to the power of the logarithm of y (which will be positive), and then subtract this from our original calculation.
Every squashing function sigmoid, tanh and softmax (in the output layer) means different cost functions. Then makes sense that a RLU (in the output layer) does not match with the cross entropy cost function. I will try a simple square error cost function to test a RLU output layer.
The true power of RLU is in the hidden layers of a deep net since it not suffer from gradient vanishing error.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With