Here I'm attempting to implement a neural network with a single hidden layer to classify two training examples. This network utilizes the sigmoid activation function.
The layer and weight dimensions are as follows:
X : 2X4
w1 : 2X3
l1 : 4X3
w2 : 2X4
Y : 2X3
I'm running into a problem in backpropagation where the matrix dimensions do not match. This code:
import numpy as np
M = 2
learning_rate = 0.0001
X_train = np.asarray([[1,1,1,1] , [0,0,0,0]])
Y_train = np.asarray([[1,1,1] , [0,0,0]])
X_trainT = X_train.T
Y_trainT = Y_train.T
A2_sig = 0;
A1_sig = 0;
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))  
    return s
def forwardProp() : 
    global A2_sig, A1_sig;
    w1=np.random.uniform(low=-1, high=1, size=(2, 2))
    b1=np.random.uniform(low=1, high=1, size=(2, 1))
    w1 = np.concatenate((w1 , b1) , axis=1)
    A1_dot = np.dot(X_trainT , w1)
    A1_sig = sigmoid(A1_dot).T
    w2=np.random.uniform(low=-1, high=1, size=(4, 1))
    b2=np.random.uniform(low=1, high=1, size=(4, 1))
    w2 = np.concatenate((w2 , b2) , axis=1)
    A2_dot = np.dot(A1_sig, w2)
    A2_sig = sigmoid(A2_dot)
def backProp() : 
    global A2_sig;
    global A1_sig;
    error1 = np.dot((A2_sig - Y_trainT).T, A1_sig / M)
    print(A1_sig)
    print(error1)
    error2 = A1_sig.T - error1
forwardProp()
backProp()
Returns error :
ValueError                                Traceback (most recent call last)
<ipython-input-605-5aa61e60051c> in <module>()
     45 
     46 forwardProp()
---> 47 backProp()
     48 
     49 # dw2 = np.dot((Y_trainT - A2_sig))
<ipython-input-605-5aa61e60051c> in backProp()
     42     print(A1_sig)
     43     print(error1)
---> 44     error2 = A1_sig.T - error1
     45 
     46 forwardProp()
ValueError: operands could not be broadcast together with shapes (4,3) (2,4) 
How do I compute the error for the previous layer?
Update :
import numpy as np
M = 2
learning_rate = 0.0001
X_train = np.asarray([[1,1,1,1] , [0,0,0,0]])
Y_train = np.asarray([[1,1,1] , [0,0,0]])
X_trainT = X_train.T
Y_trainT = Y_train.T
A2_sig = 0;
A1_sig = 0;
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))  
    return s
A1_sig = 0;
A2_sig = 0;
def forwardProp() : 
    global A2_sig, A1_sig;
    w1=np.random.uniform(low=-1, high=1, size=(4, 2))
    b1=np.random.uniform(low=1, high=1, size=(2, 1))
    A1_dot = np.dot(X_train , w1) + b1
    A1_sig = sigmoid(A1_dot).T
    w2=np.random.uniform(low=-1, high=1, size=(2, 3))
    b2=np.random.uniform(low=1, high=1, size=(2, 1))
    A2_dot = np.dot(A1_dot , w2) + b2
    A2_sig = sigmoid(A2_dot)
    return(A2_sig)
def backProp() : 
    global A2_sig;
    global A1_sig;
    error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)
    error2 = error1 - A1_sig
    return(error1)
print(forwardProp())
print(backProp())
Returns error :
ValueError                                Traceback (most recent call last)
<ipython-input-664-25e99255981f> in <module>()
     47 
     48 print(forwardProp())
---> 49 print(backProp())
<ipython-input-664-25e99255981f> in backProp()
     42 
     43     error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)
---> 44     error2 = error1.T - A1_sig
     45 
     46     return(error1)
ValueError: operands could not be broadcast together with shapes (2,3) (2,2) 
Have I set the matrix dimensions incorrectly?
Your first weight matrix, w1, should be of shape (n_features, layer_1_size), so when you multiply an input, X of shape (m_examples, n_features) by w1, you get an (m_examples, layer_1_size) matrix. This gets run through the activation of layer 1 and then fed into layer 2 which should have a weight matrix of shape (layer_1_size, output_size), where output_size=3 since you are doing multi-label classification for 3 classes. As you can see, the point is to convert each layer's input into a shape that fits the number of neurons in that layer, or in other words, each input to a layer must feed into every neuron in that layer.
I wouldn't take the transpose of your layer inputs as you have it. Instead, I would shape the weight matrices as described so you can compute np.dot(X, w1), etc.
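To make the shapes concrete, here is a minimal sketch of a forward pass with examples stored as rows. The hidden width (layer_1_size = 2), the zero-initialised biases, and the seeded random generator are my assumptions for illustration, not something taken from your code:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Data from the question: 2 examples, 4 features, 3 output labels.
X = np.array([[1, 1, 1, 1], [0, 0, 0, 0]], dtype=float)  # (m_examples, n_features)
Y = np.array([[1, 1, 1], [0, 0, 0]], dtype=float)        # (m_examples, output_size)

m_examples, n_features = X.shape
layer_1_size = 2             # assumed hidden width
output_size = Y.shape[1]

w1 = rng.uniform(-1, 1, size=(n_features, layer_1_size))   # (4, 2)
b1 = np.zeros((1, layer_1_size))                           # broadcast over rows
w2 = rng.uniform(-1, 1, size=(layer_1_size, output_size))  # (2, 3)
b2 = np.zeros((1, output_size))

A1 = sigmoid(np.dot(X, w1) + b1)   # (m_examples, layer_1_size)
A2 = sigmoid(np.dot(A1, w2) + b2)  # (m_examples, output_size)

print(A1.shape, A2.shape)
```

With this layout no transposes are needed in the forward pass, and A2 has the same shape as Y, so A2 - Y is well defined.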
It also looks like you are not handling your biases correctly. When we compute Z = np.dot(X, w1) + b1, b1 should have shape (1, layer_1_size) so that it is broadcast and added to every row of the product of X and w1. This will not happen if you append b1 to your weight matrix as an extra column, as you have it. Rather, you can add a column of ones to your input matrix and an additional row to your weight matrix, so the bias terms sit in that row of your weight matrix and the ones in your input ensure they get added to every example. In this setup you don't need separate b1, b2 terms.
X_train = np.c_[X_train, np.ones(m_examples)]
and remember to add one more row to your weights, so w1 should have shape (n_features+1, layer_1_size).
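If it helps, here is a small check that folding the bias into the weight matrix gives the same pre-activation as adding a separate bias. The names X_aug and w1_aug and the hidden width of 2 are made up for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

X_train = np.array([[1, 1, 1, 1], [0, 0, 0, 0]], dtype=float)
m_examples, n_features = X_train.shape
layer_1_size = 2  # assumed hidden width

w1 = rng.uniform(-1, 1, size=(n_features, layer_1_size))
b1 = rng.uniform(-1, 1, size=(1, layer_1_size))

# Version 1: separate bias, broadcast across the rows (examples).
Z_separate = np.dot(X_train, w1) + b1

# Version 2: a column of ones on the input, bias as an extra weight row.
X_aug = np.c_[X_train, np.ones(m_examples)]  # (m_examples, n_features + 1)
w1_aug = np.vstack([w1, b1])                 # (n_features + 1, layer_1_size)
Z_folded = np.dot(X_aug, w1_aug)

print(np.allclose(Z_separate, Z_folded))
```

The same trick applies to the second layer: append a column of ones to the hidden activations and one extra row to w2.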
Update for backpropagation:
The goal of backpropagation is to compute the gradient of your error function with respect to your weights and biases and use each result to update each weights matrix and each bias vector.
So you need dE/dw2, dE/db2, dE/dw1, and dE/db1 so you can apply the updates:
w2 <- w2 - learning_rate * dE/dw2
b2 <- b2 - learning_rate * dE/db2
w1 <- w1 - learning_rate * dE/dw1
b1 <- b1 - learning_rate * dE/db1
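As a rough sketch, the four updates can be applied in one place. The helper name gradient_step and the dictionary layout (gradient keys are the parameter name prefixed with "d") are my own convention, not part of NumPy or of your code:

```python
import numpy as np

def gradient_step(params, grads, learning_rate):
    """Return new parameters after one plain gradient-descent update.

    Assumes grads holds a key "d" + name for every name in params.
    """
    return {name: value - learning_rate * grads["d" + name]
            for name, value in params.items()}

# Toy values, chosen only to make the arithmetic easy to follow.
params = {"w2": np.ones((2, 3)), "b2": np.zeros((1, 3))}
grads = {"dw2": np.full((2, 3), 0.5), "db2": np.full((1, 3), -1.0)}

updated = gradient_step(params, grads, learning_rate=0.1)
print(updated["w2"][0, 0], updated["b2"][0, 0])
```

Each gradient must have exactly the same shape as the parameter it updates, which is a handy sanity check on your backprop code.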
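Since you are doing multilabel classification, you should be using binary crossentropy loss, which in its usual form (summed over the labels k and averaged over the m examples i) is:
E = -(1/m) * sum_i sum_k [ y_ik * log(a_ik) + (1 - y_ik) * log(1 - a_ik) ]
where a_ik is the sigmoid output for example i and label k.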
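For reference, one common way to write that loss in NumPy, summing over the labels and averaging over the examples. The function name and the eps clipping (to avoid log(0)) are my additions:

```python
import numpy as np

def binary_crossentropy(A2, Y, eps=1e-12):
    """Binary cross-entropy, summed over labels and averaged over examples.

    A2 and Y both have shape (m_examples, output_size).
    """
    A2 = np.clip(A2, eps, 1 - eps)  # keep the logs finite
    m = Y.shape[0]
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

Y = np.array([[1, 1, 1], [0, 0, 0]], dtype=float)
A2 = np.full(Y.shape, 0.5)  # a maximally unsure prediction for every label

loss = binary_crossentropy(A2, Y)
print(loss)
```

With every output at 0.5, each of the 3 labels contributes log(2) per example, so the value here is 3 * log(2).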

You can compute dE/dw2 using the chain rule:
dE/dw2 = (dE/dA2) * (dA2/dZ2) * (dZ2/dw2)
I am using Z2 for your A2_dot since the activation hasn't been applied yet, and I'm using A2 for your A2_sig.
See Notes on Backpropagation [pdf] for a detailed derivation for crossentropy loss with sigmoid activation. This gives a pointwise derivation, however, whereas we are looking for a vectorized implementation, so you will have to do a bit of work to figure out the correct layout for your matrices. There is also no explicit bias vector, unfortunately.
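The key simplification for sigmoid plus binary crossentropy is that the first two chain-rule factors collapse to dE/dZ2 = (A2 - Y) / m. If you want to convince yourself, a finite-difference check like the following should agree with that expression. This is only a sketch, and it assumes the loss is summed over labels and averaged over the m examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_from_z(Z2, Y):
    # Binary cross-entropy as a function of the pre-activation Z2.
    A2 = sigmoid(Z2)
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / Y.shape[0]

rng = np.random.default_rng(2)
Y = np.array([[1, 1, 1], [0, 0, 0]], dtype=float)
Z2 = rng.normal(size=Y.shape)  # arbitrary pre-activations

# Analytic gradient from the chain rule.
analytic = (sigmoid(Z2) - Y) / Y.shape[0]

# Central finite differences, one entry of Z2 at a time.
numeric = np.zeros_like(Z2)
h = 1e-6
for idx in np.ndindex(*Z2.shape):
    Z_plus, Z_minus = Z2.copy(), Z2.copy()
    Z_plus[idx] += h
    Z_minus[idx] -= h
    numeric[idx] = (loss_from_z(Z_plus, Y) - loss_from_z(Z_minus, Y)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-6))
```

The same kind of check works for dw1, db1, dw2 and db2 and is a cheap way to catch a misplaced transpose.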
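The expression you have for error1 is close, but I would call it dw2, use Y_train directly instead of taking the transpose twice, and order the product so the result has the same shape as w2. Assuming A1 has shape (m, layer_1_size) and A2 and Y_train have shape (m, 3):
dw2 = (1/m) * np.dot(A1.T, A2 - Y_train)
And you also need db2, which sums over the examples and should be:
db2 = (1/m) * np.sum(A2 - Y_train, axis=0, keepdims=True)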
You will have to apply the chain rule further to get dw1 and db1, and I'll leave that to you, but there is a nice derivation in Week 3 of the Neural Networks and Deep Learning Coursera Course.
I can't say much about the line you are getting an error on besides that I don't think you should have that calculation in your backprop code, so it makes sense that the dimensions don't match. You might be thinking of the gradient at the output, but I can't think of any similar expression involving A1 for backprop in this network.
This article has a very nice implementation of a one hidden layer neural net in numpy. It does use softmax at the output, but it has sigmoid activations in the hidden layer and otherwise the difference in calculation is minimal. It should help you calculate dw1 and db1 for the hidden layer. Specifically, look at the expression for delta1 in the section titled "A neural network in practice".
Converting their calculation to the notation we're using, and using a sigmoid at the output instead of softmax, it should look like:
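# Shapes assume examples are rows: X_train (m, n_features), A1 (m, layer_1_size), A2 (m, 3)
dZ2 = A2 - Y_train                                # (m, 3)
dZ1 = np.dot(dZ2, w2.T) * A1 * (1 - A1)           # (m, layer_1_size), element-wise product
dw2 = (1/m) * np.dot(A1.T, dZ2)                   # same shape as w2
db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)  # (1, 3)
dw1 = (1/m) * np.dot(X_train.T, dZ1)              # same shape as w1
db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)  # (1, layer_1_size)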
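Putting the pieces together, here is a minimal end-to-end sketch with examples stored as rows. The hidden width of 2, the learning rate of 0.5 (much larger than the 0.0001 in your code, only so progress shows up on a two-example dataset), the step count, and the helper names are all my assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w1, b1, w2, b2):
    A1 = sigmoid(np.dot(X, w1) + b1)   # (m, layer_1_size)
    A2 = sigmoid(np.dot(A1, w2) + b2)  # (m, output_size)
    return A1, A2

def backward(X, Y, A1, A2, w2):
    m = X.shape[0]
    dZ2 = A2 - Y                                 # sigmoid + cross-entropy
    dZ1 = np.dot(dZ2, w2.T) * A1 * (1 - A1)      # element-wise product
    dw2 = np.dot(A1.T, dZ2) / m
    db2 = np.sum(dZ2, axis=0, keepdims=True) / m
    dw1 = np.dot(X.T, dZ1) / m
    db1 = np.sum(dZ1, axis=0, keepdims=True) / m
    return dw1, db1, dw2, db2

def loss(A2, Y, eps=1e-12):
    A2 = np.clip(A2, eps, 1 - eps)  # keep the logs finite
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / Y.shape[0]

rng = np.random.default_rng(0)
X_train = np.array([[1, 1, 1, 1], [0, 0, 0, 0]], dtype=float)
Y_train = np.array([[1, 1, 1], [0, 0, 0]], dtype=float)

layer_1_size = 2  # assumed hidden width
w1 = rng.uniform(-1, 1, size=(X_train.shape[1], layer_1_size))
b1 = np.zeros((1, layer_1_size))
w2 = rng.uniform(-1, 1, size=(layer_1_size, Y_train.shape[1]))
b2 = np.zeros((1, Y_train.shape[1]))

learning_rate = 0.5  # assumed; chosen for this toy problem
losses = []
for step in range(2000):
    A1, A2 = forward(X_train, w1, b1, w2, b2)
    losses.append(loss(A2, Y_train))
    dw1, db1, dw2, db2 = backward(X_train, Y_train, A1, A2, w2)
    w1 -= learning_rate * dw1
    b1 -= learning_rate * db1
    w2 -= learning_rate * dw2
    b2 -= learning_rate * db2

print(losses[0], losses[-1])
print(np.round(A2, 2))
```

Every gradient has the same shape as the parameter it updates, so none of the update lines needs a transpose.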