I have a simple cost function which I want to optimize using scipy.optimize.minimize:

opt_solution = scipy.optimize.minimize(costFunction, theta, args=(training_data,),
                                       method='L-BFGS-B', jac=True,
                                       options={'maxiter': 100})

where costFunction is the function to be optimized and theta contains the parameters to be optimized. Inside costFunction, I print the value of the cost function. But the parameter maxiter seems to have no effect: whether I set it to 10 or to 100000, the time it takes is the same. I was also expecting the cost function value to be printed maxiter times. So I feel that maxiter has no effect. What might be the problem?
The cost function is:
def costFunction(self, theta, input):
    """ Extract weights and biases from 'theta' input """
    W1 = theta[self.limit0 : self.limit1].reshape(self.hidden_size, self.visible_size)
    W2 = theta[self.limit1 : self.limit2].reshape(self.visible_size, self.hidden_size)
    b1 = theta[self.limit2 : self.limit3].reshape(self.hidden_size, 1)
    b2 = theta[self.limit3 : self.limit4].reshape(self.visible_size, 1)

    """ Compute output layers by performing a feedforward pass
        Computation is done for all the training inputs simultaneously """
    hidden_layer = self.sigmoid(numpy.dot(W1, input) + b1)
    output_layer = self.sigmoid(numpy.dot(W2, hidden_layer) + b2)

    """ Compute intermediate difference values using Backpropagation algorithm """
    diff = output_layer - input
    sum_of_squares_error = 0.5 * numpy.sum(numpy.multiply(diff, diff)) / input.shape[1]
    weight_decay = 0.5 * self.lamda * (numpy.sum(numpy.multiply(W1, W1)) +
                                       numpy.sum(numpy.multiply(W2, W2)))
    cost = sum_of_squares_error + weight_decay

    # Error terms ('deltas') for the output and hidden layers, used in the
    # gradient computation below
    del_out = numpy.multiply(diff, numpy.multiply(output_layer, 1 - output_layer))
    del_hid = numpy.multiply(numpy.dot(numpy.transpose(W2), del_out),
                             numpy.multiply(hidden_layer, 1 - hidden_layer))

    """ Compute the gradient values by averaging partial derivatives
        Partial derivatives are averaged over all training examples """
    W1_grad = numpy.dot(del_hid, numpy.transpose(input))
    W2_grad = numpy.dot(del_out, numpy.transpose(hidden_layer))
    b1_grad = numpy.sum(del_hid, axis=1)
    b2_grad = numpy.sum(del_out, axis=1)

    W1_grad = W1_grad / input.shape[1] + self.lamda * W1
    W2_grad = W2_grad / input.shape[1] + self.lamda * W2
    b1_grad = b1_grad / input.shape[1]
    b2_grad = b2_grad / input.shape[1]

    """ Transform numpy matrices into arrays """
    W1_grad = numpy.array(W1_grad)
    W2_grad = numpy.array(W2_grad)
    b1_grad = numpy.array(b1_grad)
    b2_grad = numpy.array(b2_grad)

    """ Unroll the gradient values and return as 'theta' gradient """
    theta_grad = numpy.concatenate((W1_grad.flatten(), W2_grad.flatten(),
                                    b1_grad.flatten(), b2_grad.flatten()))

    # Update counter value
    self.counter += 1
    print("Index", self.counter, "cost", cost)

    return [cost, theta_grad]
maxiter gives the maximum number of iterations that scipy will try before giving up on improving the solution. But it may very well be satisfied with a solution and stop earlier.
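You can confirm this by looking at the OptimizeResult that minimize returns. A minimal sketch, reusing the same costFunction, theta and training_data from your call:

import scipy.optimize

opt_solution = scipy.optimize.minimize(costFunction, theta, args=(training_data,),
                                       method='L-BFGS-B', jac=True,
                                       options={'maxiter': 100})

print(opt_solution.nit)      # iterations actually performed (often far below maxiter)
print(opt_solution.nfev)     # number of cost/gradient evaluations
print(opt_solution.message)  # why the solver stopped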
If you look at the docs for minimize when using the 'l-bfgs-b' method, notice there are tolerance parameters you can pass as options (ftol and gtol; factr plays the same role if you call fmin_l_bfgs_b directly) that can also cause the iteration to stop before maxiter is reached.
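For example, a minimal sketch with the same names as above (the tolerance values here are only illustrative): tightening ftol and gtol makes L-BFGS-B keep iterating longer before it declares convergence, loosening them makes it stop even sooner.

opt_solution = scipy.optimize.minimize(costFunction, theta, args=(training_data,),
                                       method='L-BFGS-B', jac=True,
                                       options={'maxiter': 100,
                                                'ftol': 1e-12,
                                                'gtol': 1e-8})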
In simple cases like yours, especially if your cost function also provides the gradient (as indicated by jac=True in your call), convergence typically happens in the first few iterations, hence way before the maxiter limit is reached.
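Also note that the print inside costFunction counts function evaluations, not iterations: the line search may evaluate the cost several times per iteration. If you want to see the actual iteration count, you can pass a callback; a minimal sketch with the same names as above:

iteration = [0]

def report(xk):
    # Called once per L-BFGS-B iteration with the current parameter vector
    iteration[0] += 1
    print("iteration", iteration[0])

opt_solution = scipy.optimize.minimize(costFunction, theta, args=(training_data,),
                                       method='L-BFGS-B', jac=True,
                                       callback=report, options={'maxiter': 100})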