I was looking through the code of Caffe's SigmoidCrossEntropyLoss layer and the docs, and I'm a bit confused. The docs list the loss function as the logit loss (I'd replicate it here, but without LaTeX the formula would be difficult to read. Check out the docs link, it's at the very top).
However, the code itself (Forward_cpu(...)) shows a different formula:
Dtype loss = 0;
for (int i = 0; i < count; ++i) {
    loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
        log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
}
top[0]->mutable_cpu_data()[0] = loss / num;
Is it because this is accounting for the sigmoid function having already been applied to the input?
However, even so, the (input_data[i] >= 0) snippets are confusing me as well. Those appear to be in place of the p_hat from the loss formula in the docs, which is supposed to be the prediction squashed by the sigmoid function. So why are they just taking a binary threshold? It's made even more confusing as this loss predicts [0,1] outputs, so (input_data[i] >= 0) will be a 1 unless it's 100% sure it's not.
Can someone please explain this to me?
The SigmoidCrossEntropyLoss layer in Caffe fuses two steps, Sigmoid and CrossEntropy, into a single piece of code. It takes the raw scores in input_data, not values that have already been passed through the sigmoid:
Dtype loss = 0;
for (int i = 0; i < count; ++i) {
    loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
        log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
}
top[0]->mutable_cpu_data()[0] = loss / num;
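Here (input_data[i] >= 0) is not a thresholded prediction standing in for p_hat. It is a 0/1 indicator that selects which of two algebraically equal forms of the same expression gets evaluated. Whether input_data[i] >= 0 or not, the code above is mathematically equivalent to the following: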
Dtype loss = 0;
for (int i = 0; i < count; ++i) {
    loss -= input_data[i] * (target[i] - 1) -
        log(1 + exp(-input_data[i]));
}
top[0]->mutable_cpu_data()[0] = loss / num;
This second version is what you get by applying the sigmoid to input_data, plugging the result into the cross-entropy formula, and simplifying algebraically.
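For anyone who wants to check the algebra, here is a sketch of the derivation. The symbols are my own shorthand, not Caffe's: x stands for input_data[i], t for target[i], and [x >= 0] is the 0/1 indicator.

```latex
\begin{align*}
\sigma(x) &= \frac{1}{1 + e^{-x}}, \qquad
\log \sigma(x) = -\log\left(1 + e^{-x}\right), \qquad
\log\left(1 - \sigma(x)\right) = -x - \log\left(1 + e^{-x}\right) \\
t \log \sigma(x) + (1 - t) \log\left(1 - \sigma(x)\right)
  &= x\,(t - 1) - \log\left(1 + e^{-x}\right) \\
  &= x\,t - \log\left(1 + e^{x}\right)
  \quad \text{(using } -x - \log(1 + e^{-x}) = -\log(1 + e^{x}) \text{)} \\
  &= x\,\bigl(t - [x \ge 0]\bigr) - \log\left(1 + e^{\,x - 2x[x \ge 0]}\right)
\end{align*}
```

The second line is the simplified form, useful when x >= 0. The third line is the same quantity rewritten, useful when x < 0. The last line picks between them with the indicator, so the exponent is always -|x|. The loss is the negative of this quantity, summed over i, which is why the loop uses `loss -=`.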
The first version (the one Caffe uses) is more numerically stable and avoids overflow. Thanks to the (input_data[i] >= 0) terms, the argument of exp is always -|input_data[i]|, so exp never has to be evaluated at a large positive value. The simplified version computes exp(-input_data[i]), which overflows when input_data[i] is a large negative number. That's why you see that code in Caffe.
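To see the difference concretely, here is a minimal standalone sketch that puts the two forms side by side. It is not Caffe code: the function names `naive_sigmoid_ce` and `stable_sigmoid_ce` are made up for illustration, it uses `double` instead of Caffe's `Dtype`, and it omits the final division by `num`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: the simplified form, x * (t - 1) - log(1 + exp(-x)).
// exp(-x) overflows to +inf once x is a large negative number
// (below roughly -709 for double, much sooner for float).
double naive_sigmoid_ce(const std::vector<double>& x,
                        const std::vector<double>& t) {
    double loss = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        loss -= x[i] * (t[i] - 1.0) - std::log(1.0 + std::exp(-x[i]));
    }
    return loss;
}

// Hypothetical helper: the same arithmetic as the Caffe loop quoted above.
// The indicator makes the exponent equal to -|x|, so exp stays in (0, 1].
double stable_sigmoid_ce(const std::vector<double>& x,
                         const std::vector<double>& t) {
    double loss = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double pos = (x[i] >= 0.0) ? 1.0 : 0.0;  // 0/1 indicator
        loss -= x[i] * (t[i] - pos) -
                std::log(1.0 + std::exp(x[i] - 2.0 * x[i] * pos));
    }
    return loss;
}
```

For moderate inputs the two functions agree to within rounding error. For a single score of -1000 with target 0, the true loss is essentially 0: `stable_sigmoid_ce({-1000.0}, {0.0})` returns 0, while `naive_sigmoid_ce({-1000.0}, {0.0})` returns infinity because exp(1000) overflows.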