I've always been curious about the cost of jumps in assembly.
cmp ecx, edx
je SOME_LOCATION # What's the cost of this jump?
Does it need to do a search in a lookup table for each jump, or how does it work?
Originally (e.g. on the 8086) the cost of a jump wasn't much different from the cost of a mov.
Later CPUs added caches, which meant some jumps were faster (because the code they jump to is in the cache) and some jumps were slower (because the code they jump to isn't in the cache).
Even later CPUs added "out of order" execution, where conditional branches (e.g. je SOME_LOCATION) would have to wait until the flags from "previous instructions that happen to be executed in parallel" became known.
This means that a sequence like
mov esi, edi
cmp ecx, edx
je SOME_LOCATION
can be slower than rearranging it to
cmp ecx, edx
mov esi, edi
je SOME_LOCATION
to increase the chance that the flags will already be known by the time the je is reached.
Even later CPUs added speculative execution. In this case, for conditional branches the CPU just takes a guess at where it will branch to before it actually knows (e.g. before the flags are known), and if it guesses wrong it'll just pretend that it didn't execute the wrong instructions. More specifically, the speculatively executed instructions are tagged at the start of the pipeline and held at the end of the pipeline (at retirement) until the CPU knows if they can be committed to visible state or if they have to be discarded.
After that things just got more complicated, with fancier methods of doing branch prediction, additional "branch target" buffers, etc.
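To get a feel for what a misprediction costs on your own machine, you can time the same loop twice: once with a branch that always goes the same way, and once with a branch driven by a pseudo-random bit so the predictor is wrong roughly half the time. The sketch below is only a rough illustration (NASM syntax, x86-64 Linux, timed with rdtsc and no serialization, so treat the numbers as approximate); the iteration count, the xorshift32 generator, the register choices and the exit-via-syscall are my own assumptions, not something from this answer, and the random case also pays for the extra generator instructions, so the difference only approximates the misprediction penalty.

; Build/run (assumption): nasm -f elf64 branchy.asm && ld branchy.o -o branchy
global _start
section .text

; time_loop: ecx = iteration count, esi = xorshift32 seed
; (seed 0 means the measured branch is never taken, i.e. trivially predictable).
; Returns elapsed rdtsc ticks in rax.
time_loop:
    rdtsc
    shl     rdx, 32
    or      rax, rdx
    mov     r8, rax                 ; start timestamp
.loop:
    test    esi, esi
    jz      .decide                 ; no seed -> skip the generator
    mov     eax, esi                ; xorshift32: state ^= state << 13
    shl     eax, 13
    xor     esi, eax
    mov     eax, esi                ;            state ^= state >> 17
    shr     eax, 17
    xor     esi, eax
    mov     eax, esi                ;            state ^= state << 5
    shl     eax, 5
    xor     esi, eax
.decide:
    test    esi, 1
    jz      .skip                   ; <-- the branch being measured
    add     r10, 1                  ; trivial work on the taken path
.skip:
    dec     ecx
    jnz     .loop
    rdtsc
    shl     rdx, 32
    or      rax, rdx
    sub     rax, r8                 ; elapsed ticks
    ret

_start:
    mov     ecx, 100000000
    xor     esi, esi                ; predictable: branch never taken
    call    time_loop
    mov     r12, rax

    mov     ecx, 100000000
    mov     esi, 0x12345            ; ~50% taken, hard to predict
    call    time_loop
    mov     r13, rax                ; compare r12 vs r13 in a debugger,
                                    ; or print them as an exercise

    mov     eax, 60                 ; exit(0)
    xor     edi, edi
    syscall

On typical modern x86 CPUs you should see the second measurement come out noticeably higher per iteration, which is the predictor paying for its wrong guesses.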
Far jumps that change the code segment are more expensive. In real mode it's not so bad, because the CPU mostly just does "CS.base = value * 16" when CS is changed. For protected mode it's a table lookup (to find the GDT or LDT entry), decoding the entry, deciding what to do based on what kind of entry it is, then a pile of protection checks. For long mode it's vaguely similar. All of this adds more uncertainty (e.g. will the table entry be in the cache?).
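To make the protected-mode case concrete, here is a rough sketch of the classic real-mode-to-protected-mode switch in a boot-sector style stub (NASM syntax). The GDT layout, the selectors 0x08/0x10 and the build/run commands are conventional choices I've assumed, not anything specific to this answer; the interesting line is jmp 0x08:pm_entry, which is exactly the kind of far jump that forces the CPU to fetch the GDT entry for the new CS, decode it and run its protection checks before execution continues.

; Build/run (assumption): nasm -f bin pmjump.asm -o pmjump.img
;                         qemu-system-i386 -drive format=raw,file=pmjump.img
org 0x7C00
bits 16

start:
    cli                             ; no interrupts while CS/SS are in flux
    xor     ax, ax
    mov     ds, ax
    lgdt    [gdt_descriptor]        ; tell the CPU where the GDT lives
    mov     eax, cr0
    or      eax, 1                  ; set CR0.PE (enter protected mode)
    mov     cr0, eax
    jmp     dword 0x08:pm_entry     ; far jump: selector 0x08 = first code
                                    ; descriptor; this is the "table lookup
                                    ; + protection checks" case

bits 32
pm_entry:
    mov     ax, 0x10                ; flat data selector
    mov     ds, ax
    mov     ss, ax
    mov     esp, 0x7C00
.hang:
    hlt
    jmp     .hang

align 8
gdt_start:
    dq 0                            ; null descriptor
    dq 0x00CF9A000000FFFF           ; 0x08: flat 4 GiB code, ring 0
    dq 0x00CF92000000FFFF           ; 0x10: flat 4 GiB data, ring 0
gdt_descriptor:
    dw gdt_descriptor - gdt_start - 1   ; limit
    dd gdt_start                        ; base

times 510 - ($ - $$) db 0
dw 0xAA55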
On top of all of this there are things like TLB misses. For example, call [indirectAddress] can cause a TLB miss at indirectAddress, then a TLB miss at the top of the stack, then a TLB miss at the new instruction pointer; each of those TLB misses can cost a few hundred cycles.
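As a concrete illustration of those three accesses, here is a minimal indirect call (NASM syntax, x86-64 Linux); the label names func_ptr and do_nothing are made up for the example. Each commented access can land on a different page, and each of those pages can miss in the TLB independently.

; Build/run (assumption): nasm -f elf64 indirect.asm && ld indirect.o -o indirect
global _start

section .data
func_ptr:   dq do_nothing           ; the pointer read by access #1 (data page)

section .text
do_nothing:                         ; access #3 fetched code from this page
    ret                             ; returns via the address pushed by the call

_start:
    call    [rel func_ptr]          ; access #1: load the target from func_ptr
                                    ; access #2: push the return address (stack page)
                                    ; access #3: fetch code at the target (code page)
    mov     eax, 60                 ; exit(0)
    xor     edi, edi
    syscall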
Mostly, the cost of a jump can be anything from 0 cycles (for a correctly predicted jump) to maybe 1000 cycles, depending on which CPU it is, what kind of jump it is, what is in caches, what branch prediction predicts, cache/TLB misses, how fast/slow RAM is, and anything I may have forgotten.