If I understand branching correctly (x86), the processor will sometimes speculatively take a code path and perform the instructions and 'cancel' the results of the wrong path. What if the operation in the wrong codepath is very expensive, like a memory read that causes a cache miss or some expensive math operation? Will the processor try to perform something expensive ahead of time? How would a processor typically handle this?
if (likely) {
    // do something lightweight (addition, subtraction, etc.)
} else {
    // do something expensive (cache-miss, division, sin/cos/tan etc.)
}
Speculative execution is one of the main techniques used by most modern high performance processors to improve performance. The concept behind speculative execution is that instructions are executed ahead of knowing that they are required.
The goal is to reduce the total execution time and improve the overall CPU performance.
Modern superscalar processors rely heavily on speculative execution for performance.
A disadvantage of speculative execution is that the results may be discarded if it is discovered that the instructions were not needed after all. The most common form of speculative execution involves the control flow of a program.
tl:dr: the impact isn't as bad as you think, because the CPU no longer has to wait for slow things, even if it doesn't cancel them. Almost everything is heavily pipelined, so many operations can be in flight at once. The mis-speculated operations don't prevent new ones from starting.
Current x86 designs do not speculate on both sides of a branch at once. They only speculate down the predicted path.
I'm not aware of any specific microarchitecture that does speculate along both ways of a branch in any circumstances, but that doesn't mean there aren't any. I've mostly only read up on x86 microarchitectures (see the tag wiki for links to Agner Fog's microarch gude). I'm sure it's been suggested in academic papers, and maybe even implemented in a real design somewhere.
I'm not sure exactly what happens in current Intel and AMD designs when a branch mispredict is detected while a cache-miss load or store is already executing pending, or a divide is occupying the divide unit. Certainly out-of-order execution doesn't have to wait for the result, because no future uops depend on it.
On uarches other than P4, bogus uops in the ROB/scheduler are discarded when a mispredict is detected. From Agner Fog's microarch doc, talking about P4 vs. other uarches:
the misprediction penalty is unusually high for two reasons ... [long pipeline and] ... bogus μops in a mispredicted branch are not discarded before they retire. A misprediction typically involves 45 μops. If these μops are divisions or other time-consuming operations then the misprediction can be extremely costly. Other microprocessors can discard μops as soon as the misprediction is detected so that they don't use execution resources unnecessarily.
uops that are currently occupying execution units are another story:
Almost all execution units except the divider are fully pipelined, so another multiply, shuffle, or whatever can start without cancelling an in-flight FP FMA. (Haswell: 5 cycle latency, two execution units each capable of one per clock throughput, for a total sustained throughput of one per 0.5c. This means max throughput requires keeping 10 FMAs in flight at once, typically with 10 vector accumulators). Divide is interesting, though. Integer divide is many uops, so a branch mispredict will at least stop issuing them. FP div is only a single uop instruction, but not fully pipelined, esp. in older CPUs. It would be useful to cancel an FP div that was tieing up the divide unit, but IDK if that's possible. If adding the ability to cancel would have slowed down the normal case, or cost more power, then it would probably be left out. It's a rare special case which probably wasn't worth spending transistors on.
x87 fsin or something is a good example of a really expensive instruction.  I didn't notice that until I went back to re-read the question.  It's microcoded, so even though it has a latency of 47-106 cycles (Intel Haswell), it's also 71-100 uops.  A branch mispredict would stop the frontend from issuing the remaining uops, and cancel all the ones that are queued, like I said for integer division.  Note that real libm implementations typically don't use fsin and so on because they're slower and less accurate than what can be achieved in software (even without SSE), IIRC.
For a cache-miss, it might be cancelled, potentially saving bandwidth in L3 cache (and maybe main memory). Even if not, the instruction no longer has to retire, so the ROB won't fill up waiting for it to finish. That's normally why cache misses hurt OOO execution so much, but here it's at worst just tieing up a load or store buffer. Modern CPUs can have many outstanding cache misses in flight at once. Often code doesn't make this possible because future operations depend on the result of a load that missed in cache (e.g. pointer chasing in a linked list or tree), so multiple memory operations can't be pipelined. Even if a branch mispredict doesn't cancel much of an in-flight memory op, it avoids most of the worst effects.
I have heard of putting a ud2 (illegal instruction) at the end of a block of code to stop instruction prefetch from triggering a TLB miss when the block is at the end of a page.  I'm not sure when this technique is necessary.  Maybe if there's a conditional branch that's always actually taken?  That doesn't make sense, you'd just use an unconditional branch.  There must be something I'm not remembering about when you'd do that.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With