Let's say we have a processor with two cores (C0 and C1) and a cache line starting at address k that is owned by C0 initially. If C1 issues a store instruction on a 8-byte slot at line k, will that affect the throughput of the following instructions that are being executed on C1? 
The intel optimziation manual has the following paragraph
When an instruction writes data to a memory location [...], the processor ensures that it has the line containing this memory location is in its L1d cache [...]. If the cache line is not there, it fetches from the next levels using a RFO request [...] RFO and storing the data happens after instruction retirement. Therefore, the store latency usually does not affect the store instruction itself
With reference to the following code,
// core c0
foo();
line(k)->at(i)->store(kConstant, std::memory_order_release);
bar();
baz();
The quote from the intel manual makes me assume that in the code above, the execution of the code will look as if the store was essentially a no-op, and would not impact the latency between the end of foo() and the start of bar().  In contrast, for the following code, 
// core c0
foo();
bar(line(k)->at(i)->load(std::memory_order_acquire));
baz();
The latency between the end of foo() and the start of bar() would be impacted by the load, as the following code has the result of the load as a dependency.
This question is mostly concerned with how intel processors (in the Broadwell family or newer) work for the case above. Also, in particular, for how C++ code that looks like the above gets compiled down to assembly for those processors.
A cache miss occurs either because the data was never placed in the cache, or because the data was removed (“evicted”) from the cache by either the caching system itself or an external application that specifically made that eviction request.
The instruction cache contains predecode bits to demarcate individual instructions to improve decode speed. One of the challenges with the IA-32 and Intel 64 ISA is that instructions are variable in length. In other words, the size of the instruction is not known until the instruction has been partially decoded.
The instruction cache stores the individual instructions for the CPU of the currently executing program. It is the program itself. Main memory is often too slow (or has too much latency) to be able to feed the CPU its next instruction every time it is ready for one.
Generally speaking, for a store that is not soon read by subsequent code, the store doesn't directly delay that subsequent code on any modern out-of-order processor, including Intel.
For example:
foo()
*x = y;
bar()
If foo() doesn't modify x or y, and bar doesn't load from *x, the store is independent and may start executing even before foo() is complete (or even before it starts), and bar() may execute before the store commits to the cache, and bar() may even execute while foo() is running, etc.
While there is little direct impact, it doesn't meant there aren't indirect impacts and indeed the store may dominate the execution time.
If the store misses in cache, it may tie up off-core resources while the cache miss is satisfied. It also usually prevent subsequent stores from draining, which may be a bottleneck: if the store buffer fills up, the front-end blocks entirely and new instructions no longer enter the scheduler.
Finally, everything depends on the details of the surrounding code, as usual. If that sequence is run repeatedly, and foo() and bar() are short, the misses related to the store may dominate the runtime. After all, buffering can't hide the cost of an unlimited number of stores. At some point you'll be bound by the intrinsic throughput of the stores.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With