Consider a multi-core ARM processor. One thread modifies a block of machine code that another thread may be executing concurrently. The modifying thread makes the following kinds of changes:
For the code-writer thread, I understand that it is enough to make the final write with std::memory_order_release in C++11.
However, it is not clear what to do on the executor-thread side (it's out of our control; we only control the machine code block we write). Should we place some instruction barrier before the first instruction of the code block being modified?
I don't think your update procedure is safe. Unlike x86, ARM's instruction caches aren't coherent with data caches, according to this self-modifying-code blog post.
The non-jump first instruction could still be cached, so another thread could enter the block. When execution reaches the 2nd i-cache line of the block, maybe that one is re-loaded and sees the partially-modified state.
There's also another problem: an interrupt (or context switch) could lead to an evict/reload of the cache line in a thread that's still in the middle of executing the old version. Rewriting a block of instructions in place requires first changing things so that no new threads will enter the block, and then being sure that execution in all other threads has exited it. This is an issue even with coherent I-cache (like x86), and even if the block of code fits in a single cache line.
I don't think there's any way to make rewriting in-place both safe and efficient at the same time on ARM.
Without coherent I-caches, you also can't guarantee that other threads will see code changes promptly with this design, without ridiculously expensive things like flushing blocks from L1I cache before running them every time.
With coherent I-cache (x86 style), you could just wait long enough for any possible delay in another thread finishing execution of the old version. Even if the block doesn't do any I/O or system calls, cache misses and context switches are possible. If it's running at realtime priority, especially with interrupts disabled, then the worst case is just cache misses, i.e. not very long. Otherwise I wouldn't bet on anything less than a timeslice or two (maybe 10ms) being really safe.
These slides have a nice overview of ARM caches, mostly focusing on ARMv8.
I'm actually going to quote another slide (about virtualizing ARM) for this bullet point summary, but I'd recommend reading the ELC2016 slides, not the virtualization slides.
Software needs to be aware of caches in a few cases: Executable code loading / generation
- Requires a D-cache clean to Point of Unification + I-cache invalidation
- Possible from userspace on ARMv8
- Requires a system call on ARMv7
D-cache can be invalidated with or without write-back (so make sure you clean/flush instead of discard!). You can and should trigger this by virtual address (instead of flushing a whole cache at once, and definitely don't use the flush by set/way stuff for this).
If you didn't clean your D-cache before invalidating I-cache, code-fetch could fetch directly from main memory into non-coherent I-cache after missing in L2. (Without allocating a stale line in any unified caches, which MESI would prevent because L1D has the line in Modified state). In any case, cleaning L1D to the PoU is architecturally required, and happens in the non-perf-critical writer thread anyway, so it's probably best just to do it instead of trying to reason whether it's safe not to for a specific ARM microarchitecture. See comments for @Notlikethat's efforts to clear up my confusion on this.
For more on clearing I-cache from user-space, see "How clear and invalidate ARM v7 processor cache from User Mode on Linux 2.6.35". GCC's __clear_cache() function and Linux's sys_cacheflush only work on memory regions that were mmapped with PROT_EXEC.
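As a rough sketch of the writer side (not verified on real ARM hardware here), the copy-then-sync step could look something like the following. It uses the GCC/Clang builtin __builtin___clear_cache; the function name install_code and its parameters are made up for illustration, and dst is assumed to point into memory that was mmapped with PROT_EXEC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy freshly generated machine code into `dst`, then
// ask the compiler runtime to synchronise the instruction cache with the
// bytes just written. `dst` is assumed to lie in a PROT_EXEC mapping.
inline void install_code(void* dst, const void* src, std::size_t len) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, len);

    char* begin = static_cast<char*>(dst);
    // GCC/Clang builtin covering the range [begin, begin + len).
    __builtin___clear_cache(begin, begin + len);
}
```

As far as I know the builtin compiles to nothing on x86, where I-cache is coherent. Note that this only covers the cache maintenance for newly written bytes; it does nothing about other threads that are still in the middle of executing an old version of the block.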
Where you were planning to have whole blocks of instrumentation code, put a single indirect jump (or a save/restore of lr and a function-call if you're going to have a branch anyway).  Each block has its own jump target variable which can be updated atomically.  The key thing here is that the destination for the indirect jump is data, so it's coherent with stores from the writing thread.
Since you update the pointer atomically, consumer threads either jump to the old or new block of code.
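A minimal C++11 sketch of the jump-target idea, assuming each block can be called through a plain function pointer. The names BlockFn, old_block, new_block, jump_target, run_patch_site and publish_block are all hypothetical, and the two ordinary functions merely stand in for generated code blocks.

```cpp
#include <atomic>
#include <cassert>

using BlockFn = int (*)();        // hypothetical signature of a code block

int old_block() { return 1; }     // stand-ins for two generated blocks
int new_block() { return 2; }

// One jump-target variable per patch site. It is ordinary data, so it is
// kept coherent between cores like any other atomic variable.
std::atomic<BlockFn> jump_target{old_block};

// Consumer side: load the current target once, then call through it.
int run_patch_site() {
    BlockFn fn = jump_target.load(std::memory_order_acquire);
    return fn();
}

// Producer side: publish a fully written block with a release store.
void publish_block(BlockFn fn) {
    assert(fn != nullptr);
    jump_target.store(fn, std::memory_order_release);
}
```

A consumer that calls run_patch_site() runs either the old block or the new one in its entirety, never a mix. The release/acquire pair only orders data accesses, though; it says nothing about what is sitting in an instruction cache.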
Now your problem is making sure that no core has a stale copy of the new location in its i-cache. Given the possibilities of context switches, that includes the current core, if context switches don't totally flush the i-cache.
If you use a large enough ring buffer of locations for new blocks, such that they sit unused for long enough to be evicted, it might be impossible in practice for there to ever be a problem. This sounds incredibly hard to prove, though.
If updates are infrequent compared to how often other threads run these dynamically-modified blocks, it's probably cheap enough to have the publishing thread trigger cache-flushes in other threads after writing a new block, but before updating the indirect-jump pointer to point to it.
Forcing other threads to flush their cache:
Linux 4.3 and later has a membarrier() system call that will run a memory barrier on all other cores in the system (usually with an inter-processor interrupt) before it returns (thus barriering all threads of all processes).  See also this blog post describing some use-cases (like user-space RCU) and mprotect() as an alternative.
It doesn't appear to support flushing instruction caches, though.  If you're building a custom kernel, you could consider adding support for a new cmd or flags value that means "flush instruction caches" instead of (or as well as) running a memory barrier.  Perhaps the flags value could be a virtual address?  This would only work on architectures where an address fits in an int, unless you tweak the system call API to look at the full register width of flags for your new cmd, but only the int value for the existing MEMBARRIER_CMD_SHARED.
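For reference, calling the existing barrier command from user-space might look roughly like this. It goes through syscall() directly, since a libc wrapper may not be available; the helper names membarrier_call and barrier_all_threads are made up, and this assumes Linux with kernel headers that provide <linux/membarrier.h>.

```cpp
#include <cassert>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical thin wrapper around the raw system call.
inline long membarrier_call(int cmd, int flags) {
    return syscall(SYS_membarrier, cmd, flags);
}

// Returns true if a memory barrier was run on all running threads, false if
// the kernel does not support (or does not permit) MEMBARRIER_CMD_SHARED.
inline bool barrier_all_threads() {
    // MEMBARRIER_CMD_QUERY returns a bitmask of supported commands,
    // or -1 on error.
    long supported = membarrier_call(MEMBARRIER_CMD_QUERY, 0);
    if (supported < 0 || !(supported & MEMBARRIER_CMD_SHARED)) {
        return false;
    }
    return membarrier_call(MEMBARRIER_CMD_SHARED, 0) == 0;
}
```

This gives you a data-side memory barrier only, so on its own it does not address stale instruction caches.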
Other than hacking membarrier(), you could send signals to the consumer threads, and have their signal handlers flush an appropriate region of i-cache. That's asynchronous, so the producer thread doesn't know when it's safe to reuse the old block.
IDK if munmap()ing it would work, but it's probably more expensive than necessary (because it has to modify page tables and invalidate the relevant TLB entries).
You might be able to do something by publishing a monotonically-increasing sequence number in a shared variable (with release semantics so it's ordered wrt. instruction writes). Then consumer threads check the sequence number against a thread-local highest-seen, and invalidate i-cache if there's new stuff. This could be per-block or global.
This doesn't directly solve the problem of detecting when the last thread running an old block has left it, unless those per-thread highest-seen counters aren't strictly thread-local: still one per thread, but readable by the producer thread.  It can scan them for the lowest sequence number in any thread, and if that's higher than the sequence number when a block was unreferenced, the block can now be reused.  Be careful of false sharing: don't use a global array of unsigned long for it, because you want each thread's private variable to be in a separate cache line with other thread-local stuff.
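Here is a rough sketch of that sequence-number scheme. Everything in it is an assumption made for illustration: the names, the fixed consumer count, the 64-byte cache-line size, and the empty invalidate_icache_placeholder() that stands in for the real I-cache invalidation. For simplicity it uses a padded array with one slot per consumer rather than true thread-local storage.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxConsumers = 4;   // hypothetical fixed thread count

// Global sequence number, bumped by the producer after each code write.
std::atomic<std::uint64_t> g_code_seq{0};

// One slot per consumer, padded to its own (assumed 64-byte) cache line.
struct alignas(64) SeenSlot {
    std::atomic<std::uint64_t> seen{0};
};
SeenSlot g_seen[kMaxConsumers];

// Placeholder: real code would invalidate the relevant I-cache range here.
void invalidate_icache_placeholder() {}

// Producer: call after the new block's bytes have been written.
std::uint64_t publish_new_code() {
    return g_code_seq.fetch_add(1, std::memory_order_release) + 1;
}

// Consumer: call before entering generated code.
// Returns true if new code had been published since the last check.
bool consumer_sync(std::size_t id) {
    assert(id < kMaxConsumers);
    std::uint64_t now = g_code_seq.load(std::memory_order_acquire);
    if (g_seen[id].seen.load(std::memory_order_relaxed) == now) {
        return false;
    }
    invalidate_icache_placeholder();
    g_seen[id].seen.store(now, std::memory_order_release);
    return true;
}

// Producer: a block unreferenced at sequence `retired_at` may be reused
// once every consumer has seen a higher sequence number.
bool can_reuse(std::uint64_t retired_at) {
    std::uint64_t lowest = UINT64_MAX;
    for (const SeenSlot& s : g_seen) {
        lowest = std::min(lowest, s.seen.load(std::memory_order_acquire));
    }
    return lowest > retired_at;
}
```

One consequence of this design is that a consumer thread which never runs generated code again never advances its counter, so it blocks reuse indefinitely. A real implementation would need some way to mark such threads as idle.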
Another possible technique: if there's only one consumer thread, the producer sets the jump target pointer to point to a block which doesn't change (so doesn't need to be i-cache flushed). That block (which runs in the consumer thread) executes a cache-flush for the appropriate line of i-cache and then modifies the jump-target pointer again, this time to point to the block that should be run every time.
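A sketch of that single-consumer trampoline, under the same caveats as before: all names (PendingBlock, site_target, pending_block, trampoline, producer_arm, run_site, steady_block) are hypothetical, steady_block is an ordinary function standing in for generated code, and the I-cache maintenance is done with the GCC/Clang builtin __builtin___clear_cache.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

using BlockFn = int (*)();

// Hypothetical descriptor the producer fills in for a newly written block.
struct PendingBlock {
    BlockFn     entry;   // entry point of the new block
    char*       begin;   // start of the block's bytes
    std::size_t len;     // size of the block in bytes
};

int steady_block() { return 7; }   // stand-in for the generated block

std::atomic<BlockFn>       site_target{nullptr};
std::atomic<PendingBlock*> pending_block{nullptr};

// Never-modified code, so it needs no I-cache maintenance itself. It runs
// in the consumer thread, syncs the I-cache for the new block, redirects
// the jump target to it, and then runs it.
int trampoline() {
    PendingBlock* p = pending_block.load(std::memory_order_acquire);
    assert(p != nullptr);
    __builtin___clear_cache(p->begin, p->begin + p->len);
    site_target.store(p->entry, std::memory_order_release);
    return p->entry();
}

// Producer: describe the new block, then point the site at the trampoline.
void producer_arm(PendingBlock* pb) {
    assert(pb != nullptr && pb->entry != nullptr);
    pending_block.store(pb, std::memory_order_release);
    site_target.store(trampoline, std::memory_order_release);
}

// Consumer: the indirect jump at the patch site.
int run_site() {
    BlockFn fn = site_target.load(std::memory_order_acquire);
    return fn();
}
```

The first call to run_site() after producer_arm() goes through the trampoline; every later call jumps straight to the new block.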
With multiple consumer threads, this gets a bit clunky: maybe each consumer has its own private jump-target pointer and the producer updates all of them?