
How can I ensure memory is allocated near an address?

I am writing a JIT recompiler for x86_64, and at times the emitted code needs to call a function from the compiled binary.

Due to ASLR, the .text segment of my program is placed at some random address.

Can I allocate my JIT's code cache of 32 MiB in such a way that its address is always close enough to the .text segment that I can safely emit relative calls instead of absolute calls? I am on Linux.

asked Dec 30 '25 by Offtkp

1 Answer

Near duplicate: Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code - my answer suggests allocating nearby to enable jmp/call rel32, and suggests some ways to do that.

One good method to do that is to use an array in the BSS, instead of dynamically allocating. Since you have a known size limit that's not huge (32 MiB) for your JIT region, that's perfect.

#include <sys/mman.h>
#include <stdio.h>     // for perror
#include <stdalign.h>  // for alignas instead of _Alignas, if you're not using C23

// page aligned to make sure it doesn't share a page with anything else
// And for easy use of mprotect (or Windows VirtualProtect)
alignas(4096) unsigned char JIT_buf[32ULL * 1024*1024];

void jit_init(void){
   // 31 MiB of code, for example, leaving 1 MiB of data READ|WRITE without exec
   if (mprotect(JIT_buf, 31ULL * 1024*1024, PROT_READ|PROT_WRITE|PROT_EXEC) != 0)
       perror("mprotect");
}

In the standard code models for x86-64, all static code+data for an executable or shared object is within ±2 GiB of each other, making anywhere reachable from anywhere with rel32 addressing. In the non-PIE small code model, it all goes in the low 2 GiB of virtual address space (so mmap(MAP_32BIT) would work). But modern PIE executables and shared libraries get mapped outside the low 2 GiB.

If your size isn't so small but has an upper bound below a couple GiB, you might still use an array at the end of the BSS. Perhaps with __attribute__((section(".bss.jit"))), and use a custom linker script if necessary to get that linked at the end of the BSS, if the default one doesn't match all .bss* names. That avoids hurting dTLB locality for other global variables. In runs where you never use more than say 100 MiB, it's basically as if the array were only 100 MiB, and the part you use is contiguous with other globals.

Memory you never touch doesn't really cost anything, whether from mmap or the BSS, since Linux does overcommit. (It's zeroed lazily by the kernel on first access, in the page-fault handler.)


If this is in a shared library, make it static, or __attribute__((visibility("hidden"))) (perhaps via -fvisibility=hidden to make that the default), so it's truly in this shared library's BSS, not the main executable's.

If this isn't a shared library, another option is to allocate memory with sbrk, but only if you never use malloc / new, or avoid letting them use sbrk. That will give you memory near the end of the BSS (of the main executable). Not quite contiguous: ASLR randomizes the break, but on my system (Linux kernel 6.7.2) the distance from a global variable to an sbrk allocation varies from 1 to 31 MiB (same on Godbolt), which still easily puts it within +-2GiB rel32 range of static code/data.

But this is maybe not as future-proof or portable; future or other current systems could randomly put the break far away from static code/data. If you do this, you should include a check so you can tell the user to report what system doesn't work the way you expected.

Glibc malloc uses sbrk for small allocations. According to @zwol on Why using both malloc/calloc/realloc and brk functions will results in undefined behavior? , "The implementation of malloc may assume that no code other than itself calls sbrk with a nonzero argument." It might for example think it owns all the memory from its last call to the new break, which would include your allocation. Or it might corrupt its bookkeeping in more subtle ways. I don't know for sure that glibc malloc would actually break, but the relevant standards (probably POSIX) don't require it to work.

If you tune Glibc malloc to never use brk / sbrk, you'd probably be ok. By default it only uses mmap for large allocations, so it can give them back to the OS immediately on free, even if other allocations were made in the meantime. Or use a different malloc library that never uses sbrk at all. I've heard that glibc's malloc isn't wonderful and third-party mallocs can be more performant for some use-cases.

Note that G++ new/delete internally uses the same allocator as malloc.


As always, don't forget to call __builtin___clear_cache(&JIT_buf[start], &JIT_buf[end]) in GNU C after you store some bytes to it, before you call a function pointer that uses that array as machine code. On x86 it doesn't actually clear cache, just blocks optimizations like dead-store elimination (example: https://godbolt.org/z/5671x3MYn). That probably won't be a problem since calling mprotect means a pointer to the buffer has "escaped" into code the optimizer can't see, but it's a good idea to do that between storing code and calling it. If nothing else, it might ease porting to architectures that do need something in asm (most non-x86). See How to get c code to execute hex machine code? for more details.


If none of that works, you might put an array of function pointers somewhere in your JIT buffer, as "static" data that code there can access with RIP+rel32 addressing modes like call [rel32]. Or if you pass your functions a register pointing to some address, call [reg+disp8] or call [reg+disp32].

I think (or hope) this is what Daniel A. White was trying to suggest when he mentioned a vtable. Normal C++ vtables imply multiple levels of indirection starting with a pointer stored in an object. You don't want that, just an array of function pointers. The GOT would be a better analogy; call [rip+rel32] with the GOT entry is how library function calls compile with gcc -fno-plt.

Or as a one-off, mov rax, imm64 / call rax always works, but is indeed somewhat less efficient than a direct call. Call an absolute pointer in x86 machine code. Depending on the surrounding code and how often it's run, it might be worse than call [rel32] with a function pointer that has to get loaded as data. If a couple cache lines of function pointers are frequently used so they stay hot in cache, that's probably better than a 10-byte movabs.

Of course direct call rel32 is always the most efficient; smallest code-size and it has the address available to the front-end earliest in case the BTB didn't predict the existence and destination of the branch before it fully decoded. (Or for the indirect branches, before it executed.)

It's probably hard to microbenchmark the difference caused by fewer indirect branches to predict, since that will most likely only matter across a whole large program. Any microbenchmark would have everything predicting perfectly. Better front-end throughput from direct calls might or might not make a measurable difference, depending on what you're JITing them into.

answered Jan 01 '26 by Peter Cordes


