I'm trying to allocate some memory on Linux with the sys_brk system call. Here is what I tried:
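BYTES_TO_ALLOCATE equ 0x08
section .text
global _start
_start:
    mov rax, 12                 ; system call number 12 = sys_brk
    mov rdi, BYTES_TO_ALLOCATE  ; first argument
    syscall
    mov rax, 60                 ; system call number 60 = sys_exit (rdi still holds 8)
    syscall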
As per the Linux calling convention, I expected the return value (a pointer to the allocated memory) to be in the rax register. I ran this in gdb, and around the sys_brk system call I saw the following register contents:
Before syscall
rax 0xc 12
rbx 0x0 0
rcx 0x0 0
rdx 0x0 0
rsi 0x0 0
rdi 0x8 8
After syscall
rax 0x401000 4198400
rbx 0x0 0
rcx 0x40008c 4194444 ; <---- What does this value mean?
rdx 0x0 0
rsi 0x0 0
rdi 0x8 8
I do not quite understand the value in the rcx register in this case. Which register should I use as the pointer to the beginning of the 8 bytes I allocated with sys_brk?
The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.
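Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. On success it returns the new break; on failure it returns the current, unchanged break. Your rdi=8 is not a valid break address, so the call failed and the 0x401000 in rax is just the current break: nothing has been allocated yet. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.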
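A minimal sketch of that pointer-in, pointer-out interface, assuming x86-64 Linux and NASM syntax as in the question. It first asks for the current break (address 0 can never be satisfied, so the kernel just returns the current break), then requests current break + size. The constant names, labels, the stored value 42, and the exit statuses are my own choices for illustration, not anything from the question.

```nasm
BYTES_TO_ALLOCATE equ 0x08
SYS_BRK  equ 12
SYS_EXIT equ 60

section .text
global _start
_start:
    ; brk(0): address 0 is never a valid break, so the kernel leaves the
    ; break alone and returns its current value in rax.
    mov eax, SYS_BRK
    xor edi, edi
    syscall
    mov rbx, rax                        ; rbx = old break = start of the new block

    ; brk(old_break + size): ask the kernel to move the break up.
    lea rdi, [rax + BYTES_TO_ALLOCATE]
    mov eax, SYS_BRK
    syscall                             ; clobbers rcx and r11, keeps rdi
    cmp rax, rdi                        ; success: rax == requested new break
    jne .failed                         ; failure: rax == unchanged old break

    mov qword [rbx], 42                 ; the 8 bytes at [rbx] are now usable

    xor edi, edi                        ; exit status 0
    jmp .exit
.failed:
    mov edi, 1                          ; exit status 1
.exit:
    mov eax, SYS_EXIT
    syscall
```

Assuming the file is saved as brk.asm (a hypothetical name), it should build with nasm -f elf64 brk.asm followed by ld -o brk brk.o. The pointer to the allocated bytes is the old break kept in rbx, not the value returned by the second call, which is the end of the block.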
The other interesting part of your question is:
I do not quite understand the value in the rcx register in this case
You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.
syscall doesn't do any loads or stores; it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.
It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses). In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points. Mostly the entry points from 32-bit processes, not from syscall in a 64-bit process, though.)
Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:
- sets RCX=RIP (the address of the instruction after the syscall) and R11=RFLAGS, so it's impossible for the kernel to even see the original values of those regs from before you executed syscall.
- masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it's done swapgs and setting rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF so rep movs / stos go upward even if user-space had used std.
  Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).
- jumps to the configured syscall entry point: RIP is loaded from the IA32_LSTAR MSR, and CS (and SS) are set from the selector field in the IA32_STAR MSR. The old CS value isn't saved anywhere, I think.
It doesn't do anything else. The kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.
So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.
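The clobbering can be checked directly. The sketch below, again assuming x86-64 Linux and NASM, makes a harmless getpid system call and then compares rcx with the address of the instruction that follows syscall. The label names and the exit statuses are hypothetical choices for this illustration, and the check assumes no tracer rewrote the saved rcx while the process was inside the kernel.

```nasm
section .text
global _start
_start:
    mov eax, 39                     ; __NR_getpid: takes no arguments, has no side effects
    syscall
after_syscall:                      ; address of the instruction following syscall
    lea rdx, [rel after_syscall]
    xor edi, edi                    ; exit status 0 if rcx holds that address
    cmp rcx, rdx
    je .exit
    mov edi, 1                      ; exit status 1 otherwise
.exit:
    mov eax, 60                     ; __NR_exit
    syscall
```

After running it, echo $? in the shell shows the exit status. The practical consequence is that any value you need from rcx or r11 has to be moved to another register or pushed before syscall, because it is gone afterwards.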