After reading Mike Ash's excellent blog post "Friday Q&A 2014-05-09: When an Autorelease Isn't" on ARC, I decided to look into the details of the optimisations that ARC applies to speed up the retain/release process. The trick I'm referring to is called "fast autorelease", in which the caller and callee cooperate to keep the returned object out of the autorelease pool. This works best in situations like the following:
- (id) myMethod {
    id obj = [MYClass new];
    return [obj autorelease];
}
- (void) mainMethod {
   id obj = [[self myMethod] retain];
   // Do something with obj
   [obj release];
}
that can be optimised by skipping the autorelease pool completely:
- (id) myMethod {
    id obj = [MYClass new];
    return obj;
}
- (void) mainMethod {
   id obj = [self myMethod];
   // Do something with obj
   [obj release];
}
The way this optimisation is implemented is very interesting. I quote from Mike's post:
"There is some extremely fancy and mind-bending code in the Objective-C runtime's implementation of autorelease. Before actually sending an autorelease message, it first inspects the caller's code. If it sees that the caller is going to immediately call objc_retainAutoreleasedReturnValue, it completely skips the message send. It doesn't actually do an autorelease at all. Instead, it just stashes the object in a known location, which signals that it hasn't sent autorelease at all."
So far so good. The implementation for x86_64 in NSObject.mm is quite straightforward. The code inspects the machine code located at the return address of objc_autoreleaseReturnValue, looking for a call to objc_retainAutoreleasedReturnValue.
static bool callerAcceptsFastAutorelease(const void * const ra0)
{
    const uint8_t *ra1 = (const uint8_t *)ra0;
    const uint16_t *ra2;
    const uint32_t *ra4 = (const uint32_t *)ra1;
    const void **sym;
    //1. Navigate the DYLD stubs to get to the real pointer of the function to be called
    // 48 89 c7    movq  %rax,%rdi
    // e8          callq symbol
    if (*ra4 != 0xe8c78948) {
        return false;
    }
    ra1 += (long)*(const int32_t *)(ra1 + 4) + 8l;
    ra2 = (const uint16_t *)ra1;
    // ff 25       jmpq *symbol@DYLDMAGIC(%rip)
    if (*ra2 != 0x25ff) {
        return false;
    }
    ra1 += 6l + (long)*(const int32_t *)(ra1 + 2);
    sym = (const void **)ra1;
    //2. Check that the code to be called belongs to objc_retainAutoreleasedReturnValue
    if (*sym != objc_retainAutoreleasedReturnValue)
    {
        return false;
    }
    return true;
}
But when it comes to ARM, I just can't understand how it works. The code looks like this (I've simplified a little bit):
static bool callerAcceptsFastAutorelease(const void *ra)
{
    // 07 70 a0 e1    mov r7, r7
    if (*(uint32_t *)ra == 0xe1a07007) {
        return true;
    }
    return false;
}
It looks like the code identifies the presence of objc_retainAutoreleasedReturnValue not by looking for a call to that specific function, but by looking for a special no-op instruction, mov r7, r7.
Diving into LLVM source code I found the following explanation:
"The implementation of objc_autoreleaseReturnValue sniffs the instruction stream following its return address to decide whether it's a call to objc_retainAutoreleasedReturnValue. This can be prohibitively expensive, depending on the relocation model, and so on some targets it instead sniffs for a particular instruction sequence. This functions returns that instruction sequence in inline assembly, which will be empty if none is required."
I was wondering why that is so on ARM.
Having the compiler emit a marker so that a specific implementation of a library can find it sounds like strong coupling between the compiler and the library code. Why can't the "sniffing" be implemented the same way as on the x86_64 platform?
IIRC (it has been a while since I've written ARM assembly), ARM's addressing modes don't really allow for direct addressing across the full address space. The instructions used to do addressing (loads, stores, etc.) are limited in bit width, so they cannot reach an arbitrary address directly.
Thus, any kind of "go to this arbitrary address and check for that value, then use that value to go look over there" will be significantly slower on ARM, as you have to use indirect addressing, which involves math, and math eats CPU cycles.
Having the compiler emit a no-op instruction that can easily be checked eliminates the need for indirection through the DYLD stubs.
At least, I'm pretty sure that is what is going on. There are two ways to know for sure: take the code for those two functions, compile it with -Os for x86_64 and for ARM, and see what the resulting instruction streams look like (i.e. both functions on each architecture), or wait until Greg Parker shows up to correct this answer.