In my tests I'm seeing the performance cost of unmanaged to managed interop double when compiling for x64 instead of x86. What is causing this slowdown?
I'm testing release builds not running under the debugger. The loop is 100,000,000 iterations.
In x86 I'm measuring an average of 8ns per interop call, which seems to match what I've seen in other places. Unity's x86 interop is 8.2ns. A Microsoft article and Hans Passant both mention 7ns. 8ns is 28 clock cycles on my machine which seems at least reasonable, though I do wonder if it's possible to go faster.
In x64 I'm measuring an average of 17ns per interop call. I can't find anyone mentioning a difference between x86 and x64, or even mentioning which they are referring to when giving times. Unity's x64 interop clocks in around 5.9ns.
Regular function calls (including into an unmanaged C++ DLL) cost an average of 1.3ns. This doesn't change significantly between x86 and x64.
Below is my minimal C++/CLI code for measuring this, though I'm seeing the same numbers in my actual project that consists of a native C++ project calling into the managed side of a C++/CLI DLL.
#pragma managed
void
ManagedUpdate()
{
}
#pragma unmanaged
#include <wtypes.h>
#include <cstdint>
#include <cwchar>
struct ProfileSample
{
    static uint64_t frequency;
    uint64_t startTick;
    const wchar_t* name;
    int count;
    ProfileSample(const wchar_t* name_, int count_)
    {
        name = name_;
        count = count_;
        LARGE_INTEGER win32_startTick;
        QueryPerformanceCounter(&win32_startTick);
        startTick = win32_startTick.QuadPart;
    }
    ~ProfileSample()
    {
        LARGE_INTEGER win32_endTick;
        QueryPerformanceCounter(&win32_endTick);
        uint64_t endTick = win32_endTick.QuadPart;
        uint64_t deltaTicks = endTick - startTick;
        double nanoseconds = (double) deltaTicks / (double) frequency * 1000000000.0 / count;
        wchar_t buffer[128];
        swprintf(buffer, _countof(buffer), L"%s - %.4f ns\n", name, nanoseconds);
        OutputDebugStringW(buffer);
        if (!IsDebuggerPresent())
            MessageBoxW(nullptr, buffer, nullptr, 0);
    }
};
uint64_t ProfileSample::frequency = 0;
int CALLBACK
WinMain(HINSTANCE, HINSTANCE, PSTR, INT)
{
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency);
    ProfileSample::frequency = frequency.QuadPart;
    //Warm stuff up
    for ( size_t i = 0; i < 100; i++ )
        ManagedUpdate();
    const int num = 100000000;
    {
        ProfileSample p(L"ManagedUpdate", num);
        for ( size_t i = 0; i < num; i++ )
            ManagedUpdate();
    }
    return 0;
}
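A portable sketch of the same measurement, for anyone who wants to reproduce the per-call figure outside of Win32. NsPerCall repeats the arithmetic from ~ProfileSample, and MeasureNsPerCall times a callable with std::chrono::steady_clock instead of QueryPerformanceCounter. Both names are made up for this illustration and are not part of the test program above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Same arithmetic as ~ProfileSample: ticks -> seconds -> nanoseconds -> per call.
// (Hypothetical helper name.)
double NsPerCall(std::uint64_t deltaTicks, std::uint64_t ticksPerSecond, std::uint64_t calls)
{
    assert(ticksPerSecond > 0 && calls > 0);
    return static_cast<double>(deltaTicks) / static_cast<double>(ticksPerSecond)
           * 1000000000.0 / static_cast<double>(calls);
}

// Calls fn() `calls` times and returns the average cost of one call in nanoseconds.
// (Hypothetical helper name.)
template <typename Fn>
double MeasureNsPerCall(Fn fn, std::uint64_t calls)
{
    assert(calls > 0);
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (std::uint64_t i = 0; i < calls; ++i)
        fn();
    const std::chrono::duration<double, std::nano> elapsed = Clock::now() - start;
    return elapsed.count() / static_cast<double>(calls);
}
```

With this, the timed loop would shrink to something like MeasureNsPerCall(&ManagedUpdate, 100000000). An optimizer may remove a loop over an empty native callable entirely, so this is only meaningful for calls it cannot see through, such as the managed transition measured here.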
1) Why does x64 interop cost 17ns when x86 interop costs 8ns?
2) Is 8ns the fastest I can reasonably expect to go?
Edit 1
Additional information
CPU i7-4770k @ 3.5 GHz
Test case is a single C++/CLI project in VS2017.
Default Release configuration
Full optimization /O2
I've randomly played with settings like Favor Size or Speed, Omit Frame Pointers, Enable C++ Exceptions, and Security Check and none appear to change the x86/x64 discrepancy.
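For reference, this is how the timings translate into clock cycles, assuming the core actually runs at the 3.5 GHz listed above during the test (turbo boost or throttling would shift the numbers). NsToCycles is a helper name made up for this sketch.

```cpp
#include <cassert>

// Hypothetical helper: converts a duration to clock cycles.
// clockGHz is cycles per nanosecond (3.5 for a core running at 3.5 GHz).
double NsToCycles(double nanoseconds, double clockGHz)
{
    assert(nanoseconds >= 0.0 && clockGHz > 0.0);
    return nanoseconds * clockGHz;
}

// NsToCycles(8.0, 3.5)  -> 28 cycles         (x86 interop)
// NsToCycles(17.0, 3.5) -> 59.5 cycles       (x64 interop)
// NsToCycles(1.3, 3.5)  -> about 4.55 cycles (regular function call)
```

So the x64 path spends roughly 30 more cycles per call than the x86 path.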
Edit 2
I've stepped through the disassembly (not something I'm very familiar with at this point).
In x86 I see something along the lines of
call    ManagedUpdate
jmp     ptr [__mep@?ManagedUpdate@@$$FYAXXZ]
jmp     _IJWNOADThunkJumpTarget@0
In x64 I see
call    ManagedUpdate
jmp     ptr [__mep@?ManagedUpdate@@$$FYAXXZ]
        //Some jumping around that quickly leads to IJWNOADThunk::MakeCall:
call    IJWNOADThunk::FindThunkTarget
        //MakeCall uses the result from FindThunkTarget to jump into UMThunkStub:
FindThunkTarget is pretty heavy and it looks like most of the time is being spent there. So my working theory is that in x86 the thunk target is known and execution can more or less jump straight to it. But in x64 the thunk target is not known and a search process takes place to find it before being able to jump to it. I wonder why that is?
I have no recollection of ever giving a perf guarantee on code like this. 7 nanoseconds is the kind of perf you can expect on C++ Interop code, managed code calling native code. This is going the other way around, native code calling managed code, aka "reverse pinvoke".
You are definitely getting the slow flavor of this kind of interop. The "No AD" in IJWNOADThunk is the nasty little detail as far as I can see. This code did not get the micro-optimization love that is common in interop stubs. It is also highly specific to C++/CLI code. Nasty because it cannot assume anything about the AppDomain in which the managed code needs to run. In fact, it cannot even assume that the CLR is loaded and initialized.
Is 8ns the fastest I can reasonably expect to go?
Yes. You are in fact on the very low end with this measurement. Your hardware is a lot beefier than mine, I'm testing this on a mobile Haswell. I'm seeing between ~26 and 43 nanosec for x86, between ~40 and 46 nanosec for x64. So you are getting x3 better times, pretty impressive. Frankly, a bit too impressive but you are seeing the same code that I do so we must be measuring the same scenario.
Why does x64 interop cost 17ns when x86 interop costs 8ns?
This is not optimal code, the Microsoft programmer was very pessimistic about what corners he could cut. I have no real insight whether that was warranted, the comments in UMThunkStub.asm don't explain anything about choices.
There is not anything particularly special about reverse pinvoke. Happens all the time in, say, a GUI program that processes Windows messages. But that is done very differently, such code uses a delegate. Which is the way to get ahead and make this faster. Using Marshal::GetFunctionPointerForDelegate() is the key. I tried this approach:
using namespace System;
using namespace System::Runtime::InteropServices;
void* GetManagedUpdateFunctionPointer() {
    auto dlg = gcnew Action(&ManagedUpdate);
    auto tobereleased = GCHandle::Alloc(dlg);
    return Marshal::GetFunctionPointerForDelegate(dlg).ToPointer();
}
And used like this in the WinMain() function:
typedef void(__stdcall * testfuncPtr)();
testfuncPtr fptr = (testfuncPtr)GetManagedUpdateFunctionPointer();
//Warm stuff up
for (size_t i = 0; i < 100; i++) fptr();
//...
for ( size_t i = 0; i < num; i++ ) fptr();
Which made the x86 version a little faster. And the x64 version just as fast.
If you are going to use this approach then keep in mind that an instance method as the delegate target is faster than a static method in x64 code, the call stub has less work to do to rearrange the function arguments. And do beware I took a shortcut on the tobereleased variable: the GCHandle is never freed, so the delegate stays alive for the lifetime of the process. A GCHandle::Free() call might be preferred or necessary in a plug-in scenario.