I was looking at this benchmark of a custom std::function implementation: https://github.com/PacktPublishing/Hands-On-Design-Patterns-with-CPP-Second-Edition/blob/main/Chapter06/09_function.C
I tried to replicate the example and noticed something odd. I declared this simple function: __attribute__((noinline)) auto function_no_inline(int a, int b, int c, int d) -> int { return a + b + c + d; }. Despite the attribute, invoking it took the same time as invoking the inline function, whereas invoking a function defined in a different compilation unit took much longer. It seems that the attribute was ignored for some reason. Why? The arguments are obtained from rand().
Benchmark                              Time             CPU   Iterations
-----------------------------------------------------------------------
BM_invoke_function                  1.35 ns         1.35 ns    504544141
BM_invoke_function_no_inline       0.271 ns        0.271 ns   2584830443
BM_invoke_function_inline          0.270 ns        0.270 ns   2580073503
BM_invoke_std_function              2.21 ns         2.17 ns    324669753
This is my code. It links against the google-benchmark library:
#include <benchmark/benchmark.h>
#include <cstdlib>     // rand()
#include <functional>

// Defined in a different compilation unit.
auto function(int a, int b, int c, int d) -> int;

__attribute__((noinline)) auto function_no_inline(int a, int b, int c, int d) -> int { return a + b + c + d; }

inline auto function_inline(int a, int b, int c, int d) { return a + b + c + d; }

template <typename Callable>
auto invoke(int a, int b, int c, int d, const Callable& callable)
{
    return callable(a, b, c, d);
}

// Benchmarks
void BM_invoke_function(benchmark::State& state)
{
    int a{rand()};
    int b{rand()};
    int c{rand()};
    int d{rand()};
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(invoke(a, b, c, d, function));
        benchmark::ClobberMemory();
    }
}

void BM_invoke_function_no_inline(benchmark::State& state)
{
    int a{rand()};
    int b{rand()};
    int c{rand()};
    int d{rand()};
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(invoke(a, b, c, d, function_no_inline));
        benchmark::ClobberMemory();
    }
}

void BM_invoke_function_inline(benchmark::State& state)
{
    int a{rand()};
    int b{rand()};
    int c{rand()};
    int d{rand()};
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(invoke(a, b, c, d, function_inline));
        benchmark::ClobberMemory();
    }
}

void BM_invoke_std_function(benchmark::State& state)
{
    int a{rand()};
    int b{rand()};
    int c{rand()};
    int d{rand()};
    std::function<int(int, int, int, int)> std_function{function};
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(invoke(a, b, c, d, std_function));
        benchmark::ClobberMemory();
    }
}

BENCHMARK(BM_invoke_function);
BENCHMARK(BM_invoke_function_no_inline);
BENCHMARK(BM_invoke_function_inline);
BENCHMARK(BM_invoke_std_function);
BENCHMARK_MAIN();
I put your example into Compiler Explorer, and I see that function_inline is inlined, but function_no_inline is indeed not:
BM_invoke_function_inline(benchmark::State&):
        push    r15
        push    r14
        [...]
        lea     edx, [r14+r15]
        add     edx, ebp
        add     edx, DWORD PTR [rsp+12]
BM_invoke_function_no_inline(benchmark::State&):
        push    r15
        push    r14
        [...]
        call    function_no_inline(int, int, int, int)
I'm not sure if I guessed your compilation setup correctly (e.g. -std=c++23 -O3), but either I can't reproduce your results, or the explanation does not involve noinline.
That said, noinline only goes so far: it prevents inlining, but it does not prevent several other interprocedural optimizations that could be affecting your situation (though apparently not, if we trust my Compiler Explorer results). The more robust method is to use noipa, which explicitly asks GCC to treat the function as a standalone unit. According to the GCC documentation, noipa implies noinline, noclone and no_icf, and it is also meant to suppress any other optimization that relies on interprocedural analysis.
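To make the noipa suggestion concrete, here is a minimal sketch of how the function from the question could be declared. The macro name BENCH_NOIPA and the function name function_noipa are made up for illustration. The version check assumes noipa is available from GCC 8 onward, which is my recollection rather than something I verified for every release; compilers that do not know noipa fall back to plain noinline here.

```cpp
// Sketch only: BENCH_NOIPA is a hypothetical helper macro, not part of any library.
// Assumption: GCC 8 and later understand noipa. Clang also defines __GNUC__ but
// (as far as I know) does not implement noipa, so it gets noinline instead.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#define BENCH_NOIPA __attribute__((noipa))
#elif defined(__GNUC__)
#define BENCH_NOIPA __attribute__((noinline))
#else
#define BENCH_NOIPA
#endif

// Same body as function_no_inline from the question, with the stronger attribute.
BENCH_NOIPA auto function_noipa(int a, int b, int c, int d) -> int
{
    return a + b + c + d;
}
```

Passing function_noipa to invoke in place of function_no_inline would be the only change needed in the benchmark. Whether the timings then move toward the separate-compilation-unit case is something you would have to measure.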
From GCC function attribute docs:
noinline
This function attribute prevents a function from being considered for inlining. It also disables some other interprocedural optimizations; it’s preferable to use the more comprehensive noipa attribute instead if that is your goal.
Even if a function is declared with the noinline attribute, there are optimizations other than inlining that can cause calls to be optimized away if it does not have side effects, although the function call is live. To keep such calls from being optimized away, put
asm ("");
in the called function, to serve as a special side effect.
noipa
Disable interprocedural optimizations between the function with this attribute and its callers, as if the body of the function is not available when optimizing callers and the callers are unavailable when optimizing the body. [...]
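The asm ("") advice from the noinline entry quoted above can be sketched like this. The function name function_no_inline_side_effect is hypothetical. This is only meant to show where the empty asm statement goes; I have not measured whether it changes the benchmark numbers from the question.

```cpp
// Sketch only: function_no_inline_side_effect is a made-up name for illustration.
// Requires a compiler that accepts GNU-style attributes and asm (GCC, Clang).
__attribute__((noinline)) auto function_no_inline_side_effect(int a, int b, int c, int d) -> int
{
    // Empty asm statement: emits no instructions, but per the GCC docs it
    // serves as a special side effect, so calls are not optimized away.
    asm("");
    return a + b + c + d;
}
```

The result is still the plain sum of the four arguments; only the compiler's view of the function changes.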