I have seen online a few times it has been mentioned that C++ can be ever faster using templates.
Could someone explain, including at a low level why this is exactly? I always presumed such a "nice" feature would have overhead like most useful concepts.
I am really intrigued by this from a ultra low latency perspective!
A common example is sorting.
In C, qsort takes a pointer to a comparison function. Generally speaking, there will be one copy of the qsort code, which is not inlined. It will make a call through the pointer to the comparison routine -- this of course is also not inlined.
In C++, std::sort is a template, and it can take a functor object as comparator. There is a different copy of std::sort for each different type used as a comparator. Assuming you use a functor class with overloaded operator(), then the call to the comparator can easily be inlined into this copy of std::sort.
So, templates give you more inlining because there are more copies of the sort code, each of which can inline a different comparator. Inlining is quite a good optimization, and sort routines do a lot of comparisons, so you can often measure std::sort running faster than an equivalent qsort. The cost of this is the chance of much larger code -- if your program uses a lot of different comparators then you get a lot of different copies of the sort routine, each with a different comparator baked into it.
In principle there's no reason why a C implementation can't inline qsort into the place it is called. Then if it was called with the name of the function, the optimizer could in theory observe that at the point it is used, the function pointer must still point to that same function. Then it can inline the call to the function, and the result would be similar to the result with std::sort. But in practice, compilers tend not to take the first step, inlining qsort. That's because (a) it's large, and (b) it's in a different translation unit, usually compiled into some library that your program is linked against, and (c) to do it this way, you'd have an inlined copy of qsort for every call to it, not just a copy for every different comparator. So it would be even more bloated than the C++, unless the implementation could also find a way to common up the code in cases where qsort is called in different places with the same comparator.
So, general-purpose functions like qsort in C tend to have some overheads on account of calls through function pointers, or other indirection[*]. Templates in C++ are a common way of keeping the source code generic, but ensuring that it compiles to a special-purpose function (or several such functions). The special-purpose code hopefully is faster.
It's worth noting that templates are not by any means just about performance. std::sort is itself more general-purpose than qsort in some ways. For example qsort only sorts arrays, whereas std::sort can sort anything that provides a random-access iterator. It can for example sort a deque, which under the covers is several disjoint arrays allocated separately. So the use of templates doesn't necessarily provide any performance benefit, it might be done for other reasons. It just happens that templates do affect performance.
[*] another example with sorting - qsort takes an integer parameter saying how big each element of the array is, and when it moves elements it therefore must call memcpy or similar with the value of this variable. std::sort knows at compile-time the exact type of the elements, and hence the exact size. It can inline a copy constructor call that in turn might translate to instructions to copy that number of bytes. As with the inlined comparator, it's often possible to copy exactly 4 (or 8, or 16, or whatever) bytes faster than you'd get by calling a routine that copies a variable number of bytes, passing it the value 4 (or 8, or 16, or whatever). As before, if you called qsort with a literal value for the size, and that call to qsort was inlined, then the compiler could perform the exact same optimization in C. But in practice you don't see that.
"faster" depends on what you compare it to.
Templates are fully evaluated by the compiler, and so they have zero overhead at runtime. Calling Foo<int>() is exactly as efficient as calling FooInt().
So compared to approaches which rely on more work being done at runtime, for example by calling virtual functions, templates can indeed be faster. Compared to hand-written code written precisely for that scenario, there is zero difference.
So the nice thing about templates isn't that they are "faster" than what you could do otherwise, but that they are "as fast" as hand-written code, while also being generic and reusable.
Another remarkable example of using templates to improve runtime performance is the Blitz++ numerics library. It pioneered the use of so-called expression templates, using compile-time logic to transform arithmetic expressions involving large vectors and matrices into equivalent ones that are much easier to compile to efficient machine code. For instance, given the following pseudocode:
vector<1000> a = foo(), b = bar(), c = baz(), result;
result = a + b + c;
A naive approach would add each elemnt of a and b together, store the result in a temporary vector, then do the same with c, and finally copy the result into result. Using expression template magic, the resulting code will instead be equivalent to this:
for(int i = 0; i < 1000; ++i) {
    result[i] = a[i] + b[i] + c[i];
}
This is much faster, making better use of cache locality and avoiding unnecessary temporaries along the way. It also avoids aliasing problems, where the compiler cannot prove that two pointers point to distinct memory areas, forcing it to produce unoptimal code. Expression templates are now commonly used in high-performance numerics, as well as having other uses not involving performance, such as the Boost.Spirit parsing library.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With