Assume that you have to implement a static_vector<T, N> class, which is a fixed capacity container that entirely lives on the stack and never allocates, and exposes an std::vector-like interface. (Boost provides boost::static_vector.)
Considering that we must have uninitialized storage for maximum N instances of T, there are multiple choices that can be made when designing the internal data layout:
Single-member union:
union U { T _x; };
std::array<U, N> _data;
Single std::aligned_storage_t:
std::aligned_storage_t<sizeof(T) * N, alignof(T)> _data;
Array of std::aligned_storage_t:
using storage = std::aligned_storage_t<sizeof(T), alignof(T)>;
std::array<storage, N> _data;
Regardless of the choice, creating the members will require the use of "placement new" and accessing them will require something along the lines of reinterpret_cast.
Now assume that we have two very minimal implementations of static_vector<T, N>:
with_union: implemented using the "single-member union" approach;
with_storage: implemented using the "single std::aligned_storage_t" approach.
Let's perform the following benchmark using both g++ and clang++ with -O3. I used quick-bench.com for this task:
void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }
void clobber() { asm volatile("" : : : "memory"); }
template <typename Vector>
void test()
{
for(std::size_t j = 0; j < 10; ++j)
{
clobber();
Vector v;
for(int i = 0; i < 123456; ++i) v.emplace_back(i);
escape(&v);
}
}
(escape and clobber are taken from Chandler Carruth's CppCon 2015 talk: "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!")
g++ 7.2 (live here):
clang++ 5.0 (live here):
As you can see from the results, g++ seems to be able to aggressively optimize (vectorization) the implementation that uses the "single std::aligned_storage_t" approach, but not the implementation using the union.
My questions are:
Is there anything in the Standard that prevents the implementation using union from being aggressively optimized? (I.e. does the Standard grant more freedom to the compiler when using std::aligned_storage_t - if so, why?)
Is this purely a "quality of implementation" issue?
xskxzr is right, this is the same issue as in this question. Fundamentally, gcc is missing an optimization opportunity in forgetting that std::array's data is aligned. John Zwinck helpfully reported bug 80561.
You can verify this in your benchmark by making one of two changes to with_union:
Change _data from a std::array<U, N> to simply a U[N]. Performance becomes identical
Remind gcc that _data is actually aligned by changing the implementation of emplace_back() to:
template <typename... Ts>
T& emplace_back(Ts&&... xs)
{
U* data = static_cast<U*>(__builtin_assume_aligned(_data.data(), alignof(U)));
T* ptr = &data[_size++]._x;
return *(new (ptr) T{std::forward<Ts>(xs)...});
}
Either of those changes with the rest of your benchmark gets me comparable results between WithUnion and WithAlignedStorage.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With