This is a follow-up of this SO Answer. Given a flat input range and three size_t dimensions, the code creates a nested random_access_range of random_access_ranges of random_access_ranges, modelling a three-dimensional array.
Quickbench
Iterating over the elements in the "multidimensional" range using a nested for loop and indices is noticeably slower than directly iterating over the elements of the input range (about a factor of 4). Some performance drop is to be expected, but a factor of 4 hurts a bit.
Even worse, recursively views::join-ing the multidimensional range back into a flat range and iterating over that flat range is slower by a factor of 20. Having read through the comments in this GitHub issue, it can be expected that views::join comes with some additional overhead, but a factor of 20 seems a bit much.
How can this large overhead with views::join be explained? Am I using it wrong, or is something fishy with my benchmark? Is there anything that can be done to speed up the code, or are ranges just a poor choice for this kind of application?
The code can be found under the Quickbench link above; I will add it here for completeness:
#include <vector>
#include <ranges>
#include <cassert>
#include <iostream>
template <size_t dim>
struct Slice {
// default constructor leaves start at zero and end at dim. Corresponds to the whole dimension
constexpr Slice() = default;
// Create a slice with a single index
constexpr Slice(size_t i) : begin(i), end(i+1) {
assert( (0 <= i) && (i < dim));
}
// Create a slice with a start and an end index
constexpr Slice(size_t s, size_t e) : begin(s), end(e+1) {
assert( (0 <= s) && (s <= e) && (e < dim) );
}
size_t begin {0};
size_t end {dim};
};
// An adaptor object to interpret a flat range as a multidimensional array
template <size_t dim, size_t... dims>
struct MD {
constexpr static auto dimensions = std::make_tuple(dim, dims...);
consteval static size_t size(){
if constexpr (sizeof...(dims) > 0) {
return dim*(dims * ...);
}
else {
return dim;
}
}
// returns a multidimensional range over the elements in the flat array
template <typename Rng>
constexpr static auto slice(
Rng&& range,
Slice<dim> const& slice,
Slice<dims> const&... slices
)
{
return slice_impl(range, 0, slice, slices...);
}
template <typename Rng>
constexpr static auto slice_impl(
Rng&& range,
size_t flat_index,
Slice<dim> const& slice,
Slice<dims> const&... slices
)
{
if constexpr (std::ranges::sized_range<Rng>) { assert(std::size(range) >= size()); }
static_assert(sizeof...(slices) == sizeof...(dims), "wrong number of slice arguments.");
if constexpr (sizeof...(slices) == 0)
{
// end recursion at inner most range
return range | std::views::drop(flat_index*dim + slice.begin) | std::views::take(slice.end - slice.begin);
}
else
{
// for every index to be kept in this dimension, recurse to the next dimension and increment the flat_index
return std::views::iota(slice.begin, slice.end) | std::views::transform(
[&range, flat_index, slices...](size_t i){
return MD<dims...>::slice_impl(range, flat_index*dim + i, slices...);
}
);
}
}
// convenience overload for the full view
template <typename Rng>
constexpr static auto slice(Rng&& range){
return slice(range, Slice<dim>{}, Slice<dims>{}...);
}
};
// recursively join a range of ranges
// https://stackoverflow.com/questions/63249315/use-of-auto-before-deduction-of-auto-with-recursive-concept-based-fun
template <typename Rng>
auto flat(Rng&& rng) {
using namespace std::ranges;
auto joined = rng | views::join;
if constexpr (range<range_value_t<decltype(joined)>>) {
return flat(joined);
} else {
return joined;
}
}
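For reference, here is roughly how the adaptor and flat are meant to be used together. This is a minimal sketch, not part of the benchmarks; the 2x3x2 extents and values are purely illustrative:
std::vector<double> v(MD<2, 3, 2>::size(), 1.0);
// nested range of ranges of ranges over the whole flat buffer
auto md = MD<2, 3, 2>::slice(v);
// index it like a 3d array (all views are random access)
md[1][2][0] = 42.0;
// recursively join it back into a flat range and iterate
for (double x : flat(md)) { std::cout << x << ' '; }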
Iterate over two 6x6x6 slices out of 10x10x10 arrays and add the elements of one slice to another.
// define the dimensions of a 3d-array
constexpr size_t nx{10}, ny{10}, nz{10};
// define the contents of two nx x ny x nz arrays in and out
std::vector<double> Vout(nx*ny*nz, 0.);
std::vector<double> Vin(nx*ny*nz, 1.23);
// define some slice indices for each dimension
size_t lx{0}, ly{0}, lz{0};
size_t hx{5}, hy{5}, hz{5};
auto r_in = MD<nx,ny,nz>::slice(Vin, {lx, hx}, {ly, hy}, {lz, hz});
auto r_out = MD<nx,ny,nz>::slice(Vout, {lx, hx}, {ly, hy}, {lz, hz});
for (int k=lz; k<hz; ++k)
for (int j=ly; j<hy; ++j)
for (int i=lx; i<hx; ++i)
Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];
This benchmark just tests the time of creating the two MD<2,3,2> objects and the flat ranges without iterating over them.
// C++23: for (auto [o, i] : std::views::zip(flat(r_out), flat(r_in))) { o += i; }
auto r_in_flat = flat(r_in);
auto r_out_flat = flat(r_out);
auto o = r_out_flat.begin();
auto i = r_in_flat.begin();
for(; o != r_out_flat.end(); i++, o++){
*o += *i;
}
for (size_t x = 0; x <= hx-lx; ++x)
for (size_t y = 0; y <= hy-ly; ++y)
for (size_t z = 0; z <= hz-lz; ++z)
r_out[x][y][z] += r_in[x][y][z];
I found an issue with the benchmark: the traditional loop missed some values because I used < in the conditions of the for loops where I should have used <=.
for (int k=lz; k<=hz; ++k)
for (int j=ly; j<=hy; ++j)
for (int i=lx; i<=hx; ++i)
Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];
With this, the difference is a little less dramatic: the nested loop using ranges is 2x slower than the traditional loop, and the loop over the joined ranges is 12x slower. Still, I would have hoped for a lower penalty.
Quickbench
Motivated by @Newbies' comment I ran some benchmarks using a 1x1xN array. Interestingly, the first quick check showed really terrible results, where the non-joined ranges implementation was 450x slower than the native nested loop: https://quick-bench.com/q/-ZHPSTtvF4EZVg3JmuqMec4TYyU.
So I ran some tests using a 1xN array to benchmark the ranges patterns I used in the implementation:
drop_take: In the right-most dimension I simply std::views::drop the first few elements and std::views::take the number of elements I am looking for. This range is of the form take(drop(input_range)). This take_drop pattern works nicely, and iterating over it is basically as fast as iterating over the input range.
iota_transform: In all dimensions but the right-most one, I use std::views::transform to map each index of a std::views::iota to the range obtained from the neighboring dimension to the right via recursion. So for the second-to-right dimension, we create a range of ranges of the form transform(iota, LAMBDA(take(drop(input_range)))). The benchmark shows that this causes a doubling of calculation time (presumably because of the lack of vectorization?).
join: Not much of a "pattern", but I included a benchmark for iterating over join(transform(iota, LAMBDA(take(drop(input_range))))). The performance drops again, by a factor of 5.5. The three patterns are sketched below.
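To make the patterns concrete, here is roughly what the benchmarked ranges look like for a 1xN array. This is a hedged sketch with made-up names (input, N); the actual benchmark code is behind the Quickbench link below:
namespace v = std::views;
// drop_take: basically as fast as iterating over the input range
auto drop_take = input | v::drop(0) | v::take(N);
// iota_transform: a range of ranges, roughly 2x slower to iterate
auto iota_transform = v::iota(size_t{0}, size_t{1})
    | v::transform([&](size_t i){ return input | v::drop(i*N) | v::take(N); });
// join: the same elements flattened again, another factor of ~5.5 slower
auto joined = iota_transform | v::join;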
Quickbench
So maybe the iota_transform pattern is an anti-pattern? Using std::views::iota to construct a range of ranges based on a list of indices seemed canonical to me, though the developers probably didn't have ranges as the output of std::views::transform in mind. The actual range I want to iterate over is produced inside the lambda expression passed to the transform, so maybe this is a hard barrier for the compiler's optimizations?
But even then, it leaves the question unanswered why std::views::join should be so much slower. I can't fathom why it would require 5.5 times as much calculation time.
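As a rough intuition for where the extra work goes with join, here is a loose pseudocode sketch of what a join iterator has to do on every step (not the actual library implementation):
// every ++ must check whether the current inner range is exhausted and,
// if so, advance the outer iterator and re-enter the next inner range;
// with transform(iota, LAMBDA(...)) as the outer range, "re-entering"
// means invoking the lambda and materializing a fresh take(drop(...))
// view before iteration can continue
++inner;
while (inner == std::ranges::end(current_inner)) {
    if (++outer == outer_end) break;        // reached the end
    current_inner = *outer;                 // call the lambda
    inner = std::ranges::begin(current_inner);
}
In the benchmark this bookkeeping happens for two joined ranges (input and output) on every element, which plausibly keeps the compiler from turning the loop into a simple linear scan.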
I think you may find the answer to this in the talk "Multidimensional C++ - Bryce Adelstein Lelbach - CppNorth 2022" from around 30 - 45 minutes in.
Peter: Fair enough, I'll give a full answer:
In summary, most optimizers are not able to vectorize the flattened/joined iterator. He shows several approaches: one where the iterator contains the current multidimensional indices, one where it only contains the linear index (and must then reconstruct the multidimensional indices, which is even worse), and one approach with coroutines, which works better on some compilers, so it requires a good coroutine implementation plus compiler support.
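To illustrate the first two iterator designs he compares, here is a hedged sketch for a strided 3D view (such as a slice of a larger array). This is not the code from the talk, and it is stripped down to the parts that matter for the optimizer (no end comparison, no iterator traits):
// Design 1: the iterator carries the current multidimensional indices
// and advances them with explicit carries; more state, but no division.
struct md_indices_iter {
    double* origin;            // first element of the slice
    size_t i{0}, j{0}, k{0};   // current position within the slice
    size_t nj, nk;             // extents needed for the carries
    size_t si, sj, sk;         // strides into the underlying buffer
    double& operator*() const { return origin[i*si + j*sj + k*sk]; }
    md_indices_iter& operator++() {
        if (++k == nk) { k = 0; if (++j == nj) { j = 0; ++i; } }
        return *this;
    }
};
// Design 2: the iterator carries only a linear index and reconstructs
// the multidimensional indices on every dereference; the div/mod per
// element is what makes this variant fare even worse.
struct linear_index_iter {
    double* origin;
    size_t n{0};               // single linear position within the slice
    size_t nj, nk;             // extents needed to reconstruct (i, j, k)
    size_t si, sj, sk;
    double& operator*() const {
        size_t k = n % nk, j = (n / nk) % nj, i = n / (nk*nj);
        return origin[i*si + j*sj + k*sk];
    }
    linear_index_iter& operator++() { ++n; return *this; }
};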
The supposedly bad performance led to the conclusion that C++23 std::mdspan is not going to have flattened iterator support, which I think was unwise, but I guess it can be added in the future.
I happened to recently implement a multi-span type in pure C, and I found that the flattened iterator is around half the speed of the nested one. This is what I would expect, and well worth it given the benefits of such a feature.
Here's a godbolt link that shows that flattened iteration performance can be good (in C at least).