This is a follow-up of this SO Answer. Given a flat input range and three size_t dimensions, the code creates a nested random_access_range of random_access_ranges of random_access_ranges, modelling a three-dimensional array.
Quickbench
Iterating over the elements in the "multidimensional" range using a nested for loop and indices is noticeably slower than directly iterating over the elements of the input range (about a factor of 4). Some performance drop is to be expected, but a factor of 4 hurts a bit.
Even worse, recursively views::join-ing the multidimensional range back into a flat range and iterating over that flat range is slower by a factor of 20. Having read through the comments in this GitHub issue, it can be expected that views::join comes with some additional overhead, but a factor of 20 seems a bit much.
How can this large overhead with views::join be explained? Am I using it wrong, or is something fishy with my benchmark? Is there anything that can be done to speed up the code, or are ranges just a poor choice for this kind of application?
The code can be found under the Quickbench link above; I will add it here for completeness:
#include <vector>
#include <ranges>
#include <cassert>
#include <iostream>
template <size_t dim>
struct Slice {
// default constructor leaves start at zero and end at dim. Corresponds to the whole dimension
constexpr Slice() = default;
// Create a slice with a single index
constexpr Slice(size_t i) : begin(i), end(i+1) {
assert( (0 <= i) && (i < dim));
}
// Create a slice with a start and an end index
constexpr Slice(size_t s, size_t e) : begin(s), end(e+1) {
assert( (0 <= s) && (s <= e) && (e < dim) );
}
size_t begin {0};
size_t end {dim};
};
// An adaptor object to interpret a flat range as a multidimensional array
template <size_t dim, size_t... dims>
struct MD {
constexpr static auto dimensions = std::make_tuple(dim, dims...);
consteval static size_t size(){
if constexpr (sizeof...(dims) > 0) {
return dim*(dims * ...);
}
else {
return dim;
}
}
// returns a multidimensional range over the elements in the flat array
template <typename Rng>
constexpr static auto slice(
Rng&& range,
Slice<dim> const& slice,
Slice<dims> const&... slices
)
{
return slice_impl(range, 0, slice, slices...);
}
template <typename Rng>
constexpr static auto slice_impl(
Rng&& range,
size_t flat_index,
Slice<dim> const& slice,
Slice<dims> const&... slices
)
{
if constexpr (std::ranges::sized_range<Rng>) { assert(std::size(range) >= size()); }
static_assert(sizeof...(slices) == sizeof...(dims), "wrong number of slice arguments.");
if constexpr (sizeof...(slices) == 0)
{
// end recursion at inner most range
return range | std::views::drop(flat_index*dim + slice.begin) | std::views::take(slice.end - slice.begin);
}
else
{
// for every index to be kept in this dimension, recurse to the next dimension and increment the flat_index
return std::views::iota(slice.begin, slice.end) | std::views::transform(
[&range, flat_index, slices...](size_t i){
return MD<dims...>::slice_impl(range, flat_index*dim + i, slices...);
}
);
}
}
// convenience overload for the full view
template <typename Rng>
constexpr static auto slice(Rng&& range){
return slice(range, Slice<dim>{}, Slice<dims>{}...);
}
};
// recursively join a range of ranges
// https://stackoverflow.com/questions/63249315/use-of-auto-before-deduction-of-auto-with-recursive-concept-based-fun
template <typename Rng>
auto flat(Rng&& rng) {
using namespace std::ranges;
auto joined = rng | views::join;
if constexpr (range<range_value_t<decltype(joined)>>) {
return flat(joined);
} else {
return joined;
}
}
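For reference, here is roughly how the adaptor and flat are meant to be used together. This is a minimal sketch, not part of the benchmarks; the 2x3x2 extents and values are purely illustrative:
std::vector<double> v(MD<2, 3, 2>::size(), 1.0);
// nested range of ranges of ranges over the whole flat buffer
auto md = MD<2, 3, 2>::slice(v);
// index it like a 3d array (all views are random access)
md[1][2][0] = 42.0;
// recursively join it back into a flat range and iterate
for (double x : flat(md)) { std::cout << x << ' '; }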
Iterate over two 6x6x6 slices out of 10x10x10 arrays and add the elements of one slice to another.
// define the dimensions of a 3d-array
constexpr size_t nx{10}, ny{10}, nz{10};
// define the contents of two nx x ny x nz arrays in and out
std::vector<double> Vout(nx*ny*nz, 0.);
std::vector<double> Vin(nx*ny*nz, 1.23);
// define some slice indices for each dimension
size_t lx{0}, ly{0}, lz{0};
size_t hx{5}, hy{5}, hz{5};
auto r_in = MD<nx,ny,nz>::slice(Vin, {lx, hx}, {ly, hy}, {lz, hz});
auto r_out = MD<nx,ny,nz>::slice(Vout, {lx, hx}, {ly, hy}, {lz, hz});
for (int k=lz; k<hz; ++k)
for (int j=ly; j<hy; ++j)
for (int i=lx; i<hx; ++i)
Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];
This benchmark just tests the time of creating the two MD<2,3,2> objects and the flat ranges without iterating over them.
// C++23: for (auto [o, i] : std::views::zip(flat(r_out), flat(r_in))) { o += i; }
auto r_in_flat = flat(r_in);
auto r_out_flat = flat(r_out);
auto o = r_out_flat.begin();
auto i = r_in_flat.begin();
for(; o != r_out_flat.end(); i++, o++){
*o += *i;
}
for (size_t x = 0; x <= hx-lx; ++x)
for (size_t y = 0; y <= hy-ly; ++y)
for (size_t z = 0; z <= hz-lz; ++z)
r_out[x][y][z] += r_in[x][y][z];
I found an issue with the benchmark: the traditional loop missed some values because I used < in the conditions of the for loops where I should have used <=.
for (int k=lz; k<=hz; ++k)
for (int j=ly; j<=hy; ++j)
for (int i=lx; i<=hx; ++i)
Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];
With this, the difference is a little less dramatic: the nested loop using ranges is 2x slower than the traditional loop, and the loop over the joined ranges is 12x slower. Still, I would have hoped for a lower penalty.
Quickbench
Motivated by @Newbies' comment I ran some benchmarks using a 1x1xN array. Interestingly, the first quick check showed really terrible results, where the non-joined ranges implementation was 450x slower than the native nested loop: https://quick-bench.com/q/-ZHPSTtvF4EZVg3JmuqMec4TYyU.
So I ran some tests using a 1xN array to benchmark the ranges patterns I used in the implementation:
drop_take: In the right-most dimension I simply std::views::drop the first few elements and std::views::take the number of elements I am looking for. This range is of the form take(drop(input_range)). This take_drop pattern works nicely, and iterating over it is basically as fast as iterating over the input range.
iota_transform: In all dimensions but the right-most one, I use std::views::transform to map each index of a std::views::iota to the range obtained from the neighboring dimension to the right via recursion. So for the second-to-right dimension, we create a range of ranges of the form transform(iota, LAMBDA(take(drop(input_range)))). The benchmark shows that this causes a doubling of calculation time (presumably because of the lack of vectorization?).
join: Not much of a "pattern", but I included a benchmark for iterating over join(transform(iota, LAMBDA(take(drop(input_range))))). The performance drops again, by a factor of 5.5. The three patterns are sketched below.
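To make the patterns concrete, here is roughly what the benchmarked ranges look like for a 1xN array. This is a hedged sketch with made-up names (input, N); the actual benchmark code is behind the Quickbench link below:
namespace v = std::views;
// drop_take: basically as fast as iterating over the input range
auto drop_take = input | v::drop(0) | v::take(N);
// iota_transform: a range of ranges, roughly 2x slower to iterate
auto iota_transform = v::iota(size_t{0}, size_t{1})
    | v::transform([&](size_t i){ return input | v::drop(i*N) | v::take(N); });
// join: the same elements flattened again, another factor of ~5.5 slower
auto joined = iota_transform | v::join;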
Quickbench
So maybe the iota_transform pattern is an anti-pattern? Using std::views::iota to construct a range of ranges based on a list of indices seemed canonical to me, though the developers probably didn't have ranges as the output of std::views::transform in mind. The actual range I want to iterate over is produced inside the lambda expression passed to the transform, so maybe this is a hard barrier for the compiler's optimizations?
But even then, it leaves the question unanswered why std::views::join should be so much slower. I can't fathom why it would require 5.5 times as much calculation time.
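As a rough intuition for where the extra work goes with join, here is a loose pseudocode sketch of what a join iterator has to do on every step (not the actual library implementation):
// every ++ must check whether the current inner range is exhausted and,
// if so, advance the outer iterator and re-enter the next inner range;
// with transform(iota, LAMBDA(...)) as the outer range, "re-entering"
// means invoking the lambda and materializing a fresh take(drop(...))
// view before iteration can continue
++inner;
while (inner == std::ranges::end(current_inner)) {
    if (++outer == outer_end) break;        // reached the end
    current_inner = *outer;                 // call the lambda
    inner = std::ranges::begin(current_inner);
}
In the benchmark this bookkeeping happens for two joined ranges (input and output) on every element, which plausibly keeps the compiler from turning the loop into a simple linear scan.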
I think you may find the answer to this in the talk "Multidimensional C++ - Bryce Adelstein Lelbach - CppNorth 2022" from around 30 - 45 minutes in.
Peter: Fair enough, I'll give a full answer:
In summary, most optimizers are not able to vectorize the flattened/joined iterator. He shows several approaches: one where the iterator contains the current multidimensional indices, one where it only contains the linear index (and must then reconstruct the multidimensional indices, which is even worse), and one approach with coroutines, which works better on some compilers, so it requires a good coroutine implementation plus compiler support.
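To illustrate the first two iterator designs he compares, here is a hedged sketch for a strided 3D view (such as a slice of a larger array). This is not the code from the talk, and it is stripped down to the parts that matter for the optimizer (no end comparison, no iterator traits):
// Design 1: the iterator carries the current multidimensional indices
// and advances them with explicit carries; more state, but no division.
struct md_indices_iter {
    double* origin;            // first element of the slice
    size_t i{0}, j{0}, k{0};   // current position within the slice
    size_t nj, nk;             // extents needed for the carries
    size_t si, sj, sk;         // strides into the underlying buffer
    double& operator*() const { return origin[i*si + j*sj + k*sk]; }
    md_indices_iter& operator++() {
        if (++k == nk) { k = 0; if (++j == nj) { j = 0; ++i; } }
        return *this;
    }
};
// Design 2: the iterator carries only a linear index and reconstructs
// the multidimensional indices on every dereference; the div/mod per
// element is what makes this variant fare even worse.
struct linear_index_iter {
    double* origin;
    size_t n{0};               // single linear position within the slice
    size_t nj, nk;             // extents needed to reconstruct (i, j, k)
    size_t si, sj, sk;
    double& operator*() const {
        size_t k = n % nk, j = (n / nk) % nj, i = n / (nk*nj);
        return origin[i*si + j*sj + k*sk];
    }
    linear_index_iter& operator++() { ++n; return *this; }
};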
The supposedly bad performance led to the conclusion that C++23 std::mdspan is not going to have flattened iterator support, which I think was unwise, but I guess it can be added in the future.
I happened to recently implement a multi-span type in pure C, and I found that the flattened iterator is around half the speed of the nested one. This is what I would expect, and well worth it given the benefits of such a feature.
Here's a godbolt link that shows that flattened iteration performance can be good (in C at least).