There is a loop.
for (int i = 0; i < n; ++i) {
    //...
    v[i] = o.f(i);
    //...
}
Each v[i] = o.f(i) is independent of all the other v[i] = o.f(i).n can be any value and it may not be a multiple of the number of cores. What is the simplest way to use all the cores to do this?
Yes, a single process can run multiple threads on different cores. Caching is specific to the hardware. Many modern Intel processors have three layers of caching, where the last level cache is shared across cores.
The ExecutionPolicy overloads of the algorithms in <algorithm> exist for this purpose. std::transform applies a function to each element of a source range to assign to a destination range.
v.begin() is an acceptable destination, so long as v is of appropriate size. Your snippet assumes this when it uses v[i], so I will too.
We then need an iterator that gives the values [0, n) as our source, so boost::counting_iterator<int>.
Finally we need a Callable that will apply o.f to our values, so lets capture o in a lambda.
#include <algorithm>
#include <execution>
#include <boost/iterator/counting_iterator.hpp>
// assert(v.size() >= n)
std::transform(std::execution::par, boost::counting_iterator<int>(0), boost::counting_iterator<int>(n), v.begin(), [&o](int i){ return o.f(i); });
If o.f does not perform any "vectorization-unsafe operations", you are able to use std::execution::par_unseq, which may interleave calls on the same thread (i.e. unroll the loop and use SIMD instructions)
In the land of existing compilers, and remembering that M/S can't even get this stuff right for C++11, never mind about C++17/20, the C++11 answer goes something like:
typedef v.value_type R;
std::vector< std::future<R> > fut(n);
for (int i=0; i<n; i++)
    fut[i] = std::async(std::launch::async, O::f, o, i);
for (auto& f : fut)
    v.push_back(f.get());
@arne suggests we can do better by throttling the number of tasks by considering the number of processors (P), which is true, though the above code will give you a clear indication on whether you will really benefit from multi-threading the method f. Given we only want to launch X jobs simultaneously, where X is > P, < 3*P depending on the variation in job complexity (note I am relying on a signed index):
typedef v.value_type R;
std::vector< std::future<R> > fut(n);
for (ssize_t i=0, j=-X; j<n; i++,j++)
{
    if (i<n)    fut[i] = std::async(std::launch::async, O::f, o, i);
    if (j>=0)   v.push_back(fut[j].get());
}
I'm not claiming the above code is "great", but if the jobs are complex enough for us to need multithreading, the cost of looping a few extra times isn't gointg to be noticed. You will notice that if X > n the loop will spin a few times in the middle, but will produce the correct result :-)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With