I want to parallelize the code below:
for(int c=0; c<n; ++c) {
    Work(someArray, c);
}
I've done it this way:
#include <future>
#include <thread>
#include <vector>
auto iterationsPerCore = n/numCPU;
std::vector<std::future<void>> futures;
for(auto th = 0; th < numCPU; ++th) {
    for(auto n = th * iterationsPerCore; n < (th+1) * iterationsPerCore; ++n) {
        auto ftr = std::async( std::launch::deferred | std::launch::async,
            [n, iterationsPerCore, someArray]()
            {
                for(auto m = n; m < n + iterationsPerCore; ++m)
                    Work(someArray, m);
            }
        );
        futures.push_back(std::move(ftr));
    }
    for(auto& ftr : futures)
        ftr.wait();
}
// rest of iterations: n%iterationsPerCore
for(auto r = numCPU * iterationsPerCore; r < n; ++r)
    Work(someArray, r);
The problem is that it runs only 50% faster on Intel CPUs, while on AMD it runs 300% faster. I ran it on three Intel CPUs (Nehalem 2-core + HT, Sandy Bridge 2-core + HT, Ivy Bridge 4-core + HT). The AMD processor is a Phenom II X2 with 4 cores unlocked. On the 2-core Intel processors it runs 50% faster with 4 threads. On the 4-core one it also runs 50% faster with 4 threads. I'm testing with VS2012 on Windows 7.
When I try it with 8 threads, it is 8x slower than the serial loop on Intel. I suppose this is caused by HT.
What do you think? What is the reason for this behavior? Could the code be incorrect?
I'd suspect false sharing. This is what happens when two variables that are written by different cores share the same cache line. The cache can only operate on whole cache lines of a fixed size, even if your operations are more fine-grained, so every write to one variable forces the line to be expensively synchronized between the cores, even though the variables themselves are never accessed concurrently. I would suspect that the AMD hardware is simply more resilient or has a different hardware design to cope with this.
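To make the effect easier to reproduce in isolation, here is a minimal sketch, not taken from the question's code. The names `Packed`, `Padded`, `hammer` and `kCacheLine` are made up for illustration, and the 64-byte line size is an assumption that matches the Intel CPUs mentioned. It needs a compiler with C++11 `alignas` and `constexpr`, which VS2012 may not provide (there, `__declspec(align(64))` would be the likely substitute).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed cache-line size in bytes (hypothetical constant, not from the question).
constexpr std::size_t kCacheLine = 64;

// Two counters that will normally end up in the same cache line.
struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Same counters, but each one is aligned to its own cache line.
struct Padded {
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

static_assert(sizeof(Padded) >= 2 * kCacheLine,
              "each padded counter should occupy its own cache line");

// Two threads increment two *different* counters. No data is logically shared,
// so any slowdown of Packed relative to Padded comes from cache-line contention.
template <typename Counters>
long hammer(Counters& c, long iterations) {
    assert(iterations >= 0);
    std::thread t1([&c, iterations] {
        for (long i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&c, iterations] {
        for (long i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

Timing `hammer` on a `Packed` and on a `Padded` instance (for example with `std::chrono::steady_clock` and a few hundred million iterations) should show whether a given machine is sensitive to this. How large the gap is depends on the CPU, so treat it as an experiment rather than a guaranteed result.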
To test this, change the code so that each core only works on chunks whose size is a multiple of 64 bytes. This should avoid any cache line sharing, since these Intel CPUs use 64-byte cache lines.
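One way to try this suggestion is sketched below. Several things are assumptions, since the question does not show them: the array is taken to be a plain `float` buffer, `Work` here is only a stand-in for the real function, and `chunkSize`, `parallelWork` and `kCacheLine` are hypothetical names. The sketch launches exactly one task per chunk and rounds each chunk up to a whole number of cache lines. Chunk boundaries only coincide with cache-line boundaries if the buffer itself starts on a 64-byte boundary, which is not enforced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Assumed cache-line size in bytes.
constexpr std::size_t kCacheLine = 64;

// Stand-in for the question's Work(someArray, c); the real body is unknown.
void Work(float* someArray, std::size_t c) {
    someArray[c] = static_cast<float>(c) * 2.0f;
}

// Number of elements per task, rounded up to a whole number of cache lines.
std::size_t chunkSize(std::size_t n, std::size_t numTasks) {
    assert(numTasks > 0);
    const std::size_t perLine = kCacheLine / sizeof(float);  // elements per line
    const std::size_t raw = (n + numTasks - 1) / numTasks;   // ceil(n / numTasks)
    return ((raw + perLine - 1) / perLine) * perLine;
}

// Splits [0, n) into contiguous chunks and runs one async task per chunk.
void parallelWork(float* someArray, std::size_t n, std::size_t numTasks) {
    const std::size_t chunk = chunkSize(n, numTasks);
    std::vector<std::future<void>> futures;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);  // last chunk may be short
        futures.push_back(std::async(std::launch::async, [=] {
            for (std::size_t m = begin; m < end; ++m)
                Work(someArray, m);
        }));
    }
    // Wait for all tasks; get() also rethrows any exception from a task.
    for (auto& f : futures)
        f.get();
}
```

A call such as `parallelWork(data, n, numCPU)` would replace both the parallel loop and the separate remainder loop, because the last chunk simply ends at `n`. If the speedup on Intel improves noticeably with this layout, false sharing was a likely contributor; if not, the cause is probably elsewhere.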
I would say you need to change your compiler settings so that the compiled code minimizes the number of branches. The two CPU families handle instruction look-ahead differently. Set the compiler's optimization options to match the target CPU, not the CPU on which the code is compiled.