I have found myself quite confused by the results of my tests regarding the effects of different scheduling for #pragma omp parallel for. Basically, my confusion sprouted from the following problem:
I have a loop with an unevenly, linearly increasing workload iterating through an array of chars (relevant due to memory access and cache size). For that reason I would expect schedule(dynamic, 64) to greatly outperform schedule(static, 4) due to the sheer number of times that a thread has to load a value from non-cache memory... but it does not. The program is run on a PC with 16 available threads (the same situation happens on a laptop with 6 threads) and the -O3 flag is used for compilation. For reference, here is a partial result of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 7700X 8-Core Processor
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 8 MiB (8 instances)
L3: 32 MiB (1 instance)
I have naturally skipped past a lot of details because I'm unsure which ones could potentially be the cause of the aforementioned situation, so I will just share the entirety of the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define bool char
#define true 1
#define false 0

int main(int argc, char* argv[]){
    int max = 100000000;
    int n_array[3] = {max, max, max/2}, m_array[3] = {2, max/2, 2};
    int n = max, m = 2;
    // creates an array for the range <m, n> and fills it with true (1)
    bool *result = (bool*)malloc((n-m+1)*sizeof(bool));
    memset(result, true, (n-m+1)*sizeof(bool));
    // creates an array for the range <2, sqrt(n)> and fills it with true (1)
    bool *prime_array = (bool*)malloc((sqrt(n)+1)*sizeof(bool));
    memset(prime_array, true, (sqrt(n)+1)*sizeof(bool));
    // changes the value at any non-prime number's position to 0
    double n_root = sqrt(n);
    for(int i = 2; i <= n_root; ++i){
        for(int j = 2; j*j <= i; ++j){
            if(prime_array[j] && !(i % j)){
                prime_array[i] = false;
                break;
            }
        }
    }
    for(int k = 0; k < 3; ++k){
        n = n_array[k], m = m_array[k];
        memset(result, true, (n-m+1)*sizeof(bool));
        int total_primes;
        double avg_time = 0;
        for(int r = 0; r < 5; ++r){
            total_primes = n-m+1;
            double stime = omp_get_wtime();
            #pragma omp parallel
            {
                int local_primes = 0;
                #pragma omp for schedule(dynamic, 64) // this is the scheduling selection in question
                for(int i = m; i <= n; ++i){
                    for(int j = 2; j*j <= i; ++j){ // loop has an uneven workload because the j-loop iteration count depends on the value of i
                        if(prime_array[j] && !(i % j)){
                            result[i-m] = false;
                            ++local_primes;
                            break;
                        }
                    }
                }
                #pragma omp atomic
                total_primes -= local_primes;
            }
            double etime = omp_get_wtime();
            avg_time += etime - stime;
        }
        avg_time /= 5;
        double mns = (n-m)/avg_time/1000000; // estimates how many millions of numbers are checked per second
        printf("in range %d-%d at %f MN/s found %d primes\n", m, n, mns, total_primes);
    }
}
Edit:
I should have clarified that the uneven workload statement relates to the schedule(static) distribution. Low chunk values for static will indeed reduce that problem to nearly nothing. What's interesting, though, is that after a lot more testing I have found guided to work best. Here is a rundown:
default resulted in 25.424178MN/s (safecheck: 5761455)
guided resulted in 40.884012MN/s (safecheck: 5761455)
static with chunk 1 resulted in 22.260211MN/s (safecheck: 5761455)
static with chunk 64 resulted in 28.801939MN/s (safecheck: 5761455)
static with chunk 256 resulted in 28.527325MN/s (safecheck: 5761455)
static with chunk 2048 resulted in 29.137044MN/s (safecheck: 5761455)
static with chunk 262144 resulted in 28.770485MN/s (safecheck: 5761455)
dynamic with chunk 1 resulted in 28.236034MN/s (safecheck: 5761455)
dynamic with chunk 64 resulted in 28.871976MN/s (safecheck: 5761455)
dynamic with chunk 256 resulted in 29.238025MN/s (safecheck: 5761455)
dynamic with chunk 2048 resulted in 29.044717MN/s (safecheck: 5761455)
dynamic with chunk 262144 resulted in 28.836751MN/s (safecheck: 5761455)
MN/s describes how many millions of numbers are checked per second on average, default relates to schedule(static) with no chunk size defined, and safecheck verifies that I haven't made any mistake in the loop that would cause an incorrect prime count.
As much sense as your point makes, it all seems to end up making next to no difference between static and dynamic (with the exception of static with chunk 1), with guided being the clear favourite, which for the life of me I cannot explain.
TL;DR: on your CPU, by default, with the provided input values, the work should already be balanced. That being said, the number of threads used has a strong impact on the workload balancing. Moreover, on AMD CPUs, the chosen scheduling impacts the code alignment, which itself significantly impacts performance. GCC flags can mitigate this sneaky side effect.
I have an uneven workload loop
Well... Is this really true?
I profiled this with the provided input on my machine (an AMD Ryzen 5700U) and the workload imbalance turned out to be tiny with schedule(static,4) (and with the dynamic one too). This makes sense: the n value is so big that the work per thread is not that different. For example, in the iteration just before the last one, the first thread (with the least work) computes the values from i=99999874 to i=99999878 while the last thread (with the most work) computes the values from i=99999932 to i=99999936. The number of j-loop iterations is the same since floor(sqrt(99999874)) - floor(sqrt(99999932)) = 0, so the workload gap is more impacted by the presence of prime numbers than by the imbalance caused by the number of j-loop iterations alone. Note this is not true for much smaller n values, though, where dynamic scheduling can help (and does).
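As a quick sanity check, here is a tiny snippet (not part of the original program; it simply plugs the two chunks quoted above into the worst-case trip count of the j-loop, which is roughly floor(sqrt(i)) - 1 iterations when nothing breaks early):

#include <stdio.h>
#include <math.h>

int main(void){
    // Sum the worst-case j-loop trip counts over the two 5-value chunks mentioned above:
    // j runs from 2 to floor(sqrt(i)), so about floor(sqrt(i)) - 1 iterations per value.
    long first = 0, last = 0;
    for(int i = 99999874; i <= 99999878; ++i) first += (long)floor(sqrt((double)i)) - 1;
    for(int i = 99999932; i <= 99999936; ++i) last  += (long)floor(sqrt((double)i)) - 1;
    printf("first chunk: %ld, last chunk: %ld, gap: %ld\n", first, last, last - first);
    // Both sums are 5 * (9999 - 1) = 49990, so the gap is exactly 0.
    return 0;
}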
On top of that, modern PC CPUs generally adapt their frequency to the number of cores used (e.g. turbo boost kicks in especially when few cores are working) in order to stay within a power budget. Moreover, I think SMT (i.e. having 2 threads per core) causes the work to be better balanced with static scheduling than it would be without SMT. Thus, the theoretical downside of static scheduling is mitigated a bit.
This means a dynamic schedule can hardly be better, since it is supposed to fix a problem that does not really exist here (the impact is statistically insignificant on my machine over a few runs). Put shortly, the workload is quite balanced in practice here.
I tested the code on a 6-core Intel i5-9600KF CPU and got very different results. It turns out that the load imbalance is very dependent on the number of threads used (so, by default, on the number of logical cores). As a result, schedule(static,4) was much slower than schedule(static,64), which was itself slower than schedule(dynamic,64).
This load imbalance is caused by a very variable number of iterations executed by each thread. We can see it clearly by measuring it experimentally and printing it (note that the int64_t counter below also requires #include <stdint.h>):
#pragma omp parallel
{
    int64_t j_loop_iter_count = 0;
    int i_loop_iter_count = 0;
    int local_primes = 0;
    #pragma omp for schedule(static, 4) // this is the scheduling selection in question
    for(int i = m; i <= n; ++i){
        i_loop_iter_count++;
        for(int j = 2; j*j <= i; ++j){ // loop has an uneven workload because the j-loop iteration count depends on the value of i
            j_loop_iter_count++;
            if(prime_array[j] && !(i % j)){
                result[i-m] = false;
                ++local_primes;
                break;
            }
        }
    }
    printf("%d: %ld -- %d -- %d\n", omp_get_thread_num(), j_loop_iter_count, i_loop_iter_count, local_primes);
    #pragma omp atomic
    total_primes -= local_primes;
}
Here are the results (reordered for the sake of clarity):
With 4 threads (on 6 cores):
0: 4317619835 -- 12500000 -- 11749569
1: 4317869877 -- 12500000 -- 11749744
2: 4317278917 -- 12500000 -- 11749636
3: 4317122087 -- 12499999 -- 11749916
With 6 threads (on 6 cores):
0: 2162111600 -- 8333336 -- 7958093
1: 2161685634 -- 8333335 -- 7958162
2: 4310877102 -- 8333332 -- 7582993 <---- !!!
3: 2161290915 -- 8333332 -- 7958215
4: 2161910050 -- 8333332 -- 7958119
5: 4312015415 -- 8333332 -- 7583283 <---- !!!
With 8 threads (on 6 cores):
0: 2159301762 -- 6250000 -- 5874677
1: 2158468946 -- 6250000 -- 5875061
4: 2158318073 -- 6250000 -- 5874892
3: 2158230966 -- 6250000 -- 5875039
6: 2158461332 -- 6250000 -- 5874878
5: 2159400931 -- 6250000 -- 5874683
2: 2158817585 -- 6250000 -- 5874758
7: 2158891121 -- 6249999 -- 5874877
We can see that with 6 threads, 2 threads have far more work than the others with schedule(static,4), and that this is due to the j-based loop, since all threads have about the same number of i-based loop iterations to execute.
This does not happen much with schedule(static,64) or schedule(dynamic,64):
With schedule(static,64):
0: 2832878607 -- 8333376 -- 7841184
1: 2833791375 -- 8333375 -- 7840948
2: 2966907432 -- 8333312 -- 7817527 <----
3: 2832292445 -- 8333312 -- 7841268
4: 2833947601 -- 8333312 -- 7840883
5: 2970073256 -- 8333312 -- 7817055 <----
With schedule(dynamic,64):
0: 2898319091 -- 8403648 -- 7900154
1: 2828990870 -- 8200448 -- 7708252
2: 2892417491 -- 8371136 -- 7868203
3: 2886043732 -- 8363392 -- 7860985
4: 2885323080 -- 8353664 -- 7851703
5: 2878796452 -- 8307711 -- 7809568
These effects happen because of the break and, more specifically, because of the prime divisibility property of wheels in mathematics. This can be more easily seen here:
[figure: a grid of consecutive integers in which the columns of numbers divisible by 2 and 3 are shaded in dark gray]
The figure shows the numbers divisible by prime numbers smaller than them and bigger than 2, i.e. the numbers that trigger an early break. I colored in dark gray the columns divisible by 2 and 3; a value divisible by 2 or 3 leaves the j-loop after one or two iterations, so it is cheap. With 6 threads, 4 of the threads get more of these cheap composite numbers because their chunks land on the columns containing numbers divisible by 3, something that does not happen with 4 threads. Because of that, with 6 threads, the other 2 threads end up with more work than the rest.
As long as the number of threads is a power of two and the scheduling chunk is also a power of 2 strictly bigger than 1, this is fine: there will not be much work imbalance.
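To make this more concrete, here is a small standalone simulation (no OpenMP; it only mimics how static scheduling hands out chunks of 4 consecutive iterations round-robin, and the range and thread counts are arbitrary illustrative choices) that counts, per thread, how many of its values are divisible by 2 or 3, i.e. how many of its iterations break out of the j-loop almost immediately:

#include <stdio.h>

int main(void){
    const int chunk = 4, m = 2, n = 100000; // small range, same lower bound as the real code
    for(int threads = 4; threads <= 6; threads += 2){
        long cheap[16] = {0};
        for(int i = m; i <= n; ++i){
            int idx = i - m;                     // iteration index
            int owner = (idx / chunk) % threads; // owner under static round-robin chunking
            if(i % 2 == 0 || i % 3 == 0)
                cheap[owner]++;                  // divisible by 2 or 3: breaks after 1-2 j-iterations
        }
        printf("%d threads:", threads);
        for(int t = 0; t < threads; ++t)
            printf(" %ld", cheap[t]);
        printf("\n");
    }
    return 0;
}

With 4 threads the per-thread counts should come out essentially identical, while with 6 threads two threads (threads 2 and 5 with this lower bound) should receive noticeably fewer of these cheap iterations, matching the two threads that did roughly twice the j-loop work in the 6-thread measurement above.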
This effect fully explains the behavior on the i5-9600KF CPU. With 4 threads, schedule(dynamic,64) is only <1% slower than schedule(static,4) (mainly because the work is only very slightly unbalanced). With 6 threads, schedule(dynamic,64) is significantly faster because the work is far better balanced than with the really bad schedule(static,4). Using bigger chunks helps to balance the work better, though it is still not perfect with schedule(static,64).
On the AMD Ryzen 5700U (8 cores), a complex low-level effect happens. It seems to impact other AMD CPUs like yours but, interestingly, not Intel ones.
I have seen several weird effects on my AMD CPU. First, removing the line result[i-m] = false; made the code faster, which looks pretty crazy at first glance. Then I saw that adding some counters in the loops also made the code faster (although it is more work to do). It also causes schedule(static,4) to now be slower than schedule(dynamic,64), although it was initially the opposite. I thought the timing results were just noisy, but they are stable... at least as long as we do not change the code.
The culprit of this madness is the alignment of the assembly instructions in memory. Indeed, depending on the actual instructions used and their location in memory, they may or may not fit in the same number of cache lines. Besides, instructions crossing cache lines can also result in performance issues. This is for example the case on some specific Intel CPUs (see the JCC erratum). On my AMD CPU, I can see with perf that there are many more front-end stalls when the code is slower, which tends to confirm that this is a problem related to the layout/alignment of the instructions in memory. Moreover, I can see that the addresses of the instructions change from one schedule to another. Fortunately, we can control this with GCC using the flags -falign-functions=4096 -falign-loops=64 -falign-jumps=64. With these enabled, the timings of schedule(static,4) and schedule(dynamic,64) are very close: the gap between the two is less than 1%!
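For reference, assuming GCC and a source file named primes.c (a name picked just for illustration), the full compile line would then look something like this (the math library is needed for sqrt):

gcc -O3 -fopenmp -falign-functions=4096 -falign-loops=64 -falign-jumps=64 primes.c -o primes -lm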
The rule of thumb is that changing the OpenMP schedule can have side effects, and these side effects can have a bigger performance impact than the OpenMP scheduling itself! Actually, this is true for any kind of optimization: code/data alignment can make developers wrongly believe they optimized the code and draw wrong general conclusions like "XXX scheduling is faster than YYY scheduling". Meanwhile, in practice, this can change from one CPU to another, or even from one execution to another (e.g. depending on what other background processes do, the state of virtual memory, the CPU governor, or even the CPU temperature). For more information about this, you can watch the great "Performance Matters" talk by Emery Berger (available on YouTube).