I was trying to use different number of threads in different parts of a program to achieve the maximum acceleration. However it is found that switching the thread number using the num_threads clause incurs significant overhead. I am looking for an explanation for this, since according to my understanding the thread pool should always contain a given number of threads, regardless of the actual number that got invoked. I am also looking for possible work-arounds against this. Thanks.
Sample code:
#include<cstdio>
#include<omp.h>
void omp_sum(int ntd) {
    int s = 0;
    #pragma omp parallel num_threads(ntd)
    {
        int i = omp_get_thread_num();
        #pragma omp atomic
        s += i;
    }
}   
int main()
{
    int N = 100;
    int NT1 = 6, NT2 = 12;
    double t;
    t = omp_get_wtime();
    for(int n=0;n<N;n++) {
        omp_sum(NT1);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6 );
    t = omp_get_wtime();
    for(int n=0;n<N;n++) {
        omp_sum(NT2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6 );
    t = omp_get_wtime();
    for(int n=0;n<N;n++) {
        omp_sum(NT1);
        omp_sum(NT1);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6 );
    t = omp_get_wtime();
    for(int n=0;n<N;n++) {
        omp_sum(NT2);
        omp_sum(NT2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6 );
    t = omp_get_wtime();
    for(int n=0;n<N;n++) {
        omp_sum(NT1);
        omp_sum(NT2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6 );
}
Sample output (in us):
1034.069001
1058.620000
1034.572000
2210.681000
18234.355000
EDIT: The workstation running the code has 2 hexa-core Intel E5-2630L CPUs, so there should be a total of 12 hardware cores and 24 hyperthreads. I was using Fedora 19 with GCC 4.8.2.
I can reproduce your results with GCC 4.8 (g++ -O3 -fopenmp foo.cpp) on my four core system/eight hyper thread system. I changed N1 to 4 and N2 to 8.
Your function omp_sum is simple
pushq   %rbx    
movq    %rdi, %rbx
call    omp_get_thread_num
movq    (%rbx), %rdx
lock addl   %eax, (%rdx)
popq    %rbx
ret
Here is the assembly code for the loop
for(int n=0;n<N;n++) {
    omp_sum(NT1);
    omp_sum(NT2);
}
.L10
leaq    32(%rsp), %rsi
xorl    %ecx, %ecx
movl    $4, %edx
movl    $_Z7omp_sumi._omp_fn.0, %edi
movl    $0, 28(%rsp)
movq    %rbx, 32(%rsp)
call    GOMP_parallel
leaq    32(%rsp), %rsi
xorl    %ecx, %ecx
movl    $8, %edx
movl    $_Z7omp_sumi._omp_fn.0, %edi
movl    $0, 28(%rsp)
movq    %rbx, 32(%rsp)
call    GOMP_parallel
subl    $1, %ebp
jne .L10
This is almost identical to the assembly for the loop
for(int n=0;n<N;n++) {
    omp_sum(NT2);
    omp_sum(NT2);
}
The only change being movl $4, %edx instead of movl  $8, %edx. So it's hard to see what is causing the problem. All the magic happens in GOMP_parallel. One would have to look at the source code of GOMP_parallel but my guess is that GOMP_parallel checks the number of threads that were last used in a parallel call and if the new parallel call uses a different number of threads it has some overhead to switch.  That overhead is much larger than your simple function.
But I'm not sure why this would ever be a problem. In practice it does not make sense to use such short parallel sections (one would parallelize a loop and N would be much larger) so the overhead should not be problem.
Edit: Section 2.41 of the OpenMP 3.1 specification titled "Determining the Number of Threads for a parallel Region" gives the algorithm for determining the number of threads. The source code for GOMP_parallel from GCC-4.8 shows that the first function it calls is gomp_resolve_num_threads.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With