I have got problem with shuffle instruction in CUDA 5.0.
This is snippet of my kernel. It is inside the loop. Print is there only for debug purpose because I can't use ordinary debugger:
...
tex_val = tex2D(srcTexRef, threadIdx.x + w, y_pos);
if (threadIdx.x == 0)
{
    left = left_value[y_pos];
}
else
{
    printf("thread %d; shfl value: %f \n", threadIdx.x, __shfl_up(value, 1));
    left = __shfl_up(value, 1);
}
printf("thread %d; value: %f; tex_val: %f; left: %f \n", threadIdx.x, value, tex_val, left);
...
From that I get this output:
l0:  ITERATION 1
l1:  thread 0; value: 0; tex_val: 1; left: 4
l2: 
l3:  ITERATION 2
l4:  thread 1; shfl value: 0
l5:  thread 0; value: 5; tex_val: 1; left: 5
l6:  thread 1; value: 0; tex_val: 1; left: 0
l7: 
l8:  ITERATION 3
l9:  thread 1; shfl value: 0
l10: thread 2; shfl value: 1
l11: thread 0; value: 6; tex_val: 1; left: 6
l12: thread 1; value: 1; tex_val: 1; left: 0
l13: thread 2; value: 2; tex_val: 1; left: 1
...
From the output I can see that thread 1 doesn't get value from thread 0 in any iteration even though I can clearly see that it has value (line 4 - shfl value is 0; line 5 - value is 5). Thread 2 and higher can get value from lower thread. Where am I making mistake? Is it happening because of the branching?
Yes, it's because of the branching. Quoting from the CUDA programming guide B.14.2:
The
__shfl()intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp, ...
and
Threads may only read data from another thread which is actively participating in the
__shfl()command. If the target thread is inactive, the retrieved value is undefined.
In a branch, active threads are those taking the same path of execution, while those taking different ones are inactive. In your case, thread 0 is inactive, so you cannot shuffle from it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With