I'm trying to learn CUDA and I'm having trouble wrapping my head around warps and the lanes within them. ("Lane" is the word I use for a thread inside a warp.) I can exchange data between lanes as follows:
int lane_id = threadIdx.x & 31;                 // index of this thread within its warp
if (lane_id != 0 && lane_id != 1) { return; }   // only lanes 0 and 1 participate
int val;
if (lane_id == 0) val = 42;
if (lane_id == 1) val = 123;
val = __shfl_sync(0b11, val, lane_id ^ 1);      // lanes 0 and 1 swap values
if (lane_id == 0) assert(val == 123);
if (lane_id == 1) assert(val == 42);
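For completeness, this is how I run the snippet; the kernel name and launch configuration are just my own scaffolding around the code above:

#include <cassert>

__global__ void exchange_kernel() {
    int lane_id = threadIdx.x & 31;
    if (lane_id != 0 && lane_id != 1) { return; }
    int val;
    if (lane_id == 0) val = 42;
    if (lane_id == 1) val = 123;
    val = __shfl_sync(0b11, val, lane_id ^ 1);  // swap values between lanes 0 and 1
    if (lane_id == 0) assert(val == 123);
    if (lane_id == 1) assert(val == 42);
}

int main() {
    exchange_kernel<<<1, 32>>>();  // one block containing exactly one full warp
    cudaDeviceSynchronize();
    return 0;
}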
I'm wondering whether the call to __shfl_sync has to be a single call, or whether it can also be split over two calls like this:
int lane_id = threadIdx.x & 31;
if (lane_id != 0 && lane_id != 1) { return; }
int val;
// Each lane now executes its own __shfl_sync in a different branch.
if (lane_id == 0) val = __shfl_sync(0b11, 42, 1);
if (lane_id == 1) val = __shfl_sync(0b11, 123, 0);
if (lane_id == 0) assert(val == 123);
if (lane_id == 1) assert(val == 42);
The answer is: it depends on the compute capability of the architecture. The PTX documentation describes it in detail:
shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.
[…]
For .target sm_6x or below, all threads in membermask must execute the same shfl.sync instruction in convergence, and only threads belonging to some membermask can be active when the shfl.sync instruction is executed. Otherwise, the behavior is undefined.
Therefore, for compute capability 7.x and above (i.e., every architecture with independent thread scheduling), it is fine if the shfl.sync instructions are issued from different branches. For older architectures, all threads in membermask must execute the same shfl.sync instruction, or must have exited.
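As a practical consequence, if the code also has to run on pre-Volta GPUs, the split version can be rewritten so that both lanes reach one and the same shfl.sync. One way is to select each lane's outgoing value first; this sketch is equivalent to the first snippet in the question:

int lane_id = threadIdx.x & 31;
if (lane_id != 0 && lane_id != 1) { return; }
// Pick the outgoing value per lane first, then let both lanes execute
// the same __shfl_sync instruction in convergence.
int val = (lane_id == 0) ? 42 : 123;
val = __shfl_sync(0b11, val, lane_id ^ 1);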