CUDA: can __shfl delta be different between lanes?

Question

According to the CUDA documentation, __shfl() intrinsics permit the exchange of a variable between threads. More specifically __shfl_up() and __shfl_down() (also __shfl_xor()) allow the exchange of a variable of a different lane. unsigned int delta is the second argument to these and specifies (positively or negatively) the offset lane id of the lane to exchange the variable from. I had always assumed this delta must be constant for all lanes in a warp. So, for example, one has to call like thus

val += __shfl_down(val, 3);

or

for (i = 1; i < warpSize / 2; i ++)
  val += __shfl_down(val, i);

however, I've just realized there is nothing explicitly stating that delta must be the same for all lanes (as long as all lanes are participating). Therefore, is the following undefined or fine

val += __shfl_down(val, threadIdx.x % warpSize);

This is just for example, and many of the lanes will add nothing as delta "will not wrap around ... so effectively the upper delta lanes will remain unchanged.". There is also nothing specifying that delta must be >0, 0 must just return the same as for out-of-range delta?

Further to this, is the documentation not contradicting itself here?

Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.

So it is undefined for inactive threads, however ...

All __shfl() intrinsics return the 4-byte word referenced by var from the source lane ID as an unsigned integer. If the source lane ID is out of range or the source thread has exited, the calling thread's own var is returned.

... which means that for inactive threads it returns the calling thread's own var. Which further means, that in the my example above, if delta is out-of-bounds it adds itself, not nothing?

Robert Crovella · Accepted Answer

can __shfl delta be different between lanes?

Yes, the delta can be different between lanes. This could be used, for example, to broadcast a 32-bit quantity from a single lane to other lanes (although this is not the only way to perform such a broadcast):

#include <stdio.h>
__global__ void bcast() {
  int value = threadIdx.x;
  value = __shfl_up(value, threadIdx.x); // Get "value" from lane 0
  if (value) printf("Thread %u failed: %d
", threadIdx.x, value);
}

int main() {
  bcast<<< 1, 32 >>>();
  cudaDeviceSynchronize();
  return 0;
}

Regarding your question about delta values, (note that the function prototypes for __shfl_up and __shfl_down insist that delta is an unsigned int quantity, so it can't be negative), the question about a delta value of 0 is not quite clear to me. That would reference the origination lane, which by definition cannot be out-of-range or inactive. So, yes, a zero index value would return the origination lane value, just as an out-of-range index would.

Regarding your last question, an out-of bounds computed source lane ID is not the same thing as an in-bounds computed source lane ID that happens to reference an inactive thread. In the former case, the threads own shuffle variable is returned. In the latter case, the returned value is undefined.

Computation of the source lane ID for delta values greater than 31 may also have some additional complexity. Casually, such computations would appear to be always out-of-bounds (but the actual behavior may be somewhat more complicated). I'm not sure the documentation clearly touches on this, and I'm not sure this is the crux of any of your questions.

CUDA: can __shfl delta be different between lanes?

Tags:

c

cuda

dogAwakeCat

1 Answers

Robert Crovella

Recent Activity

Donate For Us

CUDA: can __shfl delta be different between lanes?

Tags:

c

cuda

dogAwakeCat

1 Answers

Robert Crovella

Related questions

Recent Activity

Donate For Us