I am trying to implement a global reduction kernel in PTX which uses shared memory for reduction within a thread block (like all the CUDA C examples out there). In CUDA C on has the possibility to define an variable length array in shared memory with
extern __shared__ float sdata[];
How can I get the equivalent in PTX ?
What doesn't seem appropriate is a fixed length array like
.shared .f32 sdata[ LENGTH ];
Since I want the kernel to be reusable for different input array lengths.
What I could do is define one variable
.shared .f32 sdata;
and use it as the base address of the array. In the hope that it is allocated at the beginning of shared memory. I could then access array element like
ld.shared.f32 %r4,[sdata + <offset>]
Also this looks a bit funny because sdata is defined as a float. But what it really is is the address of a float. In this sense the above line is indeed correct.
However I am not sure if this is guaranteed to be correct, say as long as the offset is not greater than the shared memory size specified at kernel launch.
The PTX manual doesn't explain variable length buffers in shared memory.
Anyone knows how to implement a variable length buffer in PTX?
This works. However it is not the perfect solution because it introduces an extern linkage variable.
.version 2.3
.target sm_20
.extern .shared .align 4 .b8 sdata[];
.entry func (.param .s32 param0,...)
{
//
// Base addresses
mov.u64 w2,sdata; // shared memory
ld.shared.s32 i9,[w2+0];
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With