I am writing a CUDA kernel to copy one array to another; both are in GPU memory. I don't want to use cudaMemcpyDeviceToDevice because of its poor performance.
The naive kernel:
__global__ void GpuCopy(float* des, float* __restrict__ sour, const int M, const int N)
{
    // One thread per element; the guard handles the tail of the last block.
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    if (tx < N * M)
        des[tx] = sour[tx];
}
I don't think the naive kernel will get high performance, so I tried to use __shared__ memory, but it doesn't look right:
__shared__ float TILE[tile];
int tid = threadIdx.x;
for (int i = 0; i < M * N / tile; i++)
{
    TILE[tid] = sour[i * tile + tid];   // global -> shared
    des[i * tile + tid] = TILE[tid];    // shared -> global
}
The former snippet copies from global memory directly to des[], while the latter copies from global memory into __shared__ and then from __shared__ to des[]. I think the latter is slower than the former.
So, how should I write the __shared__ version of the copy? A second question: if I want to use __constant__ memory and the array (which is already on the GPU) is larger than constant memory, how do I copy it to another GPU array via __constant__?
For ordinary linear-to-linear memory copying, shared memory won't give you any benefit. Your naive kernel should be fine. There may be a small optimization available in launching a smaller number of threadblocks and having each thread copy multiple elements (a grid-stride loop), but tuning this depends to some degree on your specific GPU.
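A minimal sketch of that grid-stride variant (the kernel name, numSMs, and the launch parameters below are illustrative, not from your code, and would need tuning on your device):

// Grid-stride copy sketch: launch only enough blocks to fill the GPU and let
// each thread stride through the array.
__global__ void GpuCopyStride(float* des, const float* __restrict__ sour, const size_t n)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        des[i] = sour[i];
}

// Example launch -- tune blocks-per-SM for your GPU:
// int numSMs;
// cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
// GpuCopyStride<<<numSMs * 4, 256>>>(d_des, d_sour, (size_t)M * N);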
Shared memory can be used to good effect in kernels that do some kind of modified copying, such as a transpose operation. In these cases, the cost of the trip through shared memory is offset by the improved coalescing performance. But with your naive kernel, both reads and writes should coalesce.
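For illustration, here is the standard tiled-transpose pattern (not your copy task; TILE_DIM/BLOCK_ROWS and the bounds handling are one common choice):

#define TILE_DIM   32
#define BLOCK_ROWS 8

// Tiled transpose sketch: reads and writes are both coalesced because the
// transposition happens inside the shared-memory tile. The +1 padding avoids
// shared-memory bank conflicts. Launch with dim3 block(TILE_DIM, BLOCK_ROWS)
// and a grid covering the matrix in TILE_DIM x TILE_DIM tiles.
__global__ void transposeTiled(float* out, const float* in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Swap block indices so the output writes are also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}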
For a single large copy operation, cudaMemcpyDeviceToDevice should give very good performance, as the overhead of the single call is amortized over the entire data movement. Perhaps you should time the two approaches -- it's easy to do with nvprof. The discussions referenced in the comments refer to a specific use-case where matrix quadrants are being swapped. In that case, an NxN matrix requires ~1.5N cudaMemcpy operations, but is being compared to a single kernel call. In that case, the overhead of the API call setup will start to become a significant factor. However, when comparing a single cudaMemcpy operation to a single equivalent kernel call, the cudaMemcpy operation should be fast.
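A sketch of such a timing comparison using CUDA events (d_des and d_sour are assumed to be existing device allocations of M*N floats; nvprof or Nsight will report the same kernel and memcpy times):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Time a single device-to-device cudaMemcpy...
cudaEventRecord(start);
cudaMemcpy(d_des, d_sour, (size_t)M * N * sizeof(float), cudaMemcpyDeviceToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float msMemcpy = 0.0f;
cudaEventElapsedTime(&msMemcpy, start, stop);

// ...and the naive copy kernel from the question.
cudaEventRecord(start);
GpuCopy<<<(M * N + 255) / 256, 256>>>(d_des, d_sour, M, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float msKernel = 0.0f;
cudaEventElapsedTime(&msKernel, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);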
__constant__ memory cannot be modified by device code, so you will have to use host code based on cudaMemcpyToSymbol and cudaMemcpyFromSymbol.
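For example, a sketch of staging data through a __constant__ symbol from the host (the symbol name and sizes are illustrative; constant memory is limited to 64 KB, so an array larger than that would have to be handled in chunks):

// 16384 floats = 64 KB, the full size of constant memory.
__constant__ float c_buf[16384];

// Host code: copy from an existing device pointer d_sour into the __constant__
// symbol, then from the symbol into another device pointer d_des.
cudaMemcpyToSymbol(c_buf, d_sour, 16384 * sizeof(float), 0, cudaMemcpyDeviceToDevice);
cudaMemcpyFromSymbol(d_des, c_buf, 16384 * sizeof(float), 0, cudaMemcpyDeviceToDevice);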