I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.
In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?
Our computer vision application requires a forward FFT on a bunch of small planes of size 256x256. I'm running the FFTs on on HOG features with a depth of 32, so I use the batch mode to do 32 FFTs per function call. Typically, I do about 8 FFT function calls of size 256x256 with a batch size of 32.
FFTW + OpenMP 
The following code executes in 16.0ms on an Intel i7-2600 8-core CPU.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent ot onembed=NULL
float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast<fftwf_complex*>(h_in); //in-place version
fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2, //rank
                                                 n, //dims -- this doesn't include zero-padding
                                                 depth, //howmany
                                                 h_in, //in
                                                 inembed, //inembed
                                                 depth, //istride
                                                 1, //idist
                                                 h_freq, //out
                                                 onembed, //onembed
                                                 depth, //ostride
                                                 1, //odist
                                                 FFTW_PATIENT /*flags*/);
double start = read_timer();
#pragma omp parallel for
for(int i=0; i<nIter; i++){
    fftwf_execute_dft_r2c(forwardPlan, h_in, h_freq);
}
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
cuFFT 
The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent ot onembed=NULL in FFTW
cufftHandle forwardPlan;
float* d_in; cufftComplex* d_freq;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
              2, //rank
              n, //dimensions = {nRows, nCols}
              inembed, //inembed
              depth, //istride
              1, //idist
              onembed, //onembed
              depth, //ostride
              1, //odist
              CUFFT_R2C, //cufftType
              depth /*batch*/));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
d_freq = reinterpret_cast<cufftComplex*>(d_in);
double start = read_timer();
for(int i=0; i<nIter; i++){
    CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
}
CHECK_CUDART(cudaDeviceSynchronize());
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
Other notes
cudaMemcpys between the CPU and GPU are not included in my computation time.nvvp profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.Question might be outdated, though here is a possible explanation (for the slowness of cuFFT).
When structuring your data for cufftPlanMany, the data arrangement is not very nice with the GPU. Indeed, using an istride and ostride of 32 means no data read is coalesced. See here for details on the read pattern
input[b * idist + (x * inembed[1] + y) * istride]
output[b * odist + (x * onembed[1] + y) * ostride]
in which case if i/ostride is 32, it will very unlikely be coalesced/optimal. (indeed b is the batch number). Here are the changes I applied:
    CHECK_CUFFT(cufftPlanMany(&forwardPlan,
              2, //rank
              n, //dimensions = {nRows, nCols}
              inembed, //inembed
              1,  // WAS: depth, //istride
              nRows*cols_padded, // WAS: 1, //idist
              onembed, //onembed
              1, // WAS: depth, //ostride
              nRows*cols_padded, // WAS:1, //odist
              CUFFT_R2C, //cufftType
              depth /*batch*/));
Running this, I entered a unspecified launch failure because of illegal memory access. You might want to change the memory allocation (cufftComplex is two floats, you need an x2 in your allocation size - looks like a typo).
// WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth)); 
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2)); 
When running it this way, I got a x8 performance improvement on my card.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With