I read two posts on Stack Overflow, namely "Will the cublas kernel functions automatically be synchronized with the host?" and "CUDA Dynamic Parallelizm; stream synchronization from device". They recommend calling a synchronization API, e.g., cudaDeviceSynchronize(), after calls to cuBLAS functions. I'm not sure it makes sense to use such a general-purpose function.
Would it be better to do as follows? [Correct me if I'm wrong]:
cublasHandle_t cublas_handle;
cudaStream_t stream;
// Create the cuBLAS handle before using it
CUBLAS_CALL(cublasCreate(&cublas_handle));
// Initialize the matrices
CUBLAS_CALL(
  cublasDgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, 
    M, &alpha, d_A, M, d_B, M, &beta, d_C, M));
// cublasDgemm is non-blocking!
cublasGetStream(cublas_handle, &stream);
cudaStreamSynchronize(stream);
// Now it is safe to copy the result (d_C) from the device
// to the host and use it
On the other hand, cudaDeviceSynchronize may be preferable when many streams/handles are used to perform parallel cuBLAS operations. What are the "best practices" for the synchronization of cuBLAS handles? Can cuBLAS handles be thought of as wrappers around streams, in the sense that they serve the same purpose from the point of view of synchronization?
If you are using a single stream, it makes no difference whether you synchronize that one stream or call cudaDeviceSynchronize(). In terms of performance and effect, the two should be essentially the same. Note that when using events to time part of your code (e.g., a cuBLAS call), it's good practice to call cudaDeviceSynchronize() (or at least cudaEventSynchronize() on the stop event) before reading the elapsed time, so that the measurement is meaningful. In my experience, it doesn't impose any significant overhead and, besides, it's safer to time your kernels with it.
If your application uses multiple streams, then it makes sense to synchronize only on the stream whose results you need. I believe that this question will be helpful to you. You can also read the CUDA C Programming Guide, Section 3.2.5.5.
It's not clear from your example why you would need explicit synchronization at all.
CUDA operations issued to the same stream are serialized. If you launch a kernel, or a cublas call, and then follow that kernel or cublas call with a cudaMemcpy operation (or cublasGetVector/Matrix, etc.), the copy operation is guaranteed not to start until all previous CUDA activity issued to the same stream is complete.
The best practice for general cases is not to use explicit synchronization at all. Place activities which must be serially dependent in the same stream. Place activities which have no dependency on each other in separate streams.
There are many CUDA codes, using cuBLAS and otherwise, that don't use explicit synchronization at all. Your example has no particular need of it. Note that in the first answer you linked, talonmies said:
"you need to call a blocking API routine like a synchronous memory transfer or..."
In your example, that is exactly what you would do. You would call a memory transfer, either one issued to the same stream (e.g. cudaMemcpyAsync) or a default blocking transfer (like cudaMemcpy), and it would work just fine. No explicit sync is needed between the cuBLAS call and the copy.
You may wish to read the appropriate programming guide section.