Clarification of Asynchronous Engine Count in Turing architecture

Tags: cuda, gpu

I am aware of the concurrent copy and kernel execution mechanism introduced back in Fermi and further enhanced in later architectures, as described in the CUDA C++ Best Practices Guide:

Current GPUs can simultaneously process asynchronous data transfers and execute kernels. GPUs with a single copy engine can perform one asynchronous data transfer and execute kernels whereas GPUs with two copy engines can simultaneously perform one asynchronous data transfer from the host to the device, one asynchronous data transfer from the device to the host, and execute kernels. The number of copy engines on a GPU is given by the asyncEngineCount field of the cudaDeviceProp structure, which is also listed in the output of the deviceQuery CUDA Sample.

When I run the deviceQuery sample from CUDA 10.0 on Turing GPUs (RTX 2080 Ti and RTX 2080 SUPER), it reports asyncEngineCount equal to 3.
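For reference, the same field can be queried directly through the runtime API instead of running deviceQuery; a minimal sketch:

```cuda
// Print asyncEngineCount for every visible GPU via cudaGetDeviceProperties.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d (%s): asyncEngineCount = %d\n",
               dev, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```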

As far as I understand, with 2 copy engines a kernel can execute concurrently alongside one H2D and one D2H copy (a total of 3 concurrent operations). So what is the function of the 3rd copy engine in Turing GPUs?
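The 3-way overlap I have in mind looks roughly like the sketch below (pinned host buffers, a dummy kernel, and one stream per operation; all names are illustrative):

```cuda
// Sketch of kernel execution overlapping one H2D and one D2H copy,
// which is what two copy engines already allow.
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    float *hIn, *hOut;   // pinned host memory is required for truly async copies
    cudaMallocHost(&hIn, bytes);
    cudaMallocHost(&hOut, bytes);

    float *dIn, *dOut, *dWork;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMalloc(&dWork, bytes);

    cudaStream_t sH2D, sD2H, sKernel;
    cudaStreamCreate(&sH2D);
    cudaStreamCreate(&sD2H);
    cudaStreamCreate(&sKernel);

    // Each copy direction is served by its own copy engine while the SMs run
    // the kernel, so all three operations below can be in flight at once.
    cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, sH2D);
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, sD2H);
    busyKernel<<<(n + 255) / 256, 256, 0, sKernel>>>(dWork, n);

    cudaDeviceSynchronize();
    return 0;
}
```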

asked Sep 05 '25 by sgarizvi

1 Answer

This question could be answered in a single word, if StackOverflow allowed that: NVLink.

For example, with two cards connected via NVLink, the extra copy engine per card allows you to perform bidirectional peer-to-peer copies over the NVLink at full bandwidth, in addition to full-bandwidth host<->device transfers.
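A minimal sketch of that pattern, assuming GPU 0 and GPU 1 are the NVLink-connected pair (device IDs, buffer sizes, and names are illustrative):

```cuda
// Sketch: bidirectional peer-to-peer copies between two NVLink-connected GPUs,
// issued alongside a pinned host->device transfer over PCIe.
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;  // 256 MiB per buffer (illustrative size)

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) return 1;  // no peer-to-peer path between GPU 0 and GPU 1

    float *d0_src, *d0_dst, *d0_h2d, *d1_src, *d1_dst, *h;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&d0_src, bytes);
    cudaMalloc(&d0_dst, bytes);
    cudaMalloc(&d0_h2d, bytes);
    cudaStream_t s0_p2p, s0_h2d;
    cudaStreamCreate(&s0_p2p);
    cudaStreamCreate(&s0_h2d);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&d1_src, bytes);
    cudaMalloc(&d1_dst, bytes);
    cudaStream_t s1_p2p;
    cudaStreamCreate(&s1_p2p);

    cudaMallocHost(&h, bytes);  // pinned host buffer for the host->device copy

    // Both NVLink directions plus a PCIe host->device copy can be in flight at
    // the same time; the third copy engine keeps the second NVLink direction
    // from serializing behind the other transfers.
    cudaSetDevice(0);
    cudaMemcpyPeerAsync(d1_dst, 1, d0_src, 0, bytes, s0_p2p);           // GPU0 -> GPU1
    cudaMemcpyAsync(d0_h2d, h, bytes, cudaMemcpyHostToDevice, s0_h2d);  // host -> GPU0
    cudaSetDevice(1);
    cudaMemcpyPeerAsync(d0_dst, 0, d1_src, 1, bytes, s1_p2p);           // GPU1 -> GPU0

    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();
    return 0;
}
```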

With more than two cards, not all links can be saturated at the same time with only three copy engines per card. However, with increasing link count it also becomes increasingly unlikely that all links would be used at the same time, as this scheme would quickly run out of host memory bandwidth.

answered Sep 08 '25 by tera