As the TensorFlow paper states, TensorFlow's cross-device communication is achieved by adding "receive" and "send" nodes to the devices.
From my understanding, a device (please consider the case where only CPU devices are involved) is responsible for performing the computation of an operation. However, the data (e.g. a Tensor produced by an operation, or a Variable buffer) resides in memory. I don't know how the transfer of data from one device to another is physically achieved. My guess is that it is done through shared memory. Is that right?
I would appreciate any explanation, or pointers to the corresponding code, showing how the data transfer is achieved. PS: Figure 4 of the TensorFlow paper shows the cross-device communication mechanism.
In TensorFlow, cross-device communication is achieved using the Rendezvous interface, which has multiple different implementations, depending on the deployment. The comment on that interface describes the general idea:
// A Rendezvous is an abstraction for passing a Tensor
// from a producer to a consumer, where the consumer may safely
// request the Tensor before or after it has been produced.  A
// producer never blocks when using a Rendezvous.  A consumer has the
// choice of making a blocking call or providing a callback: in either
// case, the consumer receives the Tensor as soon as it is available.
As you noted in your question, TensorFlow represents communication in the dataflow graph using Send and Recv ops that are added to the graph automatically when the graph is partitioned across devices. For each edge that has a source and destination on different devices, the graph partitioner inserts a pair of Send and Recv ops that share the same "rendezvous key" (an automatically generated string name that is used as a key in the rendezvous' index of pending tensors to be communicated). The implementation of the Send op is simple: it calls Rendezvous::Send(), passing in its rendezvous key and single input tensor, then returns immediately without blocking. The implementation of the Recv op is slightly more complicated: it registers a callback to be called when the tensor with the given key becomes available. That callback is responsible for "producing" the output of the Recv op, and unblocking subsequent computation.
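To make the Send/Recv handshake concrete, here is a minimal Python sketch of the idea. It is not TensorFlow's code or API: `ToyRendezvous`, `send`, `recv_async`, and `recv` are hypothetical names, and the "tensor" is just an arbitrary Python object. It only illustrates the contract from the comment above: the producer never blocks, and the consumer may ask for the value before or after it has been produced.

```python
import threading
from typing import Any, Callable, Dict, List, Optional


class ToyRendezvous:
    """Hypothetical in-process stand-in for the Rendezvous idea."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Values that were sent but not yet requested, indexed by key.
        self._tensors: Dict[str, Any] = {}
        # Callbacks that were registered before the value was sent.
        self._waiters: Dict[str, Callable[[Any], None]] = {}

    def send(self, key: str, tensor: Any) -> None:
        """Producer side: hand over the value and return without waiting."""
        with self._lock:
            callback = self._waiters.pop(key, None)
            if callback is None:
                self._tensors[key] = tensor  # nobody is waiting yet
                return
        callback(tensor)  # run the consumer's callback outside the lock

    def recv_async(self, key: str, done: Callable[[Any], None]) -> None:
        """Consumer side: call `done` now if the value exists, else later."""
        with self._lock:
            if key not in self._tensors:
                self._waiters[key] = done
                return
            tensor = self._tensors.pop(key)
        done(tensor)

    def recv(self, key: str, timeout: Optional[float] = None) -> Any:
        """Blocking variant, built on top of the callback variant."""
        ready = threading.Event()
        box: List[Any] = []

        def done(tensor: Any) -> None:
            box.append(tensor)
            ready.set()

        self.recv_async(key, done)
        if not ready.wait(timeout):
            raise TimeoutError(key)
        return box[0]
```

In this sketch a consumer that calls `recv_async("edge_1", cb)` first gets `cb` invoked as soon as some producer calls `send("edge_1", value)`, and a consumer that arrives late gets the stored value immediately. Note that the object handed to `send` is the same object the consumer receives, with no copy, which is the sense in which a same-process transfer can amount to a plain assignment.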
The Rendezvous implementations perform the actual work of transferring the data:
IntraProcessRendezvous handles the transfer of data between devices in the same process. In the (unlikely) event that the transfer is between two CPU devices in the same process, the transfer can be achieved by a simple Tensor assignment. Otherwise, TensorFlow kicks off a device-specific DMA routine to transfer data between a CPU and GPU device.
The BaseRemoteRendezvous class and its subclasses handle cross-device communication in the case that the sender and receiver can be in different processes. The main implementation of this class is RpcRemoteRendezvous, which uses gRPC to handle the remote transfers.
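The choice between these cases can be summarized as a small decision function. This is only a sketch under assumptions, not how TensorFlow is actually structured: `parse_device` and `transfer_strategy` are hypothetical helpers, the strategy labels are made up for illustration, and it assumes fully specified device names of the form `/job:worker/replica:0/task:0/device:CPU:0`, treating two devices as being in the same process when their job, replica, and task fields match.

```python
from typing import Dict


def parse_device(name: str) -> Dict[str, str]:
    """Split '/job:worker/replica:0/task:1/device:GPU:0' into its fields."""
    fields: Dict[str, str] = {}
    for part in name.strip("/").split("/"):
        key, _, value = part.partition(":")
        fields[key] = value  # e.g. fields["device"] == "GPU:0"
    return fields


def transfer_strategy(src: str, dst: str) -> str:
    """Return a label for how a tensor would move from src to dst."""
    s, d = parse_device(src), parse_device(dst)
    same_process = all(s.get(k) == d.get(k) for k in ("job", "replica", "task"))
    if not same_process:
        return "rpc"  # remote rendezvous, e.g. over gRPC
    src_type = s["device"].split(":")[0]
    dst_type = d["device"].split(":")[0]
    if src_type == "CPU" and dst_type == "CPU":
        return "assign"  # same host memory: a tensor assignment suffices
    # Simplification: every other same-process pair is lumped into "dma".
    return "dma"
```

For example, two CPU devices in the same task map to `"assign"`, a CPU and a GPU in the same task map to `"dma"`, and devices in different tasks map to `"rpc"`.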