
PyTorch Sharing CUDA tensors

I have a question about sharing GPU tensors between processes using the torch.multiprocessing module. Here is a minimal example:

import torch
import torch.multiprocessing as mp
from torch.nn import Embedding
import time
device = 'cuda' if torch.cuda.is_available() else 'cpu'

def worker(rank, tensor):
    if rank == 0:
        tensor.weight.data[rank,rank] = 99 #tensor is on gpu
    if rank == 1:
        time.sleep(1)
        print("worker", rank, tensor.weight) #worker1 sees modified shared memory from worker0
 

if __name__ == '__main__':
    tensor = Embedding(2, 5, sparse=False) #init with random values
    tensor = tensor.to(device) #send to gpu
    print("main", tensor.weight)
    processes = []
    for rank in range(2):  # number of workers
        p = mp.Process(target=worker, args=(rank, tensor))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

What happens is that the main process moves the tensor to the GPU and then passes it to its child processes as an argument. But the child gets

tensor([[0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.]], device='cuda:0', requires_grad=True)

instead of the actual tensor. The memory between the processes should be shared and the tensor is on the GPU, so why is it re-initialized with zeros? I've read the best practices and CUDA in multiprocessing docs and tried using an mp.Queue to send the tensor, but the result was the same.
I am on Windows 10 with torch 2.5.0+cu124 and Python 3.10.4.

1 Answer

TL;DR

It comes down to the OS you are using. Your default process start method is spawn (the default on Windows and macOS), which does not preserve the parent's memory state, and that is why the child receives a re-initialized tensor of zeros.

Full Answer

When you call start() on mp.Process to begin each worker, the process is most likely being spawned. According to the Python multiprocessing documentation, spawn starts a fresh Python interpreter process, so the child does not inherit non-essential memory from the parent, including the tensor you initialized.

If you are on macOS or Linux, you could alternatively use set_start_method() to make fork your start method. When you fork a process, ALL resources of the parent are inherited by the child process -- though forking can be problematic in larger multithreaded programs. This should address the re-initialized tensor, though I cannot test it myself, as I'm on a Windows system, which does not natively support forking.
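
Here is a minimal sketch of switching the start method (POSIX only, so I cannot verify it from here). One caveat: PyTorch's docs also warn that the CUDA runtime does not support the fork start method, so this sketch keeps the embedding on the CPU to illustrate the inherited memory state:

import torch.multiprocessing as mp
from torch.nn import Embedding

def worker(rank, module):
    # With fork, the child inherits the parent's memory, so it sees the
    # already-initialized (non-zero) weights instead of a fresh module.
    print("worker", rank, module.weight)

if __name__ == '__main__':
    mp.set_start_method('fork')  # POSIX only; not available on Windows
    module = Embedding(2, 5, sparse=False)  # kept on the CPU in this sketch
    processes = []
    for rank in range(2):
        p = mp.Process(target=worker, args=(rank, module))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()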

Other Workarounds, but not using a CUDA tensor

  1. Normally, when sharing torch tensors across multiple processes, the torch.Tensor.share_memory_() method can be called to move the tensor into shared memory (see the sketch after this list). In your case this won't work, because you are trying to share a CUDA tensor -- that can only be done directly through the CUDA API.

  2. PyTorch offers a Distributed RPC Framework which allows remote communication between distinct processes and lets them reference objects partitioned among them. Of interest to you would be the Remote Reference (RRef) API, which "serves as a distributed shared pointer to a local or remote object", more here. Unfortunately, CUDA support is still a beta feature of the RPC framework, and RRefs are not supported for CUDA tensors (a minimal CPU-only sketch appears at the end of this answer).
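
For workaround 1, here is a minimal CPU-only sketch along the lines of your script; module.share_memory() (the module-level wrapper around Tensor.share_memory_()) moves the weights into shared memory before the workers start, so both children see the same storage:

import time
import torch.multiprocessing as mp
from torch.nn import Embedding

def worker(rank, module):
    if rank == 0:
        module.weight.data[rank, rank] = 99  # write into the shared CPU storage
    if rank == 1:
        time.sleep(1)
        print("worker", rank, module.weight)  # sees worker0's write

if __name__ == '__main__':
    module = Embedding(2, 5, sparse=False)  # stays on the CPU
    module.share_memory()  # put the parameters into shared memory
    processes = []
    for rank in range(2):
        p = mp.Process(target=worker, args=(rank, module))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()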

These workarounds are safer and more universal alternatives to forking the workers (fork is supported on POSIX systems, not on Windows). They might be worth a try, but only if you keep your tensor on the CPU.
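
For workaround 2, a minimal CPU-only RRef sketch, assuming a Linux/macOS host (to my knowledge the RPC framework is not supported on Windows); the worker names, port, and tensor shape are just placeholders:

import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # Create a remote reference to a tensor that lives on worker1.
        rref = rpc.remote("worker1", torch.ones, args=(2, 5))
        # to_here() fetches a copy of the referenced tensor into this process.
        print("worker", rank, rref.to_here())
    rpc.shutdown()  # blocks until all workers are done

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)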
