I'm trying to code an Asynchronous Actor Critic in PyTorch based on this repo: https://github.com/seungeunrho/minimalRL/blob/master/a3c.py but I'm changing the ActorCritic class to use the one I coded myself.
Basically I have a class A3C and a global instance of it, global_model, with shared memory, and I use torch.multiprocessing to start several Processes that train the model in parallel. At the beginning of each process I have to create a new instance of the model, local_model, in order to proceed with the training, but the process gets stuck in the initialization of the local model, even though the initialization of the global model works every time.
While debugging I can see that it enters A3C.__init__ and SharedActorCritic.__init__ too, but there it stops right after the checkpoint print. However, if I print any expression that contains list(critic_param_gen), magically everything works. I also noted that printing just critic_param_gen won't do.
Any idea why that is?
A similar thing also happens if I use local_model = copy.deepcopy(global_model) inside create_local_model, i.e. it only works if that print is present.
In pseudo-code:
import torch.multiprocessing as mp
import torch.nn as nn
import itertools as it

debug = True

class A3C(nn.Module):
    def __init__(self, model, n_features):
        ...
        self.AC_architecture = SharedActorCritic(model, n_features)

class SharedActorCritic(nn.Module):
    def __init__(self, model, n_features):
        super(SharedActorCritic, self).__init__()
        self.shared_architecture = model(n_features)  # inherits from nn.Module
        self.actor = SharedActor(n_features)  # inherits from nn.Module
        self.critic = SharedCritic(n_features)  # inherits from nn.Module
        self.critic_target = BaseCritic(model, n_features)  # inherits from nn.Module

        critic_param_gen = it.chain(self.shared_architecture.parameters(), self.critic.parameters())
        print("checkpoint")
        if debug: print(list(critic_param_gen))  # this makes the whole thing work
        for trg_params, params in zip(self.critic_target.parameters(), critic_param_gen):
            trg_params.data.copy_(params.data)

def create_local_model(model, n_features):
    local_model = A3C(model, n_features)
    print("Process ended")

# in the main
global_model = A3C(model, n_features)  # works
global_model.share_memory()  # doesn't really matter
p = mp.Process(target=create_local_model, args=(model, n_features,))
p.start()
print("Process started")
p.join()
----
# output if debug is True
Process started
checkpoint
[ ...actual list of critic_param_gen ... ]
Process ended
# output if debug is False
Process started
checkpoint
# and then runs forever
Edit: the mystery about the print statement is solved, thanks to snakecharmerb: list(critic_param_gen) exhausts the generator, so the subsequent copy loop over zip(...) never runs, and it is the parameter copy that hangs. I created a minimal reproducible example. It seems that if the network is large enough, the copy operation hangs when executed in a child process, but not outside of it (since the global model can be instantiated).
import torch.nn as nn
import torch.multiprocessing as mp
import copy
class Net(nn.Module):
    def __init__(self, n_features=256, n_layers=8):
        super(Net, self).__init__()
        self.net1 = nn.Sequential(*nn.ModuleList([nn.Linear(n_features, n_features) for _ in range(n_layers)]))
        self.net2 = nn.Sequential(*nn.ModuleList([nn.Linear(n_features, n_features) for _ in range(n_layers)]))
        for p1, p2 in zip(self.net1.parameters(), self.net2.parameters()):
            p1.data.copy_(p2.data)

    def forward(self, x):  # never called in this repro
        return self.net2(self.net1(x))
def create_local_model_v1(global_model):
    local_model = copy.deepcopy(global_model)
    print("Process ended")
%%time
global_model = Net(16,2)
print("Global model created")
p = mp.Process(target=create_local_model_v1, args=(global_model,))
p.start()
print("Process started")
p.join()
# Output
Global model created
Process ended
Process started
CPU times: user 3 ms, sys: 11.9 ms, total: 14.9 ms
Wall time: 45.1 ms
%%time
global_model = Net(256,8)
print("Global model created")
p = mp.Process(target=create_local_model_v1, args=(global_model,))
p.start()
print("Process started")
p.join()
# Output - Gets stuck
Global model created
Process started
TLDR: use torch.multiprocessing.spawn
I'm not quite skilled enough to determine the exact cause of this error or its solution, but the problem occurs at this point in torch/nn/parameter.py:
result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
This gets called during the deep copy. To investigate a little more, I put together a somewhat more detailed experiment to test which parameters and environments cause the hang. The gist of the results is that the overall size of the model is not the issue, but rather the number of features: for me, 256 features causes the hang, regardless of how many layers. Another, more curious observation is that when I remove the part of the initialization where the parameters from net1 get copied to net2, the hang disappears; and if nothing is sent to another process at all, everything works fine as well. Finally, when using the spawn function, everything works just fine until the number of layers exceeds 256.
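For concreteness, here is roughly how the worker can be launched with torch.multiprocessing.spawn instead of mp.Process — a minimal sketch meant to be run as a standalone script, assuming the Net class and the deepcopy worker from the repro above (not necessarily the exact experiment code I ran). spawn uses the 'spawn' start method and passes the process index as the first argument to the target:

import copy
import torch.multiprocessing as mp

def create_local_model_spawn(rank, global_model):
    # rank is the worker index that mp.spawn passes in automatically
    local_model = copy.deepcopy(global_model)
    print(f"Process {rank} ended")

if __name__ == "__main__":
    global_model = Net(256, 8)  # the Net class from the repro above
    print("Global model created")
    # nprocs=1 starts a single worker; spawn joins it before returning by default
    mp.spawn(create_local_model_spawn, args=(global_model,), nprocs=1)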
I need to caveat everything about the hang: as far as I can tell it is a deadlock, but it may just be some extremely slow process. That is highly unlikely, because it seems as though all activity stops; however, I couldn't confirm that it is a deadlock, because when I went for a backtrace of the C code during the hang, all I got was memory addresses (to really confirm everything I guess I would need to rebuild torch with some debugging options...). Anyway, I'm about 99% confident it's a deadlock, probably caused by something in multiprocessing somewhere. The reason my confidence is so high is that the code won't even react to signals. If everything were working as expected, I would expect the program to at least let me print a traceback from a signal handler, but nothing.
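For context, the signal test was along these lines — a plain Python-level handler that should print a stack trace when the process receives SIGUSR1 (a sketch, not the exact probe I used):

import os
import signal
import traceback

def dump_trace(signum, frame):
    # Print the Python stack of wherever the main thread currently is
    traceback.print_stack(frame)

signal.signal(signal.SIGUSR1, dump_trace)
print(f"send: kill -USR1 {os.getpid()}")
# ... run the hanging code here, then send the signal from another shell ...

A handler like this only runs once the interpreter gets back to executing Python bytecode, so the fact that it stays completely silent is consistent with the process being stuck inside native code rather than in a slow Python loop.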
I found the following blog post to be somewhat nice: The tragic tale of the deadlocking Python queue
Other than that, my opinion at this point is f*** combining torch and multiprocessing.
If anyone cares to see the code for the experiments I ran or the result, let me know.