I am trying to run the following code on a PowerPC machine with this configuration:
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
Kernel: Linux 3.10.0-957.21.3.el7.ppc64le
Architecture: ppc64le
It is a single-node LocalCluster with 20 cores.
import os, subprocess
from timeit import default_timer as timer
from dask.distributed import Client, LocalCluster, fire_and_forget, as_completed

def run_client(n_workers):
    files = []
    for dirpaths, dirnames, filenames in os.walk('cap_logs/'):
        if not dirnames:
            files.extend([os.path.join(dirpaths, file) for file in filenames])

    def parser(file):
        val = subprocess.run(['./test.sh', file], stdout=subprocess.PIPE)
        return val.stdout.decode()

    cluster = LocalCluster(n_workers=n_workers, dashboard_address=None)
    with Client(cluster) as client:
        files = client.scatter(files)
        futures = client.map(parser, files)
        results = [future.result() for future in as_completed(futures)]
        del futures
    cluster.close()

workers = [20, 18, 16, 14, 12, 10, 8, 7, 6, 5, 4, 3, 2, 1]
times = {}
for n_workers in workers:
    tic = timer()
    run_client(n_workers)
    toc = timer()
    time = toc - tic
    times[n_workers] = round(time, 2)
It works fine when n_workers is small relative to the total number of cores, i.e. fewer than 15 of the 20, but as soon as I set n_workers above 15 it fails with the following error:
OSError: Timed out trying to connect to 'tcp://127.0.0.1:34487' after 10 s: connect() didn't finish in time
I'm surprised you're seeing timeouts like that with so few workers. But even so, you might want to try supplying a longer connect timeout in the distributed.comm.timeouts section of your dask config:
distributed:
  comm:
    timeouts:
      connect: 10s  # time before connecting fails
      tcp: 30s      # time before calling an unresponsive connection dead
The full default config can be found in the source code.
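If you'd rather not edit the YAML file, the same keys can be set programmatically before the cluster is created. A minimal sketch, assuming dask is importable; the 60s/120s values are only illustrative, more generous timeouts, not recommended defaults:

```python
import dask

# Raise the comm timeouts before creating the LocalCluster, so that
# worker/scheduler handshakes get more than the default 10 s to complete.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",   # illustrative value
    "distributed.comm.timeouts.tcp": "120s",      # illustrative value
})

# The new values are visible through dask.config.get:
print(dask.config.get("distributed.comm.timeouts.connect"))
```

Set this at the top of the script, before LocalCluster(...) is instantiated, so the scheduler and workers pick up the longer timeout.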