Dask Distributed - Same persist data multiple clients

We are trying Dask Distributed to make some heavy computes and visualization for a frontend.

Currently we have one gunicorn worker that connects to an existing Dask Distributed cluster; the worker loads the data with read_csv and persists it into the cluster.

I've tried using pickle to save the futures from the persisted dataframe so other workers could reuse them, but it doesn't work.

We want to have multiple gunicorn workers, each with its own client connected to the same cluster and using the same data, but with more workers each one currently uploads a new copy of the dataframe.
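For reference, a minimal sketch of what each worker does today (the scheduler address and CSV path are placeholders, not the actual values from our setup):

from dask.distributed import Client
import dask.dataframe as dd

# Each gunicorn worker currently runs this independently,
# so every worker loads and persists its own copy of the data.
client = Client('scheduler-address:8786')   # placeholder address
df = dd.read_csv('data/*.csv')              # placeholder path
df = client.persist(df)                     # pins the partitions in cluster memory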

asked Dec 06 '25 by CValenzu

1 Answer

It sounds like you are looking for Dask's ability to publish datasets.

A convenient way to do this is to use the client.datasets mapping:

Client 1

from dask.distributed import Client
import dask.dataframe as dd

client = Client('...')
df = dd.read_csv(...)
client.datasets['my-data'] = df   # publish the dataframe under a shared name

Client 2..n

from dask.distributed import Client

client = Client('...')            # same scheduler
df = client.datasets['my-data']   # retrieve the published dataframe
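If the goal is for every gunicorn worker to share the same in-memory data rather than just the same graph, a reasonable pattern is to persist before publishing. The explicit publish_dataset / get_dataset calls below are equivalent to the client.datasets mapping; the scheduler address, path, and dataset name are placeholders:

from dask.distributed import Client
import dask.dataframe as dd

client = Client('scheduler-address:8786')   # placeholder address
df = dd.read_csv('data/*.csv').persist()    # hold the partitions in cluster memory
client.publish_dataset(my_data=df)          # same effect as client.datasets['my_data'] = df

# In any other gunicorn worker / client on the same scheduler:
client2 = Client('scheduler-address:8786')
shared = client2.get_dataset('my_data')     # hands back the already-persisted dataframe
# client2.unpublish_dataset('my_data')      # drop the reference when the data is no longer needed

Because the scheduler keeps a reference to published datasets, the persisted data stays alive on the cluster even if the publishing client disconnects, which is what lets independent gunicorn workers reuse it without re-uploading.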
answered Dec 11 '25 by MRocklin


