I am having trouble with memory leaks on Dask workers. Every time one of the workers reaches 80% of its memory limit, it stalls and does not compute anything any more:
Here you can see four panels: "Bytes stored", "Task Stream", "Progress" and "Task Processing".
The "Bytes stored" panel shows the amount of memory occupied (x-axis) by each of the workers (y-axis).
The "Task Stream" panel is a visualization of the threads (y-axis) and the runtime needed to process a task (x-axis). Note that every worker has two threads.
The "Task Processing" panel shows a visualization of the task distribution across workers. Dask balances the amount of work to do, i.e. it makes sure that workers always have similar amounts of tasks to process.
The "Progress" panel simply shows the processing stages and how much of the stages' tasks are already completed/in memory/waiting to be computed.
This is a simple top-like overview of the workers and their memory limits, etc.
As you can see, workers 1, 2 and 3 have low CPU usage (~5%) and store 6 GB of memory, i.e. they have hit the 80% memory threshold and do not accept any new tasks.
Setting lifetime="20 minutes", lifetime_restart=True helps, as it restarts the workers from time to time.
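For reference, roughly how these arguments can be passed when the workers are launched from Python with LocalCluster (the equivalent dask-worker CLI options are --lifetime, --lifetime-stagger and --lifetime-restart); the cluster sizing below is just a placeholder:

```python
# Rough sketch, not the exact setup from the question: the worker count,
# thread count and memory limit are placeholder values.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=2,
    memory_limit="8GB",            # per-worker memory limit (assumed value)
    lifetime="20 minutes",         # worker shuts itself down after this age
    lifetime_stagger="2 minutes",  # spread restarts out across the workers
    lifetime_restart=True,         # the nanny then starts a fresh worker
)
client = Client(cluster)
```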
However, when a worker reaches the memory limit very quickly, it just stalls for up to ~20 minutes until it gets restarted.
Is there some better way to restart workers earlier? I do not want to lower the lifetime too much, since long-running tasks might then not be able to finish.
IMHO, the best strategy would be the following:
The policy that you're looking for is described here: https://distributed.dask.org/en/latest/worker.html#memory-management
You can remove the 80% pause limit and make workers restart more quickly by changing the configuration. The relevant values are documented here: https://docs.dask.org/en/latest/configuration-reference.html#distributed.worker.memory.target
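For illustration, a sketch of how those thresholds could be adjusted from Python with dask.config.set before the cluster is created; the fractions below are examples, not recommendations, and the same keys can also be put into a distributed.yaml config file:

```python
# Sketch only: illustrative fractions of the per-worker memory limit.
# They must be in place before the workers start (set here before creating
# the cluster, or via ~/.config/dask/distributed.yaml).
import dask
from dask.distributed import Client, LocalCluster

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling data to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": False,     # disable the 80% "stop accepting tasks" pause
    "distributed.worker.memory.terminate": 0.85,  # nanny restarts the worker earlier than the 0.95 default
})

cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="8GB")
client = Client(cluster)
```

Lowering terminate is what makes a leaking worker restart sooner, and disabling (or raising) pause removes the stall at 80% that you are seeing.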