I have been seeing below error message for quite some time now but could not figure out what leads to the failure.
Error:
concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)
On restart of dask cluster it runs successfully.
If running a dask-cloudprovider ECSCluster or FargateCluster the concurrent.futures._base.CancelledError can result from a long-running step in computation where there is no output (logging or otherwise) to the Client. In these cases, due to the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError error message is misleading, but if you look in the logs for the scheduler task itself it will record the idle timeout.
The solution is to set scheduler_timeout to a higher value, either via config or by passing directly to the ECSCluster/FargateCluster constructor.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With