 

Databricks - Failure Starting REPL

I am using a databricks cluster to run some ETLs.

During the night there is a peak in executions; no library is installed while they run.

Spark version is 3.2.0 and Scala version is 2.12, on Databricks Runtime 10.2.

During the execution peaks, a failure to start the Python REPL sometimes causes failures in notebooks that are vital to the process.

The error can be seen in the first image.

This error has been happening since last month. I have increased the max executors from 2 to 3, but the error still occurs some days. The cluster information can be seen in the second image. The peak generally runs 50 ETLs at the same time.

The failure happened between 10:05 and 10:15.

Worker information can be found in the third image.

Error

Memory and CPU info

Cluster Worker Info

Anthony Davies asked Oct 25 '25 18:10


1 Answer

Actually, the real issue was the Driver.

The max workers were set to 3, but sometimes, during the execution peak, the number of workers stayed at 2. Exploring the issue, I identified that the whole problem happened during command interpretation, not during task processing. So the driver struggled to process all the code and organize the tasks during this peak.

The easy solution was to increase the machine's memory and cores, but this would add roughly US$ 10,000 per year for the driver alone, plus US$ 10,000 per year per worker (the cluster runs 20 hours a day, every day). Because of the cost, I instead decided to develop something similar to a job pool, which limits the number of executions sent to the driver and spreads them out over time.
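A bounded job pool like this can be sketched in Python using `concurrent.futures`. This is a minimal illustration, not the author's actual code: `run_etl` is a hypothetical stand-in for the real notebook call (in Databricks that would typically be `dbutils.notebook.run`), and the concurrency limit and stagger interval are assumed values to tune.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical limit: how many notebooks the driver interprets at once.
MAX_CONCURRENT = 10


def run_etl(name):
    # Stand-in for the real call, e.g. dbutils.notebook.run(name, timeout).
    time.sleep(0.01)  # simulate work
    return f"{name}: done"


def run_pool(etl_names, max_concurrent=MAX_CONCURRENT, stagger_seconds=0.0):
    """Submit ETLs through a bounded thread pool so that at most
    `max_concurrent` run at the same time, optionally staggering
    submissions to spread load on the driver over time."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = []
        for name in etl_names:
            futures.append(pool.submit(run_etl, name))
            if stagger_seconds:
                time.sleep(stagger_seconds)  # spread submissions over time
        # Collect results in submission order; f.result() re-raises
        # any exception from the ETL so failures are not silent.
        return [f.result() for f in futures]


results = run_pool([f"etl_{i}" for i in range(50)], max_concurrent=10)
```

The key design point is that the pool caps concurrency at the submission side, so the driver never has to interpret all 50 notebooks at once, regardless of how many ETLs the nightly peak triggers.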

This solution solved the problem and saved up to US$ 40,000 over the next 365 days of operation.

Anthony Davies answered Oct 27 '25 06:10


