I sometimes see the following error message when running Spark jobs:
13/10/21 21:27:35 INFO cluster.ClusterTaskSetManager: Loss was due to spark.SparkException: File ./someJar.jar exists and does not match contents of ...
What does this mean? How do I diagnose and fix this?
After digging around in the logs I also found "no space left on device" exceptions. When I then ran df -h and df -i on every node, I found a partition that was full. Interestingly, this partition does not appear to be used for data; it was only storing jars temporarily. Its name was something like /var/run or /run.
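For reference, here is a minimal sketch of how you could check disk space and inode usage across all workers at once (this assumes the standard Spark EC2 layout with passwordless SSH and a /root/spark/conf/slaves file; adjust paths to your deployment):
for SLAVE in $(cat /root/spark/conf/slaves) ; do echo "=== $SLAVE ===" ; ssh $SLAVE 'df -h ; df -i' ; done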
The solution was to clean old files off that partition and to set up some automated cleaning. I think setting spark.cleaner.ttl to, say, one day (86400 seconds) should prevent it from happening again.
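For example, the TTL can be set cluster-wide or per job; the value is in seconds. This is a sketch assuming Spark 1.x with the usual conf/spark-defaults.conf location:
# In /root/spark/conf/spark-defaults.conf (cluster-wide, one day):
# spark.cleaner.ttl    86400
# Or per job on the command line:
spark-submit --conf spark.cleaner.ttl=86400 --class com.example.YourJob yourJob.jar
(The class and jar names above are placeholders, not anything from the original question.)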
Running on AWS EC2 I periodically encounter disk space issues, even after setting spark.cleaner.ttl to a few hours (we iterate quickly). I decided to solve them by moving the /root/spark/work directory onto the instance's mounted ephemeral disk (I'm using r3.large instances, which have a 32 GB ephemeral disk at /mnt):
readonly HOST=some-ec2-hostname-here
ssh -t root@$HOST spark/sbin/stop-all.sh
ssh -t root@$HOST "for SLAVE in \$(cat /root/spark/conf/slaves) ; do ssh \$SLAVE 'rm -rf /root/spark/work && mkdir /mnt/work && ln -s /mnt/work /root/spark/work' ; done"
ssh -t root@$HOST spark/sbin/start-all.sh
As far as I can tell, as of Spark 1.5 the work directory still does not use the mounted ephemeral storage by default. I haven't tinkered with the deployment settings enough to see whether this is even configurable.
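For what it's worth, a possible alternative to the symlink approach is to point the standalone worker at the ephemeral disk directly and let it clean up after itself. This is only a sketch based on the standalone-mode options documented for Spark 1.x (SPARK_WORKER_DIR and the spark.worker.cleanup.* properties); I haven't verified it against the EC2 scripts:
# In /root/spark/conf/spark-env.sh on every node, then restart the cluster:
export SPARK_WORKER_DIR=/mnt/work
# Optionally have the worker purge old application directories itself (TTL in seconds):
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=86400"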