I have a Spark Streaming job that runs on EMR, reads messages from Kafka, and writes its output to S3.
I use emr-5.17.0, i.e. Hadoop 2.8.4 and Spark 2.3.1.
The problem is that shuffle files keep accumulating in /mnt/yarn/usercache/hadoop/appcache/application_1540126328960_0001/
and never get deleted, until I eventually run out of disk space.
The files look like: shuffle_328_127_0.index, shuffle_328_134_0.data
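To see how quickly the space fills up, I check the per-application usage with plain shell tools (nothing EMR-specific, just the path from above):

# Per-application cache usage, largest last
sudo du -sh /mnt/yarn/usercache/hadoop/appcache/* | sort -h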
I did try to update YARN's cleanup policy like so:

yarn.nodemanager.localizer.cache.cleanup.interval-ms 300000
yarn.nodemanager.localizer.cache.target-size-mb 5000

but it did not solve the problem.
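One way to confirm the settings actually reached the nodes (the path below is the usual EMR config location; adjust if yours differs):

# Check whether the localizer cache settings are present on a core node
grep -A1 'localizer.cache' /etc/hadoop/conf/yarn-site.xml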
Currently I work around it by automatically restarting the job every few hours; when the application is stopped, its cache gets deleted.
What can I do to make YARN delete the cache files while the job is still running?
Thanks
To fix the disk-out-of-space issue, I set up an hourly cron job that cleans up files and directories older than 6 hours. I did not find a parameter in Spark/YARN that would do this automatically. Here are the details.
The crontab entry:
0 * * * * /home/hadoop/clean_appcache.sh >/dev/null 2>&1
clean_appcache.sh:
#!/bin/bash
BASE_LOC=/mnt/yarn/usercache/hadoop/appcache
# Delete files older than 6 hours first, then remove any directories left empty.
sudo find "$BASE_LOC" -mindepth 1 -type f -mmin +360 -delete
sudo find "$BASE_LOC" -mindepth 1 -type d -mmin +360 -empty -delete
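Before enabling the cron job, it is worth doing a dry run that only prints what the cleanup would touch (same base path as above); this is just a sanity check, not part of the fix:

# Dry run: list everything older than 6 hours, without deleting anything
sudo find /mnt/yarn/usercache/hadoop/appcache -mindepth 1 -mmin +360 -print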