
Spark streaming job doesn't delete shuffle files

I have a Spark Streaming job that runs on EMR, reads messages from Kafka, and writes output to S3.

I use emr-5.17.0, i.e. Hadoop 2.8.4 and Spark 2.3.1.

The problem is that shuffle files accumulate in /mnt/yarn/usercache/hadoop/appcache/application_1540126328960_0001/ and never get deleted until I run out of disk space.

The files look like: shuffle_328_127_0.index, shuffle_328_134_0.data

I did try to update YARN's cleanup policy like so:

yarn.nodemanager.localizer.cache.cleanup.interval-ms = 300000
yarn.nodemanager.localizer.cache.target-size-mb = 5000

But it did not solve the problem.
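For reference, these two properties would normally go in yarn-site.xml (or EMR's yarn-site configuration classification); a minimal fragment with the values tried above:

```xml
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>300000</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>5000</value>
</property>
```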

As a workaround, I currently restart the job automatically every few hours; when the application stops, its cache is deleted.

What can I do to make YARN delete these cache files while the application is running?

Thanks

asked Jan 01 '26 by user2128732

1 Answer

I had an hourly cron job clean up files and directories older than 6 hours, which fixed the out-of-disk-space issue. I did not find a Spark/YARN parameter that would do this automatically. Here are the details.

Crontab entry:

0 * * * * /home/hadoop/clean_appcache.sh >/dev/null 2>&1
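The entry above goes in the hadoop user's crontab. If you would rather install it as a system-wide drop-in under /etc/cron.d, an equivalent sketch (it writes to a temp directory so it can run anywhere; on the cluster you would write to /etc/cron.d/clean_appcache instead — note that cron.d files require a user field):

```shell
#!/bin/bash
# Stand-in for /etc/cron.d so this sketch is runnable anywhere.
CRON_DIR=$(mktemp -d)
cat > "$CRON_DIR/clean_appcache" <<'EOF'
# m h dom mon dow user command
0 * * * * root /home/hadoop/clean_appcache.sh >/dev/null 2>&1
EOF
cat "$CRON_DIR/clean_appcache"
```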

clean_appcache.sh

#!/bin/bash

# Delete shuffle files older than 6 hours (360 minutes) first,
# then remove any directories the cleanup has left empty.
BASE_LOC=/mnt/yarn/usercache/hadoop/appcache
sudo find "$BASE_LOC"/ -type f -mmin +360 -exec rm {} \;
sudo find "$BASE_LOC"/ -type d -mmin +360 -empty -exec rmdir {} \;
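Before enabling the deletions, it can help to dry-run the find and confirm what would match; a self-contained sketch (a scratch directory and an artificially aged file stand in for the real appcache):

```shell
#!/bin/bash
# Dry run of the cleanup: print what would be deleted instead of deleting it.
BASE_LOC=$(mktemp -d)   # stand-in for /mnt/yarn/usercache/hadoop/appcache
mkdir -p "$BASE_LOC/application_0001"
touch "$BASE_LOC/application_0001/shuffle_328_127_0.index"
# Age the file past the 360-minute threshold (GNU touch)
touch -d '7 hours ago' "$BASE_LOC/application_0001/shuffle_328_127_0.index"
# Swap -print for -exec rm {} \; once the listing looks right
find "$BASE_LOC"/ -type f -mmin +360 -print
```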
answered Jan 05 '26 by user2677485


