I have a Spark Streaming job that runs on EMR, reads messages from Kafka, and writes its output to S3.
I use emr-5.17.0, i.e. Hadoop 2.8.4 and Spark 2.3.1.
The problem is that shuffle files keep accumulating in /mnt/yarn/usercache/hadoop/appcache/application_1540126328960_0001/
and never get deleted, until I eventually run out of disk space.
The files look like: shuffle_328_127_0.index, shuffle_328_134_0.data
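To see how quickly the space fills up, I check the per-application usage with plain shell tools (nothing EMR-specific, just the path from above):

# Per-application cache usage, largest last
sudo du -sh /mnt/yarn/usercache/hadoop/appcache/* | sort -h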
I did try to update YARN's cleanup policy like so:

yarn.nodemanager.localizer.cache.cleanup.interval-ms 300000
yarn.nodemanager.localizer.cache.target-size-mb 5000

but it did not solve the problem.
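One way to confirm the settings actually reached the nodes (the path below is the usual EMR config location; adjust if yours differs):

# Check whether the localizer cache settings are present on a core node
grep -A1 'localizer.cache' /etc/hadoop/conf/yarn-site.xml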
Currently I work around it by automatically restarting the job every few hours; when the application is stopped, its cache gets deleted.
What can I do to make YARN delete the cache files while the job is still running?
Thanks
To fix the disk-out-of-space issue, I set up an hourly cron job that cleans up files and directories older than 6 hours. I did not find a parameter in Spark/YARN that would do this automatically. Here are the details.
The crontab entry:
0 * * * * /home/hadoop/clean_appcache.sh >/dev/null 2>&1
clean_appcache.sh:
#!/bin/bash
BASE_LOC=/mnt/yarn/usercache/hadoop/appcache
# Delete files older than 6 hours first, then remove any directories left empty.
sudo find "$BASE_LOC" -mindepth 1 -type f -mmin +360 -delete
sudo find "$BASE_LOC" -mindepth 1 -type d -mmin +360 -empty -delete
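Before enabling the cron job, it is worth doing a dry run that only prints what the cleanup would touch (same base path as above); this is just a sanity check, not part of the fix:

# Dry run: list everything older than 6 hours, without deleting anything
sudo find /mnt/yarn/usercache/hadoop/appcache -mindepth 1 -mmin +360 -print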