
EMR on EKS: Dynamic Allocation + FSx Lustre -- Executors with shuffle data won't terminate despite idle timeout

I'm having trouble getting dynamic allocation to terminate idle executors when using FSx Lustre for shuffle persistence on EMR 7.8 (Spark 3.5.4) on EKS. I'm trying this strategy to control cost under severe data skew: I don't mind if a couple of nodes run for hours as long as the rest of the fleet deprovisions.

Setup:

  • EMR on EKS with FSx Lustre mounted as persistent storage
  • Using KubernetesLocalDiskShuffleDataIO plugin for shuffle data recovery
  • Goal: Cost optimization by terminating executors during long tail operations

Issue:
Executors scale up fine and the FSx mount works, but idle executors (0 active tasks) are not terminated despite the 60s idle timeout; they just sit there consuming resources. The job runs successfully and shuffle data persists correctly in FSx. I previously had dynamic allocation working without FSx, but most of the executors held shuffle data and therefore never deprovisioned (although some did).

Questions:

  1. Is the KubernetesLocalDiskShuffleDataIO plugin preventing termination because it thinks shuffle data is still needed?
  2. Are my timeout settings too conservative? Should I be more aggressive?
  3. Any EMR-specific configurations that might override dynamic allocation behavior?

Has anyone successfully implemented dynamic allocation with persistent shuffle storage on EMR on EKS? What am I missing?

Configuration:

"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.shuffleTracking.enabled": "true", 
"spark.dynamicAllocation.minExecutors": "1",
"spark.dynamicAllocation.maxExecutors": "200",
"spark.dynamicAllocation.initialExecutors": "3",
"spark.dynamicAllocation.executorIdleTimeout": "60s",
"spark.dynamicAllocation.cachedExecutorIdleTimeout": "90s",
"spark.local.dir": "/data/spark-tmp",
"spark.shuffle.sort.io.plugin.class": "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "fsx-lustre-pvc",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false",
"spark.kubernetes.driver.ownPersistentVolumeClaim": "true", 
"spark.kubernetes.driver.waitToReusePersistentVolumeClaim": "true",
"spark.shuffle.file.buffer.size": "1m", 
"spark.sql.adaptive.localShuffleReader.enabled": "false", 
"spark.eventLog.enabled": "true",
"spark.sql.adaptive.enabled": "true",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",

Environment:
EMR 7.8.0, Spark 3.5.4, Kubernetes 1.32, FSx Lustre

asked Dec 07 '25 by metersk


1 Answer

I suspect it could be due to this setting:

"spark.dynamicAllocation.shuffleTracking.enabled": "true"

From the config documentation:

Enables shuffle file tracking for executors, which allows dynamic allocation without the need for an external shuffle service. This option will try to keep alive executors that are storing shuffle data for active jobs.

I think the intent of this option is to allow scaling executors up and down across phases of the query plan that need more or fewer resources. However, stages that shuffle may need to read shuffle data stored on executors, so it can be unsafe or less performant to remove those executors even when they are not actively computing during a shuffle operation (as you pointed out).

You should be fine to disable shuffleTracking, since the requirement for dynamic allocation is already satisfied by your custom ShuffleDataIO plugin (although the config docs mark that combination as experimental under spark.dynamicAllocation.enabled). In theory, if your ShuffleDriverComponents is working properly, it persists the shuffle data, making it safe to remove executors that are not actively computing during a high-skew shuffle stage.
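
If it helps, here is a minimal sketch of the suggested change, keeping the rest of your posted PVC and plugin settings as-is (the 60s timeout is just your existing value carried over; tune as needed):

"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.shuffleTracking.enabled": "false",
"spark.dynamicAllocation.executorIdleTimeout": "60s",
"spark.shuffle.sort.io.plugin.class": "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO",
"spark.kubernetes.driver.ownPersistentVolumeClaim": "true",
"spark.kubernetes.driver.waitToReusePersistentVolumeClaim": "true",

With shuffleTracking disabled, executor removal should be governed by executorIdleTimeout rather than by shuffle-file retention.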

The docs for dynamic resource allocation have more details which might help with further debugging.

answered Dec 08 '25 by jwong


