How do I troubleshoot and recover a Lost Node in my long running EMR cluster?
The node stopped reporting a few days ago. The host seems to be fine and HDFS too. I noticed the issue only from the Hadoop Applications UI.
EMR cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node. The master node manages the cluster and coordinates the distribution of data and tasks among other nodes for processing.
You can't restart a terminated cluster, but you can clone a terminated cluster to reuse its configuration for a new cluster. For more information, see Cloning a cluster using the console.
Limitations of an EMR cluster with multiple master nodes: If any two master nodes fail simultaneously, Amazon EMR cannot recover the cluster. Amazon EMR clusters with multiple master nodes are not tolerant to Availability Zone failures.
View cluster status using the AWS CLI You can use the describe-cluster command to view cluster-level details including status, hardware and software configuration, VPC settings, bootstrap actions, instance groups, and so on. For more information about cluster states, see Understanding the cluster lifecycle.
EMR nodes are ephemeral and you cannot recover them once they are marked as LOST. You can avoid this in first place by enabling 'Termination Protection' feature during a cluster launch.
Regarding finding reason for LOST node, you can probably check YARN ResourceManager logs and/or Instance controller logs of your cluster to find out more about root cause.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With