
AWS Machine Learning Blog

Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

In this post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). This component can quickly detect rare occurrences of Neuron device failures by tailing monitoring logs. It marks worker nodes with a defective Neuron device as unhealthy and promptly replaces them with new worker nodes. By accelerating issue detection and remediation, it increases the reliability of your ML training and reduces the time and cost wasted on hardware failures.

After the node problem detector has detected the error and the recovery agent has automatically marked the node as unhealthy, Amazon EKS cordons the node and evicts the pods running on it:

# Verify that scheduling is disabled on the node.
kubectl get node
NAME                                          STATUS                        ROLES    AGE    VERSION
ip-100-64-1-48.us-east-2.compute.internal     Ready                         <none>   156m   v1.29.0-eks-5e0fdde
ip-100-64-103-26.us-east-2.compute.internal   Ready                         <none>   94s    v1.29.0-eks-5e0fdde
ip-100-64-239-245.us-east-2.compute.internal  Ready                         <none>   154m   v1.29.0-eks-5e0fdde
ip-100-64-52-40.us-east-2.compute.internal    Ready                         <none>   156m   v1.29.0-eks-5e0fdde
ip-100-64-58-151.us-east-2.compute.internal   NotReady,SchedulingDisabled   <none>   27h    v1.29.0-eks-5e0fdde

You can open the CloudWatch console and verify the metric for NeuronHealthCheck. In this example, the NeuronHasError_DMA_ERROR metric has the value 1.

After replacement, you can see that a new worker node has been created:

# The node with age 28s is the new node.
kubectl get node
NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-65-77.us-east-2.compute.internal    Ready    <none>   28s   v1.29.0-eks-5e0fdde
ip-192-168-81-176.us-east-2.compute.internal   Ready    <none>   9d    v1.29.5-eks-5e0fdde
ip-192-168-91-218.us-east-2.compute.internal   Ready    <none>   9d    v1.29.0-eks-5e0fdde
ip-192-168-94-83.us-east-2.compute.internal    Ready    <none>   9d    v1.29.0-eks-5e0fdde

Let’s look at a real-world scenario in which you’re running a distributed training job using an MPI operator, as outlined in Llama-2 on Trainium, and an irrecoverable Neuron error occurs on one of the nodes. Without the plugin deployed, the training job becomes stuck, resulting in wasted time and computational cost. With the plugin deployed, the node problem detector proactively removes the problem node from the cluster. Because the training script saves checkpoints periodically, training can resume from the most recent checkpoint.

The following screenshot shows example logs from a distributed training job after training has started. (You can ignore loss=nan for now; it’s a known issue and will be removed. For immediate use, refer to the reduced_train_loss metric.) The following screenshot shows the checkpoint created at step 77. Training stopped after one of the nodes encountered a problem at step 86 (the error was injected manually for testing). After the faulty node was detected and replaced by the Neuron node problem detector and recovery plugin, training resumed at step 77, the last checkpoint.

Although Auto Scaling groups will stop unhealthy nodes, they may encounter issues that prevent replacement nodes from launching. In such cases, training jobs stall and require manual intervention. However, the stopped node does not incur further charges on the associated EC2 instance.
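If you prefer the AWS CLI to the CloudWatch console for this check, the health metrics can also be queried with a Metrics Insights expression through get-metric-data. The following is a minimal sketch, assuming the metrics are published to a CloudWatch namespace named NeuronHealthCheck (consistent with the alarm query shown later) and that AWS_REGION is set; adjust the time window and any dimensions to match how the node problem detector publishes the metric in your cluster.

# Hedged sketch: query the average of NeuronHasError_DMA_ERROR over the last hour.
# Assumes the NeuronHealthCheck namespace and GNU date; a non-zero result indicates
# that the DMA error condition was reported during the window.
aws cloudwatch get-metric-data \
  --region "$AWS_REGION" \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --metric-data-queries '[
    {
      "Id": "dmaError",
      "Expression": "SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck",
      "Period": 60
    }
  ]'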
If you want to take custom actions in addition to stopping instances, you can create CloudWatch alarms on the metrics NeuronHasError_DMA_ERROR, NeuronHasError_HANG_ON_COLLECTIVES, NeuronHasError_HBM_UNCORRECTABLE_ERROR, NeuronHasError_SRAM_UNCORRECTABLE_ERROR, and NeuronHasError_NC_UNCORRECTABLE_ERROR, and use a CloudWatch Metrics Insights query such as SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck to aggregate these values when evaluating the alarms. The following screenshots show an example.

Clean up

To clean up all the resources provisioned for this post, run the cleanup script:

# neuron-problem-detector-role-$CLUSTER_NAME
eksctl delete podidentityassociation \
  --service-account-name node-problem-detector \
  --namespace neuron-healthcheck-system \
  --cluster $CLUSTER_NAME \
  --region $AWS_REGION

# delete the EKS cluster
cd data-on-eks/ai-ml/trainium-inferentia
./cleanup.sh

Conclusion

In this post, we showed how the Neuron node problem detector and recovery DaemonSet for Amazon EKS works for EC2 instances powered by AWS Trainium and AWS Inferentia. If you’re running Neuron-based EC2 instances and using managed nodes or self-managed node groups, you can deploy the detector and recovery DaemonSet in your EKS cluster and benefit from improved reliability and fault tolerance of your machine learning training workloads in the event of node failure.

About the authors

Harish Rao is a senior solutions architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.

Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming, and various other sports, and immersing himself in music.

Geeta Gharpure is a senior software developer on the Annapurna ML engineering team. She is focused on running large-scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA, and enjoys listening to Audible in her free time.

Darren Lin is a Cloud Native Specialist Solutions Architect at AWS who focuses on domains such as Linux, Kubernetes, containers, observability, and open source technologies. In his spare time, he likes to work out and have fun with his family.

Published: 2024-07-25T17:39:39