Spark on YARN mode ends with "Exit status: -100. Diagnostics: Container released on a *lost* node"

25

I am trying to load 1 TB of data into a database on Spark on AWS, using the latest EMR release. The run takes very long and had not finished even after 6 hours; after running for 6 hours 30 minutes I got errors saying Container released on a lost node and the job failed. The logs are as follows:

16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node

I'm confident my network settings are fine, because I have already run this script on a smaller table.

Also, I know someone posted a question about the same problem six months ago: spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released, but it was never answered, so I still have to ask.


1
@clay Just my guess: a Spot instance is reclaimed when the price rises above your bid, and then the node is lost. So don't use Spot instances if you are running a long job. The workaround I found is to split the dataset into many small tasks, each running only about 5 minutes, save the reduce result of each to S3, then read those results back from S3 and do another reduce; that way I avoid long-running jobs. - John Zeng
I'm hitting this problem too :/ - Prayag
Similar issue here (though in my case it's a big self-join). I've been running into it for a while. The logs on the resource manager only say the container was lost, with no indication of the cause. Memory may be a factor. - Navneet
@ssedano Sorry... that instance was deleted long ago, and the log files are so large you wouldn't want to read them anyway. - John Zeng
Ran into the same problem. We also use spot instances, but I'm not sure that is the root cause, because we used a fairly high bid price and never lost any instances during the job run. - seiya
8 Answers

14

It looks like other people are hitting the same problem, so I'm writing an answer instead of a comment. I'm not sure it will solve the problem, but it should give you an idea.

If you are using spot instances, you should know that an instance is shut down whenever the price goes above your bid, and then you will hit this problem. This is true even if you only use spot instances as slave nodes. So my solution is not to use any spot instances for long-running jobs.

Another idea is to split the job into many independent steps and save the result of each step to S3 as a file. If anything goes wrong, you simply restart the failed step from the cached files.
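Below is a rough PySpark sketch of that idea (the paths, formats, and aggregations are hypothetical placeholders, not taken from the original job): each step writes its result to S3, so a failed run can resume from the last completed step instead of reprocessing the full 1 TB input.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stepwise-job").getOrCreate()

# Step 1: read the raw input, do the first reduce, and persist the result to S3.
raw = spark.read.parquet("s3://my-bucket/input/")            # hypothetical input path
step1 = raw.groupBy("some_key").count()                      # hypothetical aggregation
step1.write.mode("overwrite").parquet("s3://my-bucket/intermediate/step1/")

# Step 2 (possibly a separate spark-submit run): start from the saved intermediate
# result, so a failure here never forces a re-read of the original input.
step1 = spark.read.parquet("s3://my-bucket/intermediate/step1/")
final = step1.filter(step1["count"] > 100)                   # hypothetical second step
final.write.mode("overwrite").parquet("s3://my-bucket/output/")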


So based on your solution, the first option is to get dedicated CORE nodes instead of SPOT task nodes, and the second option is basically to split the job into multiple jobs and run them step by step manually? - thentangler

3

Is this with dynamic memory allocation? I ran into a similar problem and solved it by switching to static allocation, calculating the executor memory, executor cores, and number of executors myself. Try static allocation in Spark for large workloads.
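As a minimal sketch of that suggestion (the numbers are placeholders that must be sized to your instance types; the property names are standard Spark configuration keys, not something specific to this answer):

from pyspark.sql import SparkSession

# Static allocation: turn dynamic allocation off and size executors explicitly.
spark = (
    SparkSession.builder
    .appName("static-allocation-example")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "40")   # placeholder: total number of executors
    .config("spark.executor.memory", "18g")     # placeholder: memory per executor
    .config("spark.executor.cores", "4")        # placeholder: cores per executor
    .getOrCreate()
)

The same values can equivalently be passed to spark-submit via --num-executors, --executor-memory, and --executor-cores.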


Does unpersisting unused DataFrames in the code help in this case? - user3937422
You can give it a try. Are you on EMR or a Cloudera stack? Also check which YARN scheduler handles resource management, Fair or Capacity, and then try static memory allocation by passing the number of executors and so on, instead of dynamic allocation. - sri hari kali charan Tummala
I'm on EMR, but after adding unpersist I didn't see any difference with dynamic memory allocation. - user3937422
What I'm asking is that you turn dynamic memory allocation off and use static memory allocation, passing the number of executors, executor memory, and executor cores that you calculate yourself, instead of leaving it to Spark's dynamic allocation. - sri hari kali charan Tummala

2
Your YARN container was stopped. To debug what went wrong you have to read the YARN logs; you can use the official CLI yarn logs -applicationId, or feel free to use (and contribute to) my project https://github.com/ebuildy/yoga, a YARN viewer built as a web application. You should see plenty of worker errors.

2
I had the same problem. I found some clues in this DZone article:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr

Increasing the number of DataFrame partitions (in my case from 1,024 to 2,048) solved it, because it reduces the amount of memory needed per partition.
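For reference, a minimal sketch of that change (the input path and the target partition count of 2,048 are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

df = spark.read.parquet("s3://my-bucket/input/")   # hypothetical input path

# More, smaller partitions mean less data, and therefore less memory, per task.
df = df.repartition(2048)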



0

Amazon provided their solution: it is handled through resource allocation, and there is nothing to be done from the user's side.



0

In my case, we were using a GCP Dataproc cluster with 2 pre-emptible (the default) secondary worker nodes.

For short-running jobs this was not an issue, since both the primary and secondary workers finished quickly.

For long-running jobs, however, we observed that all the primary workers finished their assigned tasks quickly relative to the secondary workers.

Because of their pre-emptible nature, containers for tasks assigned to the secondary workers were lost after about 3 hours of running, which produced the container-lost error.

I would suggest not using secondary workers for any long-running job.


0
Check the CloudWatch metrics and the instance-state logs of the node that hosted the container: the node may have been marked unhealthy because of high disk utilization, or it may have had a hardware problem.
In the former case you should see a non-zero value for the "MR unhealthy nodes" metric in the AWS EMR UI; in the latter, a non-zero value for the "MR lost nodes" metric. Note that the disk-utilization threshold is configured in YARN via the "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" setting, which defaults to "90%". As with container logs, AWS EMR exports snapshots of the instance state to S3, containing plenty of useful information such as disk, CPU, and memory utilization and stack traces, so take a look at them. To find a node's EC2 instance ID, match the IP address from the container logs against the IDs in the AWS EMR UI.
aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/                                                 
                           PRE containers/
                           PRE node/
                           PRE steps/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/ 
                           PRE applications/                                                                                                                                                          
                           PRE daemons/                                                                                                                                                               
                           PRE provision-node/                                                                                                                                                        
                           PRE setup-devices/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/daemons/instance-state/
2023-09-24 13:13:33        748 console.log-2023-09-24-12-08.gz
2023-09-24 13:18:34      55742 instance-state.log-2023-09-24-12-15.gz
...
2023-09-24 17:33:58      60087 instance-state.log-2023-09-24-16-30.gz
2023-09-24 17:54:00      66614 instance-state.log-2023-09-24-16-45.gz
2023-09-24 18:09:01      60932 instance-state.log-2023-09-24-17-00.gz

cat /tmp/instance-state.log-2023-09-24-16-30.gz
...
# amount of disk free
df -h
Filesystem        Size  Used Avail Use% Mounted on
...
/dev/nvme0n1p1     10G  5.7G  4.4G  57% /
/dev/nvme0n1p128   10M  3.8M  6.2M  38% /boot/efi
/dev/nvme1n1p1    5.0G   83M  5.0G   2% /emr
/dev/nvme1n1p2    1.8T  1.7T  121G  94% /mnt
/dev/nvme2n1      1.8T  1.7T  120G  94% /mnt1
...

For more information, refer to the following resources.

