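I'm running a Spark cluster in standalone mode and launch the application with spark-submit. In the Stages section of the Spark UI I found a running stage with a very long execution time (> 10 h, when the usual time is ~30 sec). The stage has many failed tasks with the error Resubmitted (resubmitted due to lost executor). In the Aggregated Metrics by Executor section of the stage page there is an executor whose address is CANNOT FIND ADDRESS. Spark keeps trying to resubmit this task indefinitely. If I kill the stage (my application automatically reruns unfinished Spark jobs), everything continues to work well.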
I also found some strange entries in the Spark logs (from the same time the stage started executing).
Master:
16/11/19 19:04:32 INFO Master: Application app-20161109161724-0045 requests to kill executors: 0
16/11/19 19:04:36 INFO Master: Launching executor app-20161109161724-0045/1 on worker worker-20161108150133
16/11/19 19:05:03 WARN Master: Got status update for unknown executor app-20161109161724-0045/0
16/11/25 10:05:46 INFO Master: Application app-20161109161724-0045 requests to kill executors: 1
16/11/25 10:05:48 INFO Master: Launching executor app-20161109161724-0045/2 on worker worker-20161108150133
16/11/25 10:06:14 WARN Master: Got status update for unknown executor app-20161109161724-0045/1
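To make the pattern in the Master log easier to see, here is a minimal Python sketch that pairs each "requests to kill executors" entry with a later "Got status update for unknown executor" entry for the same executor. It is only a log-reading aid, not part of Spark or of my application: the function name pair_kills_with_unknown_updates and both regular expressions are illustrative assumptions, and it assumes that multiple executor IDs in one kill request would be comma-separated.

```python
import re

# Sample lines copied verbatim from the Master log above.
MASTER_LOG = """\
16/11/19 19:04:32 INFO Master: Application app-20161109161724-0045 requests to kill executors: 0
16/11/19 19:04:36 INFO Master: Launching executor app-20161109161724-0045/1 on worker worker-20161108150133
16/11/19 19:05:03 WARN Master: Got status update for unknown executor app-20161109161724-0045/0
16/11/25 10:05:46 INFO Master: Application app-20161109161724-0045 requests to kill executors: 1
16/11/25 10:05:48 INFO Master: Launching executor app-20161109161724-0045/2 on worker worker-20161108150133
16/11/25 10:06:14 WARN Master: Got status update for unknown executor app-20161109161724-0045/1
"""

# Hypothetical patterns written to match the two message shapes shown above.
KILL_RE = re.compile(r"Application (\S+) requests to kill executors: (.+)$")
UNKNOWN_RE = re.compile(r"Got status update for unknown executor (\S+)/(\d+)$")


def pair_kills_with_unknown_updates(log_text):
    """Return (app_id, executor_id) pairs that were killed on request
    and later showed up in an 'unknown executor' status update."""
    killed = set()
    matched = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        kill = KILL_RE.search(line)
        if kill:
            app_id = kill.group(1)
            # Assumption: several IDs in one request are comma-separated.
            for exec_id in kill.group(2).split(","):
                killed.add((app_id, exec_id.strip()))
            continue
        unknown = UNKNOWN_RE.search(line)
        if unknown:
            key = (unknown.group(1), unknown.group(2))
            if key in killed:
                matched.append(key)
    return matched


if __name__ == "__main__":
    for app_id, exec_id in pair_kills_with_unknown_updates(MASTER_LOG):
        print(f"{app_id}: executor {exec_id} was killed on request, "
              "then reported in an 'unknown executor' status update")
```

On the six Master lines above, both executor 0 and executor 1 follow this sequence: the application asks to kill the executor, a replacement is launched, and the status update for the killed executor arrives after the Master no longer knows about it.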
Worker:
16/11/25 10:06:05 INFO Worker: Asked to kill executor app-20161109161724-0045/1
16/11/25 10:06:08 INFO ExecutorRunner: Runner thread for executor app-20161109161724-0045/1 interrupted
16/11/25 10:06:08 INFO ExecutorRunner: Killing process!
16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137
16/11/25 10:06:14 INFO Worker: Asked to launch executor app-20161109161724-0045/2 for app.jar
16/11/25 10:06:17 INFO SecurityManager: Changing view acls to: spark
16/11/25 10:06:17 INFO SecurityManager: Changing modify acls to: spark
16/11/25 10:06:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
The worker, the master (logs above), and the driver all run on the same machine, so there are no network connectivity problems.
The Spark version is 1.6.1.