Spark Shell - __spark_libs__.zip不存在。

10
我是新手使用Spark,正在忙于启用启用HA的Spark集群配置。通过以下方式启动Spark shell进行测试:bash spark-shell --master yarn --deploy-mode client。但是我会收到以下错误信息(请参见下面的完整错误信息):file:/tmp/spark-126d2844-5b37-461b-98a4-3f3de5ece91b/__spark_libs__3045590511279655158.zip不存在。应用程序在yarn web app上标记为失败,并且没有启动任何容器。如果通过spark-shell --master local方式启动shell,则不会出现错误。我注意到只有在创建shell的节点的tmp文件夹中才会写入文件。如有需要,请提供更多信息。
16/11/30 21:08:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
16/11/30 21:08:49 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 
16/11/30 21:09:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e14_1480532715390_0001_02_000003 on host: slave2. Exit status: -1000. Diagnostics: File file:/tmp/spark-126d2844-5b37-461b-98a4-3f3de5ece91b/__spark_libs__3045590511279655158.zip does not exist 
java.io.FileNotFoundException: File file:/tmp/spark-126d2844-5b37-461b-98a4-3f3de5ece91b/__spark_libs__3045590511279655158.zip
does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

16/11/30 22:29:28 ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! 16/11/30 22:29:28 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalStateException: Spark context stopped while waiting for backend
        at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:584)
        at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:162)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:546)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
        at $line3.$read$$iw$$iw.<init>(<console>:15)
        at $line3.$read$$iw.<init>(<console>:31)
        at $line3.$read.<init>(<console>:33)
        at $line3.$read$.<init>(<console>:37)
        at $line3.$read$.<clinit>(<console>)
        at $line3.$eval$.$print$lzycompute(<console>:7)
        at $line3.$eval$.$print(<console>:6)
        at $line3.$eval.$print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
        at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
        at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
        at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
        at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
        at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
        at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
        at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
        at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
        at org.apache.spark.repl.Main$.doMain(Main.scala:68)
        at org.apache.spark.repl.Main$.main(Main.scala:51)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

yarn-site.xml

<configuration>
  <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-cluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>master:2181,slave1:2181,slave2:2181</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>true</value>
  </property>

  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>master:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>master:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>master:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>master:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>master:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>master:23141</value>
  </property>

  <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>slave1:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>slave1:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>slave1:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>slave1:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>slave1:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>slave1:23141</value>
  </property>

  <property>
    <description>Address where the localizer IPC is.</description>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:23344</value>
  </property>
  <property>
    <description>NM Webapp address.</description>
    <name>yarn.nodemanager.webapp.address</name>
    <value>0.0.0.0:23999</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/tmp/pseudo-dist/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/tmp/pseudo-dist/yarn/log</value>
  </property>
  <property>
    <name>mapreduce.shuffle.port</name>
    <value>23080</value>
  </property>
  <property>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>true</value>
  </property>
</configuration>

所有数据节点上的 /tmp 路径是否具有 rwx 权限? - Nirmal Ram
我刚刚确认,所有节点上的 /tmp 目录都具有 rwx 权限。您认为这会导致文件写入问题还是无法检索文件? - Ryan Nel
可能是这样。尝试直接在 /tmp/spark-126d2844-5b37-461b-98a4-3f3de5ece91b/ 路径下创建文件。是否显示任何权限问题? - Nirmal Ram
@NirmalRam 我尝试了https://dev59.com/h-k6XIcBkEYKwwoYBvfx中建议的修复方法。我已经在我的问题中放置了HADOOP_CONF_DIR。我相信它被正确设置了。我也按照建议创建了目录,没有任何问题。 - Ryan Nel
请查看您的Yarn应用程序日志。是否有任何错误? - facha
显示剩余2条评论
4个回答

6

这个错误是由core-site.xml文件中的配置引起的。

请注意,要找到此文件,必须设置HADOOP_CONF_DIR环境变量。

在我的情况下,我将HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop/添加到./conf/spark-env.sh中。

参见:Spark Job running on Yarn Cluster java.io.FileNotFoundException: File does not exits , eventhough the file exits on the master node

core-site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property> 
</configuration>

如果无法访问此端点,或者Spark检测到文件系统与当前系统相同,则lib文件将不会分发到群集中的其他节点,从而导致上述错误。在我的情况下,我所在的节点无法访问指定主机上的9000端口。
调试
将日志级别提高到info。您可以通过以下方式实现:
1.将./conf/log4j.properties.template复制到./conf/log4j.properties文件中 2.在文件中设置log4j.logger.org.apache.spark.repl.Main = INFO
像往常一样启动您的Spark Shell。如果您的问题与我的相同,您应该会看到诸如“INFO Client:Source and destination file systems are the same. Not copying file:/tmp/spark-c1a6cdcd-d348-4253-8755-5086a8931e75/__spark_libs__1391186608525933727.zip”这样的信息消息。
这应该会引导您找到问题,并开始由缺少文件引起的连锁反应。

我在AWS上出现了错误,所以我无法管理云服务中的工作文件。 - Jérémy

1
你必须将配置设置为master("local[*]"),从spark session中移除即可。

0
在我的情况下,Spark用户帐户无法读取/递归到HADOOP_HOME,因此无法读取core-site.xml。
spark@ubuntu$ ls -lrt /opt/hadoop/
ls: cannot open directory '/opt/hadoop/': Permission denied    <--- Cannot read the directory

spark@ubuntu$ ls -lrt /opt
total 20
drwxrwx--- 3 hadoop  1003 4096 Jun 18 20:38 hadoop             <---- Invalid group
drwxr-xr-x 3 spark  spark 4096 Jun 19 04:24 spark

建议运行ls -la $HADOOP_CONF_DIR以确保提交Spark作业的帐户可以读取core-site.xml。


0

我在您的日志中没有看到任何错误,只有警告,您可以通过添加环境变量来避免这些警告:

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

针对异常情况:尝试手动设置Yarn的Spark配置: http://badrit.com/blog/2015/2/29/running-spark-on-yarn#.WD_e66IrJsM

hdfs dfs -mkdir -p  /user/spark/share/lib<br>
hdfs dfs -put $SPARK_HOME/assembly/lib/spark-assembly_*.jar        /user/spark/share/lib/spark-assembly.jar<br>
export SPARK_JAR=hdfs://your-server:port/user/spark/share/lib/spark-assembly.jar

希望这可以帮到你。

我已经添加了产生的错误,我认为这是由于容器故障引起的。 - Ryan Nel

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接