Spark-Hadoop -> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist


I am getting an error when trying to read a file from HDFS into Spark. The README.md file exists in HDFS:

[spark@osboxes hadoop]$ hdfs dfs -ls README.md
16/02/26 00:29:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 spark supergroup       4811 2016-02-25 23:38 README.md

In the Spark shell, I typed:
scala> val readme = sc.textFile("hdfs://localhost:9000/README.md")
readme: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:27

scala> readme.count
16/02/26 00:25:26 DEBUG BlockManager: Getting local block broadcast_4
16/02/26 00:25:26 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(true, true, false, true, 1)
16/02/26 00:25:26 DEBUG BlockManager: Getting block broadcast_4 from memory
16/02/26 00:25:26 DEBUG HadoopRDD: Creating new JobConf and caching it for later re-use
16/02/26 00:25:26 DEBUG Client: The ping interval is 60000 ms.
16/02/26 00:25:26 DEBUG Client: Connecting to localhost/127.0.0.1:9000
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: starting, having connections 1
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark sending #4
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark got value #4
16/02/26 00:25:26 DEBUG ProtobufRpcEngine: Call: getFileInfo took 6ms
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/README.md
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
        at org.apache.spark.rdd.RDD.count(RDD.scala:1143)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
        at $iwC$$iwC$$iwC.<init>(<console>:43)
        at $iwC$$iwC.<init>(<console>:45)
        at $iwC.<init>(<console>:47)
        at <init>(<console>:49)
        at .<init>(<console>:53)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


scala> 16/02/26 00:25:36 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: closed
16/02/26 00:25:36 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: stopped, remaining connections 0

In core-site.xml, I have the following entry:
<configuration>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
</configuration>

and hdfs-site.xml has the following:
<configuration>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
</configuration>

Am I missing something here? My OS is CentOS Linux 7.2.1511 (Core), Hadoop is 2.7.2, and Spark is 1.6.0-bin-hadoop2.6.


After adding user/spark to the URI, I was able to access README.md in Spark from HDFS:

scala> val readme = sc.textFile("hdfs://localhost:9000/user/spark/README.md")
readme: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:27

scala> readme.count
res1: Long = 141

– Raxbangalore
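In other words, all three of the following should name the same file (a sketch, assuming the HDFS user is spark and fs.defaultFS is hdfs://localhost:9000 as configured above):

val a = sc.textFile("README.md")                                  // relative to the HDFS home directory
val b = sc.textFile("/user/spark/README.md")                      // absolute path on HDFS
val c = sc.textFile("hdfs://localhost:9000/user/spark/README.md") // fully qualified URI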
5 Answers

This is caused by the internal mapping between directories. First go to the directory where the file (README.md) is located and run the command df -k. You will get the actual mount point of that directory, e.g. /xyz. Now look for your file under that mount point, e.g. /xyz/home/omi/myDir/README.md, and use that path in your code:

val readme = sc.textFile("/xyz/home/omi/myDir/README.md")


By default, hdfs dfs -ls lists your user home directory on HDFS, not the root of HDFS. You can easily verify this by comparing the output of hdfs dfs -ls and hdfs dfs -ls /. When you use the full HDFS URL, you are using an absolute path, and it cannot find your file (because it lives in your user home directory). When you use a relative path, the problem goes away :)

You may also want to know that hdfs dfs -put uses your HDFS home directory as the default destination for files as well, rather than the root of HDFS.
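To see which directory relative paths resolve against, a quick check from the Spark shell (a sketch; the path in the comment is illustrative):

// ask Hadoop for the home directory of the current HDFS user
val fs = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
fs.getHomeDirectory // e.g. hdfs://localhost:9000/user/spark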


In my case, the README.md file was in the Spark folder (spark-2.4.3-bin-hadoop2.7) inside my home directory, so the full path was "/home/sdayneko/spark-2.4.3-bin-hadoop2.7/README.md". I put this path into the input variable:

val input = sc.textFile("/home/sdayneko/spark-2.4.3-bin-hadoop2.7/README.md")

After that, it worked :)
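Note that with fs.defaultFS pointing at HDFS, a bare path like the one above resolves against HDFS, not the local disk; a sketch of forcing the local filesystem explicitly via the file:// scheme:

// the file:// scheme bypasses fs.defaultFS and reads from the local filesystem
val input = sc.textFile("file:///home/sdayneko/spark-2.4.3-bin-hadoop2.7/README.md")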



I ran into this issue and found that a corrupted table can cause it.

show partitions myschema.mytable; results in:

partitionkey=abc
partitionkey=xyz

However, if you run an ls against the table folder on HDFS,

ls -ltr hdfs://servername/data/fid/work/hive/myschema/mytable
partitionkey=abc

you will see partition folders that do not match the partitions listed in the metastore.

When the table is read through Spark, this issue appears:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist

You need to drop the partition or run msck repair on the table to resolve it. Thanks and regards, Kamleshkumar Gujarathi
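A sketch of the two fixes mentioned above, issued from a Spark shell with Hive support (the table and partition names are the illustrative ones from this answer; whether MSCK is accepted depends on your Spark/Hive version):

// drop the mismatched partition from the metastore...
sqlContext.sql("ALTER TABLE myschema.mytable DROP IF EXISTS PARTITION (partitionkey='abc')")
// ...or re-sync the metastore with the partition folders actually on HDFS
sqlContext.sql("MSCK REPAIR TABLE myschema.mytable")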



You can try changing your command to the following and then running it:

val readme = sc.textFile("./README.md")

scala> val readme = sc.textFile("./README.md")
readme: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:27

scala> readme.count
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/spark/README.md
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
....

– Raxbangalore
