Spark-shell error: No FileSystem for scheme: wasb


We have an HDInsight cluster running in Azure, but it does not allow spinning up an edge/gateway node at cluster creation time. So I created this edge/gateway node myself by installing:

echo 'deb http://private-repo-1.hortonworks.com/HDP/ubuntu14/2.x/updates/2.4.2.0 HDP main' >> /etc/apt/sources.list.d/HDP.list
echo 'deb http://private-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/ubuntu14 HDP-UTILS main'  >> /etc/apt/sources.list.d/HDP.list
echo 'deb [arch=amd64] https://apt-mo.trafficmanager.net/repos/azurecore/ trusty main' >> /etc/apt/sources.list.d/azure-public-trusty.list
gpg --keyserver pgp.mit.edu --recv-keys B9733A7A07513CAD
gpg -a --export 07513CAD | apt-key add -
gpg --keyserver pgp.mit.edu --recv-keys B02C46DF417A0893
gpg -a --export 417A0893 | apt-key add -
apt-get -y install openjdk-7-jdk
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
apt-get -y install hadoop hadoop-hdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl libhdfs0 liblzo2-2 liblzo2-dev hadoop-lzo phoenix hive hive-hcatalog tez mysql-connector-java* oozie oozie-client sqoop flume flume-agent spark spark-python spark-worker spark-yarn-shuffle

I then copied the following folders: /usr/lib/python2.7/dist-packages/hdinsight_common/ /usr/share/java/ /usr/lib/hdinsight-datalake/ /etc/spark/conf/ /etc/hadoop/conf/

But when I run spark-shell, I get the following error:

java.io.IOException: No FileSystem for scheme: wasb

Here is the full stack trace: https://gist.github.com/anonymous/ebb6c9d71865c9c8e125aadbbdd6a5bc

I'm not sure which package/jar file is missing here.

Does anyone know what I'm doing wrong?

Thanks


I'm looking for a solution to a similar problem. It may be answered here: https://dev59.com/uI7ea4cB1Zd3GeqPCoya - aaronsteers
2 Answers


Another way to set up Azure Storage (wasb and wasbs) in spark-shell is:

  1. Copy the azure-storage and hadoop-azure jars into the ./jars directory of the Spark installation.
  2. Run spark-shell with the parameter --jars [a comma-separated list of paths to those jars]. Example:

    
    $ bin/spark-shell --master "local[*]" --jars jars/hadoop-azure-2.7.0.jar,jars/azure-storage-2.0.0.jar
    
  3. Add the following lines to the Spark context's Hadoop configuration:

    
    sc.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    sc.hadoopConfiguration.set("fs.azure.account.key.my_account.blob.core.windows.net", "my_key")
    
  4. Run a simple query:

    
    sc.textFile("wasb://my_container@my_account_host/myfile.txt").count()
    
  5. Enjoy :)

With these settings you can easily configure a Spark application that passes these parameters to the 'hadoopConfiguration' of the current Spark context.
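
The same keys can also be supplied at launch time through Spark's spark.hadoop.* properties, which Spark copies into the Hadoop configuration, so no sc.hadoopConfiguration.set calls are needed in the shell. A minimal sketch, assuming the same placeholder jar versions, account name and key as above:

    # spark.hadoop.* entries are copied into the Hadoop configuration at startup
    bin/spark-shell --master "local[*]" \
      --jars jars/hadoop-azure-2.7.0.jar,jars/azure-storage-2.0.0.jar \
      --conf spark.hadoop.fs.azure=org.apache.hadoop.fs.azure.NativeAzureFileSystem \
      --conf spark.hadoop.fs.azure.account.key.my_account.blob.core.windows.net=my_key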


My bad. I have to stop using Mac Notes to save code snippets :) - NicolasKittsteiner
Yes, it's much better now :) This is also a very good solution, upvoted. - Philip P.
hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem") did not work in my case (Spark 2.3.1, Hadoop 2.7.3). I had to set hadoopConfiguration.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem") instead. Now it works. - noleto
@noleto Thank you for writing that comment! - akki

Haining (from Microsoft) wrote an excellent blog post on setting up WASB with an Apache Hadoop installation.
Here is a summary:
  1. Add hadoop-azure-*.jar and azure-storage-*.jar to the Hadoop classpath

    1.1 Find the jars in your local installation. On an HDInsight cluster they are in the /usr/hdp/current/hadoop-client folder.

    1.2 Update the HADOOP_CLASSPATH variable in hadoop-env.sh. Use the exact jar names, since the Java classpath doesn't support partial wildcards (see the sketch after the configuration below).

  2. Update core-site.xml

    <property>
        <name>fs.AbstractFileSystem.wasb.impl</name>
        <value>org.apache.hadoop.fs.azure.Wasb</value>
    </property>

    <property>
        <name>fs.azure.account.key.my_blob_account_name.blob.core.windows.net</name>
        <value>my_blob_account_key</value>
    </property>

    <!-- optionally set the default file system to a container -->
    <property>
        <name>fs.defaultFS</name>
        <value>wasb://my_container_name@my_blob_account_name.blob.core.windows.net</value>
    </property>
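
For step 1.2, a minimal sketch of the hadoop-env.sh change plus a quick check, assuming Hadoop 2.7.0 jars under /usr/hdp/current/hadoop-client (exact paths and versions depend on your installation):

    # hadoop-env.sh: append the Azure jars with their exact names (no wildcards)
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/hdp/current/hadoop-client/hadoop-azure-2.7.0.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.0.0.jar

    # verify that the wasb scheme now resolves (container and account are placeholders)
    hadoop fs -ls wasb://my_container_name@my_blob_account_name.blob.core.windows.net/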
    

For detailed steps, see: https://github.com/hning86/articles/blob/master/hadoopAndWasb.md


Thanks for the suggestion, but for this particular use case I cannot use a client deployed through cluster deployment. - roy
