Query HIVE table in pyspark

35

I am using CDH5.5.

I have created a table in the HIVE default database and am able to query it from the HIVE command line.

Output:

hive> use default;
OK
Time taken: 0.582 seconds

hive> show tables;
OK
bank
Time taken: 0.341 seconds, Fetched: 1 row(s)

hive> select count(*) from bank;
OK
542
Time taken: 64.961 seconds, Fetched: 1 row(s)

However, I am unable to query the table from pyspark, as it does not recognize the table.

from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

sqlContext.sql("use default")
DataFrame[result: string]

sqlContext.sql("show tables").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+

sqlContext.sql("FROM bank SELECT count(*)")

16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
      return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
    File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",   line 538, in __call__
    File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
      raise AnalysisException(s.split(': ', 1)[1])
  pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5

New error

>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
    return DataFrame(self._ssql_ctx.table(tableName), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Thanks

5 Answers

51

We cannot pass the Hive table name directly to the HiveContext sql method, since it does not understand the Hive table name. One way to read a Hive table in the pyspark shell is:

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table("default.bank")
bank.show()

To run SQL on the Hive table: first, we need to register the DataFrame read from the Hive table; then we can run the SQL query.

bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()

I have edited the answer to include the database name. It should work now. - bijay697
Hi Bijay697, I am getting the error org.apache.spark.sql.catalyst.analysis.NoSuchTableException. I have updated the error in the original post (under New error). Is any special configuration needed to access the HIVE metastore? - Chn
The error message means the table does not exist in Hive. You could try creating the table in a database other than the default database in Hive. Also, if you are submitting the job in cluster mode, you may need to pass the hive-site.xml file as well. - bijay697
What is 'sc' in your example above? - Sledge
@Sledge It is the SparkContext, one of the default variables in the session. - Jan Sila

15

SparkSQL ships with its own metastore (Derby), so it can work even when Hive is not installed on the system. This is the default mode.

In the question above, you created the table in Hive. You get the "table not found" error because SparkSQL is using its default metastore, which has no metadata for your Hive table.

If you want SparkSQL to use the Hive metastore instead and access Hive tables, you have to add hive-site.xml to the Spark conf folder.
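A quick way to check whether Spark is actually talking to the Hive metastore is to list the tables it can see (a minimal sketch, run from the same pyspark shell as in the question; with the default Derby metastore the list will not contain the Hive table, with hive-site.xml in place it should include bank):

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
# lists the tables the current metastore knows about in the "default" database
print(hive_context.tableNames("default"))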


What do you mean by "spark conf folder"? Is there anything in particular to watch out for when running pyspark inside a Zeppelin application? - Julian

4
The solution to my problem was to cp the hive-site.xml file into your $SPARK_HOME/conf directory and cp the mysql-connect-java-*.jar into your $SPARK_HOME/jars directory. That solved it for me.

This worked for me as well, but I need to know how to do it programmatically, rather than copying hive-site.xml into Spark's conf directory. - nayak0765

1
This is how I initialize sc so that it fetches the Hive table records, and not just the metadata.

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# run against YARN so the executors can reach the Hive data
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf=conf)

hive_context = HiveContext(sc)
data = hive_context.table("database_name.table_name")
data.registerTempTable("temp_table_name")
hive_context.sql("select * from temp_table_name limit 10").show()

-8

You can use sqlCtx.sql. The hive-site.xml should be copied to the Spark conf path.

my_dataframe = sqlCtx.sql("Select * from categories")
my_dataframe.show()
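Note that sqlCtx is not created in this snippet; it is presumably the HiveContext/SQLContext that the pyspark shell sets up for you. In a standalone script you would build it yourself, for example (a sketch along the lines of the first answer):

from pyspark.sql import HiveContext
# sc is the already-created SparkContext; "categories" is the example table from this answer
sqlCtx = HiveContext(sc)
my_dataframe = sqlCtx.sql("Select * from categories")
my_dataframe.show()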
