Inserting data into a Hive table with Spark SQL

Question

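I am reading some data from a JSON file, converting it to a string, and using that string to send the data to Hive.

The data arrives in Hive fine, but it ends up in the wrong columns. Here is a small example.

In Hive:

Table name = TestTable, Column1 = test1, Column2 = test2

My code: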

data = hiveContext.sql("select \"hej\" as test1, \"med\" as test2")
data.write.mode("append").saveAsTable("TestTable")

data = hiveContext.sql("select \"hej\" as test2, \"med\" as test1")
data.write.mode("append").saveAsTable("TestTable")

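This results in "hej" appearing twice in test1 and "med" appearing twice in test2, instead of each value appearing once in each column. The values always seem to be stored in the order they are written, rather than in the columns named with the 'as' keyword. Any ideas?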

- Luffen

1 Answer

- Samson Scharfrichter · Accepted Answer

The values always seem to be stored in the order they are written...

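You are right. Spark works like any SQL database here: the column names in the input dataset make no difference.
Since you did not explicitly map the output columns to the input columns, Spark has to assume the mapping is done by position.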
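
To illustrate the "like any SQL database" point without needing a cluster, here is a minimal sketch that uses Python's built-in sqlite3 module in place of Spark and Hive. That substitution is an assumption made purely so the example runs anywhere; the table and column names are taken from the question.

```python
import sqlite3

# In-memory database standing in for Hive (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestTable (test1 TEXT, test2 TEXT)")

# Aliases on the SELECT side are ignored: values are matched by position.
conn.execute("INSERT INTO TestTable SELECT 'hej' AS test1, 'med' AS test2")
conn.execute("INSERT INTO TestTable SELECT 'hej' AS test2, 'med' AS test1")

rows = conn.execute(
    "SELECT test1, test2 FROM TestTable ORDER BY rowid"
).fetchall()
print(rows)  # [('hej', 'med'), ('hej', 'med')]

# An explicit column list on the INSERT side is what changes the mapping.
conn.execute("INSERT INTO TestTable (test2, test1) SELECT 'hej', 'med'")
last = conn.execute(
    "SELECT test1, test2 FROM TestTable ORDER BY rowid DESC LIMIT 1"
).fetchone()
print(last)  # ('med', 'hej')
```

The first two inserts reproduce the behaviour described in the question; only the third, which names the target columns explicitly, swaps the values.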

Consider the following test case...

hiveContext.sql("create temporary table TestTable (RunId string, Test1 string, Test2 string)")
hiveContext.sql("insert into table TestTable select 'A', 'x1', 'y1'")
hiveContext.sql("insert into table TestTable (RunId, Test1, Test2) select 'B', 'x2' as Blurb, 'y2' as Test1")
hiveContext.sql("insert into table TestTable (RunId, Test2, Test1) select 'C', 'x3' as Blurb, 'y3' as Test1")
data = hiveContext.sql("select 'xxx' as Test1, 'yyy' as Test2")
data.registerTempTable("Dummy")
hiveContext.sql("insert into table TestTable(Test1, RunId, Test2) select Test1, 'D', Test2 from Dummy")
hiveContext.sql("insert into table TestTable select Test1, 'E', Test2 from Dummy")
hiveContext.sql("select * from TestTable").show(20)

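Disclaimer: I have not actually tested these commands, so there are probably some typos and syntax issues (especially since you did not mention your Hive and Spark versions), but you should get the point.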
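
If you would rather keep the DataFrame write from the question than switch to SQL inserts, one possible workaround is to reorder the DataFrame's columns to match the table before appending. The sketch below is untested against a real cluster; align_columns and append_by_name are hypothetical helper names, and it assumes a Spark version where the append is positional, as observed in the question.

```python
def align_columns(df_columns, table_columns):
    """Return the DataFrame's column names in the target table's order.

    Names are matched case-insensitively; a ValueError is raised when the
    two sets of names do not line up.
    """
    by_lower = {name.lower(): name for name in df_columns}
    wanted = [name.lower() for name in table_columns]
    if len(by_lower) != len(df_columns) or sorted(by_lower) != sorted(wanted):
        raise ValueError(
            "columns %r do not match table columns %r"
            % (list(df_columns), list(table_columns))
        )
    return [by_lower[name] for name in wanted]


def append_by_name(hive_context, df, table_name):
    """Hypothetical helper: reorder df to the table's layout, then append."""
    table_columns = hive_context.table(table_name).columns
    ordered = df.select(*align_columns(df.columns, table_columns))
    ordered.write.mode("append").saveAsTable(table_name)
```

With this in place, the second write from the question would become append_by_name(hiveContext, data, "TestTable"), so that test1 and test2 are put back in table order before the positional write happens.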