How do I export a Hive table into a CSV file?

70

I used this Hive query to export a table into a CSV file.

INSERT OVERWRITE DIRECTORY '/user/data/output/test' select column1, column2 from table1;

The generated file '000000_0' does not have a comma delimiter.

Is this the right way to generate a CSV file? If not, please let me know how I can generate one.


Another question: when I save a large Hive table as several blocks on HDFS, I sometimes find that their schemas end up different, i.e. the column types change. How can I prevent this from happening? - Song
17 Answers

1

Other answers have already shown how to change the default delimiter.

There are also approaches that use a bash script to convert the raw output to CSV. However, there are three delimiters to consider, not just \001, and things get more complicated when your Hive table contains maps.

I wrote a bash script that handles all three of Hive's default delimiters (\001, \002 and \003) and outputs a CSV. The script and more details are available here:

Hive Default Delimiters to CSV

Hive's default delimiters are

Field Delimiter => Control-A ('\001')
Collection Item Delimiter => Control-B ('\002')
Map Key Delimiter => Control-C ('\003')

There are ways to change these delimiters when exporting tables but sometimes you might still get stuck needing to convert this to csv.
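
If you are not sure which of these control characters actually appear in your export, a minimal check (using the HDFS output path and file name from the question; adjust them to your own export) is:

# cat -v renders \001, \002 and \003 visibly as ^A, ^B and ^C
hdfs dfs -cat /user/data/output/test/000000_0 | head -n 5 | cat -v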

Here's a quick bash script that can handle a DB export that's segmented in multiple files and has the default delimiters. It will output a single CSV file.

It is assumed that the segments all have the naming convention 000*_0

INDIRECTORY="path/to/input/directory"
for f in "$INDIRECTORY"/000*_0; do
  echo "Processing $f file.."
  # cat -v renders the control characters as ^A, ^B and ^C so sed can match them as plain text
  cat -v "$f" |
      LC_ALL=C sed -e "s/^/\"/g" |                        # open a quote at the start of each row
      LC_ALL=C sed -e "s/\^A/\",\"/g" |                   # field delimiter \001 -> ","
      LC_ALL=C sed -e "s/\^C\^B/\"\":\"\"\"\",\"\"/g" |   # \003\002 pair handled before the single delimiters
      LC_ALL=C sed -e "s/\^B/\"\",\"\"/g" |               # collection item delimiter \002 -> "",""
      LC_ALL=C sed -e "s/\^C/\"\":\"\"/g" |               # map key delimiter \003 -> "":""
      LC_ALL=C sed -e "s/$/\"/g" > "$f-temp"              # close the quote at the end of each row
done
echo "you,can,echo,your,header,here,if,you,like" > "$INDIRECTORY/final_output.csv"
cat "$INDIRECTORY"/*-temp >> "$INDIRECTORY/final_output.csv"
rm "$INDIRECTORY"/*-temp

More explanation is available in the gist.
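
A hedged usage sketch tying this back to the question's export (the local staging directory name is an assumption):

# copy the HDFS export from the question down to a local staging directory
hdfs dfs -get /user/data/output/test ./hive_export_staging
# then run the script above with:
INDIRECTORY="./hive_export_staging"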


1
The existing answers to this question are good, but I found problems with two of them:
  • As Carter Shanklin said, with this command we will obtain a csv file with the results of the query in the path specified:

    insert overwrite local directory '/home/carter/staging' row format delimited fields terminated by ',' select * from hugetable;
    

    The problem with this solution is that the CSV obtained won't have headers and the file it creates is not named as a CSV (so we have to rename it).

  • As user1922900 said, with the following command we will obtain a CSV file with the results of the query in the specified file and with headers:

    hive -e 'select * from some_table' | sed 's/[\t]/,/g' > /home/yourfile.csv
    

    With this solution we will get a CSV file with the result rows of our query, but with log messages interleaved between those rows as well. As a solution to this problem I tried this, but without results.
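
A hedged workaround for the interleaved log messages is Hive's silent mode (-S), which the script below relies on as well; on many setups it keeps the informational output away from the query results:

# -S (silent mode) suppresses most informational output, so essentially only result rows reach stdout
hive -S -e 'select * from some_table' | sed 's/[\t]/,/g' > /home/yourfile.csv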

So, to solve all of these problems, I created a script that runs a list of queries, creates a folder (with a timestamp) to store the results, renames the files obtained, deletes the unnecessary ones, and adds the corresponding headers.
#!/bin/bash
# queries to export; each one produces query<N>.csv in the results folder
QUERIES=("select * from table1" "select * from table2")
IFS=""
# timestamp used to name the results folder
timestamp=$(date +%Y%m%d%H%M%S)
directoryname="ScriptResults$timestamp"
mkdir "$directoryname"
counter=1
# the script is assumed to run from /data/2/DOMAIN_USERS/SANUK/users/$USER,
# so the relative $tablename paths below match the export directory used by Hive
for query in ${QUERIES[*]}
do
    tablename="query"$counter
    # export the query result as comma-separated files (no header yet)
    hive -S -e "INSERT OVERWRITE LOCAL DIRECTORY '/data/2/DOMAIN_USERS/SANUK/users/$USER/$tablename' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' $query ;"
    # grab only the header line and turn its tabs into commas
    hive -S -e "set hive.cli.print.header=true; $query limit 1" | head -1 | sed 's/[\t]/,/g' >> /data/2/DOMAIN_USERS/SANUK/users/$USER/$tablename/header.csv
    # prepend the header to the data, then move the finished CSV into the results folder
    mv $tablename/000000_0 $tablename/$tablename.csv
    cat $tablename/$tablename.csv >> $tablename/header.csv
    rm $tablename/$tablename.csv
    mv $tablename/header.csv $tablename/$tablename.csv
    mv $tablename/$tablename.csv $directoryname
    counter=$((counter+1))
    rm -rf $tablename/
done
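
A hedged usage note, assuming the script is saved as export_queries.sh (the file name is an assumption) and run from the directory mentioned in the comment at the top:

bash export_queries.sh
ls ScriptResults*/        # query1.csv, query2.csv, ...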

1
Try
hive --outputformat=csv2 -e "select * from YOUR_TABLE"

This worked for me.

My Hive version is "Hive 3.1.0.3.1.0.0-78".
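
As a small hedged follow-up (the output path is an assumption), the result can be redirected straight into a file:

hive --outputformat=csv2 -e "select * from YOUR_TABLE" > /tmp/YOUR_TABLE.csv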


1

I had a similar problem, and here are the steps I followed to solve it.

Step 1 - Load the data from the Hive table into another table as follows:

DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
AS SELECT Column List FROM TestHiveTable;
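
For illustration only, with the column1/column2/table1 names taken from the question (running the statement through hive -e is just one possible way to execute it):

hive -e "DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  AS SELECT column1, column2 FROM table1;"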

Step 2 - Copy the blob from the Hive warehouse to the new location with the appropriate extension:

Start-AzureStorageBlobCopy -DestContext $destContext `
    -SrcContainer "Source Container" `
    -SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0" `
    -DestContainer "Destination Container" `
    -DestBlob "CSV/TestHiveTable.csv"

Hope this helps!

Best regards, Dattatrey Sindol (Datta) http://dattatreysindol.com


1
If you are on Windows, you can use the Python script hivehoney to extract table data to a local CSV file.
It will:
  • Log in to the bastion host.
  • pbrun.
  • kinit.
  • beeline (with your query).
  • Save the echo from beeline to a file on Windows.
Execute it like this:
set PROXY_HOST=your_bastion_host
set SERVICE_USER=you_func_user
set LINUX_USER=your_SOID
set LINUX_PWD=your_pwd

python hh.py --query_file=query.sql

0

None of the options above solve my problem perfectly. A few issues I wanted to address:

  • If there is a tab in a value, it should not break the CSV output
  • The header should be added automatically, without manual effort
  • Struct, array, and map fields should be JSON encoded

So I created a UDF to solve this. (I was a bit surprised that Hive doesn't have built-in support for it.)

Usage:

ADD JAR ivy://org.jsonex:HiveUDF:0.1.24?transitive=true;
CREATE TEMPORARY FUNCTION to_csv AS 'org.jsonex.hiveudf.ToCSVUDF';
SELECT to_csv(*) FROM someTable;  -- Default separator and headers
SELECT to_csv('{noHead:true}', *) FROM someTable;  -- No headers
SELECT to_csv('{headers:[,,,col3,]}', *) FROM someTable; -- Custom Headers
SELECT to_csv('{fieldSep:|,quoteChar:\"\\'\"}', *) FROM someTable; -- Custom fieldSep and quoteChar
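
A hedged sketch of saving the output as a file (the output path is an assumption; it presumes a CLI whose plain/silent output is just the result rows):

# each row comes back as one ready-made CSV line, so redirecting stdout yields a CSV file
hive -S -e "ADD JAR ivy://org.jsonex:HiveUDF:0.1.24?transitive=true;
CREATE TEMPORARY FUNCTION to_csv AS 'org.jsonex.hiveudf.ToCSVUDF';
SELECT to_csv(*) FROM someTable;" > /tmp/someTable.csv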

0
Below is the end-to-end solution that I use to export Hive table data to HDFS as a single named CSV file with a header.
(It is unfortunately not possible to do this with one HQL statement.)
It consists of several commands, but it is quite intuitive, I think, and it does not rely on the internal representation of Hive tables, which may change from time to time.
Replace "DIRECTORY" with "LOCAL DIRECTORY" if you want to export the data to a local filesystem instead of HDFS.
# cleanup the existing target HDFS directory, if it exists
sudo -u hdfs hdfs dfs -rm -f -r /tmp/data/my_exported_table_name/*

# export the data using Beeline CLI (it will create a data file with a surrogate name in the target HDFS directory)
beeline -u jdbc:hive2://my_hostname:10000 -n hive -e "INSERT OVERWRITE DIRECTORY '/tmp/data/my_exported_table_name' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM my_exported_table_name"

# set the owner of the target HDFS directory to whatever UID you'll be using to run the subsequent commands (root in this case)
sudo -u hdfs hdfs dfs -chown -R root:hdfs /tmp/data/my_exported_table_name

# write the CSV header record to a separate file (make sure that its name is higher in the sort order than for the data file in the target HDFS directory)
# also, obviously, make sure that the number and the order of fields is the same as in the data file
echo 'field_name_1,field_name_2,field_name_3,field_name_4,field_name_5' | hadoop fs -put - /tmp/data/my_exported_table_name/.header.csv

# concatenate all (2) files in the target HDFS directory into the final CSV data file with a header
# (this is where the sort order of the file names is important)
hadoop fs -cat /tmp/data/my_exported_table_name/* | hadoop fs -put - /tmp/data/my_exported_table_name/my_exported_table_name.csv

# give the permissions for the exported data to other users as necessary
sudo -u hdfs hdfs dfs -chmod -R 777 /tmp/data/my_exported_table_name
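
A quick hedged sanity check (path taken from the commands above); the header record should now be the first line of the exported file:

hadoop fs -cat /tmp/data/my_exported_table_name/my_exported_table_name.csv | head -n 3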
