How do I export a Hive table into a CSV file?

70

I used this Hive query to export a table into a CSV file.

INSERT OVERWRITE DIRECTORY '/user/data/output/test' select column1, column2 from table1;

The generated file '000000_0' does not have a comma delimiter.

Is this the right way to generate a CSV file? If not, please let me know how I can generate one.


Another question: when I save a large Hive table as several blocks on HDFS, I sometimes find that their schemas end up different, i.e. the column types change. How can I prevent this from happening? - Song
17 Answers

1

Other answers have already shown how to change the default delimiter.

There are also approaches that use a bash script to convert the raw output to CSV. However, there are three delimiters to consider, not just \001, and things get more complicated when your Hive table contains maps.

I wrote a bash script that handles all three of Hive's default delimiters (\001, \002 and \003) and outputs a CSV. The script and more details are available here:

Hive Default Delimiters to CSV

Hive's default delimiters are

Field Delimiter => Control-A ('\001')
Collection Item Delimiter => Control-B ('\002')
Map Key Delimiter => Control-C ('\003')

There are ways to change these delimiters when exporting tables but sometimes you might still get stuck needing to convert this to csv.
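
If you are not sure which of these control characters actually appear in your export, a minimal check (using the HDFS output path and file name from the question; adjust them to your own export) is:

# cat -v renders \001, \002 and \003 visibly as ^A, ^B and ^C
hdfs dfs -cat /user/data/output/test/000000_0 | head -n 5 | cat -v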

Here's a quick bash script that can handle a DB export that's segmented in multiple files and has the default delimiters. It will output a single CSV file.

It is assumed that the segments all have the naming convention 000*_0

INDIRECTORY="path/to/input/directory"
for f in "$INDIRECTORY"/000*_0; do
  echo "Processing $f file.."
  # cat -v renders the control characters as ^A, ^B and ^C so sed can match them as plain text
  cat -v "$f" |
      LC_ALL=C sed -e "s/^/\"/g" |                        # open a quote at the start of each row
      LC_ALL=C sed -e "s/\^A/\",\"/g" |                   # field delimiter \001 -> ","
      LC_ALL=C sed -e "s/\^C\^B/\"\":\"\"\"\",\"\"/g" |   # \003\002 pair handled before the single delimiters
      LC_ALL=C sed -e "s/\^B/\"\",\"\"/g" |               # collection item delimiter \002 -> "",""
      LC_ALL=C sed -e "s/\^C/\"\":\"\"/g" |               # map key delimiter \003 -> "":""
      LC_ALL=C sed -e "s/$/\"/g" > "$f-temp"              # close the quote at the end of each row
done
echo "you,can,echo,your,header,here,if,you,like" > "$INDIRECTORY/final_output.csv"
cat "$INDIRECTORY"/*-temp >> "$INDIRECTORY/final_output.csv"
rm "$INDIRECTORY"/*-temp

More explanation is available in the gist.
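
A hedged usage sketch tying this back to the question's export (the local staging directory name is an assumption):

# copy the HDFS export from the question down to a local staging directory
hdfs dfs -get /user/data/output/test ./hive_export_staging
# then run the script above with:
INDIRECTORY="./hive_export_staging"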


1
The existing answers to this question are good, but I found problems with two of them:
  • As Carter Shanklin said, with this command we will obtain a csv file with the results of the query in the path specified:

    insert overwrite local directory '/home/carter/staging' row format delimited fields terminated by ',' select * from hugetable;
    

    The problem with this solution is that the CSV obtained won't have headers and the file it creates is not named as a CSV (so we have to rename it).

  • As user1922900 said, with the following command we will obtain a CSV file with the results of the query in the specified file and with headers:

    hive -e 'select * from some_table' | sed 's/[\t]/,/g' > /home/yourfile.csv
    

    With this solution we will get a CSV file with the result rows of our query, but with log messages interleaved between those rows as well. As a solution to this problem I tried this, but without results.
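
A hedged workaround for the interleaved log messages is Hive's silent mode (-S), which the script below relies on as well; on many setups it keeps the informational output away from the query results:

# -S (silent mode) suppresses most informational output, so essentially only result rows reach stdout
hive -S -e 'select * from some_table' | sed 's/[\t]/,/g' > /home/yourfile.csv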

So, to solve all of these problems, I created a script that runs a list of queries, creates a folder (with a timestamp) to store the results, renames the files obtained, deletes the unnecessary ones, and adds the corresponding headers.
#!/bin/bash
# queries to export; each one produces query<N>.csv in the results folder
QUERIES=("select * from table1" "select * from table2")
IFS=""
# timestamp used to name the results folder
timestamp=$(date +%Y%m%d%H%M%S)
directoryname="ScriptResults$timestamp"
mkdir "$directoryname"
counter=1
# the script is assumed to run from /data/2/DOMAIN_USERS/SANUK/users/$USER,
# so the relative $tablename paths below match the export directory used by Hive
for query in ${QUERIES[*]}
do
    tablename="query"$counter
    # export the query result as comma-separated files (no header yet)
    hive -S -e "INSERT OVERWRITE LOCAL DIRECTORY '/data/2/DOMAIN_USERS/SANUK/users/$USER/$tablename' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' $query ;"
    # grab only the header line and turn its tabs into commas
    hive -S -e "set hive.cli.print.header=true; $query limit 1" | head -1 | sed 's/[\t]/,/g' >> /data/2/DOMAIN_USERS/SANUK/users/$USER/$tablename/header.csv
    # prepend the header to the data, then move the finished CSV into the results folder
    mv $tablename/000000_0 $tablename/$tablename.csv
    cat $tablename/$tablename.csv >> $tablename/header.csv
    rm $tablename/$tablename.csv
    mv $tablename/header.csv $tablename/$tablename.csv
    mv $tablename/$tablename.csv $directoryname
    counter=$((counter+1))
    rm -rf $tablename/
done
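
A hedged usage note, assuming the script is saved as export_queries.sh (the file name is an assumption) and run from the directory mentioned in the comment at the top:

bash export_queries.sh
ls ScriptResults*/        # query1.csv, query2.csv, ...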

1
Try
hive --outputformat=csv2 -e "select * from YOUR_TABLE"

This worked for me.

My Hive version is "Hive 3.1.0.3.1.0.0-78".
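
As a small hedged follow-up (the output path is an assumption), the result can be redirected straight into a file:

hive --outputformat=csv2 -e "select * from YOUR_TABLE" > /tmp/YOUR_TABLE.csv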


1

I had a similar problem, and here are the steps I followed to solve it.

Step 1 - Load the data from the Hive table into another table as follows:

DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
AS SELECT Column List FROM TestHiveTable;
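
For illustration only, with the column1/column2/table1 names taken from the question (running the statement through hive -e is just one possible way to execute it):

hive -e "DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  AS SELECT column1, column2 FROM table1;"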

Step 2 - Copy the blob from the Hive warehouse to the new location with the appropriate extension:

Start-AzureStorageBlobCopy -DestContext $destContext `
    -SrcContainer "Source Container" `
    -SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0" `
    -DestContainer "Destination Container" `
    -DestBlob "CSV/TestHiveTable.csv"

Hope this helps!

Best regards, Dattatrey Sindol (Datta) http://dattatreysindol.com


1
If you are on Windows, you can use the Python script hivehoney to extract table data to a local CSV file.
It will:
  • Log in to the bastion host.
  • pbrun.
  • kinit.
  • beeline (with your query).
  • Save the echo from beeline to a file on Windows.
Execute it like this:
set PROXY_HOST=your_bastion_host
set SERVICE_USER=you_func_user
set LINUX_USER=your_SOID
set LINUX_PWD=your_pwd

python hh.py --query_file=query.sql

0

None of the options above solve my problem perfectly. A few issues I wanted to address:

  • If there is a tab in a value, it should not break the CSV output
  • The header should be added automatically, without manual effort
  • Struct, array, and map fields should be JSON encoded

So I created a UDF to solve this. (I was a bit surprised that Hive doesn't have built-in support for it.)

Usage:

ADD JAR ivy://org.jsonex:HiveUDF:0.1.24?transitive=true;
CREATE TEMPORARY FUNCTION to_csv AS 'org.jsonex.hiveudf.ToCSVUDF';
SELECT to_csv(*) FROM someTable;  -- Default separator and headers
SELECT to_csv('{noHead:true}', *) FROM someTable;  -- No headers
SELECT to_csv('{headers:[,,,col3,]}', *) FROM someTable; -- Custom Headers
SELECT to_csv('{fieldSep:|,quoteChar:\"\\'\"}', *) FROM someTable; -- Custom fieldSep and quoteChar
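
A hedged sketch of saving the output as a file (the output path is an assumption; it presumes a CLI whose plain/silent output is just the result rows):

# each row comes back as one ready-made CSV line, so redirecting stdout yields a CSV file
hive -S -e "ADD JAR ivy://org.jsonex:HiveUDF:0.1.24?transitive=true;
CREATE TEMPORARY FUNCTION to_csv AS 'org.jsonex.hiveudf.ToCSVUDF';
SELECT to_csv(*) FROM someTable;" > /tmp/someTable.csv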

0
Below is the end-to-end solution that I use to export Hive table data to HDFS as a single named CSV file with a header.
(It is unfortunately not possible to do this with one HQL statement.)
It consists of several commands, but it is quite intuitive, I think, and it does not rely on the internal representation of Hive tables, which may change from time to time.
Replace "DIRECTORY" with "LOCAL DIRECTORY" if you want to export the data to a local filesystem instead of HDFS.
# cleanup the existing target HDFS directory, if it exists
sudo -u hdfs hdfs dfs -rm -f -r /tmp/data/my_exported_table_name/*

# export the data using Beeline CLI (it will create a data file with a surrogate name in the target HDFS directory)
beeline -u jdbc:hive2://my_hostname:10000 -n hive -e "INSERT OVERWRITE DIRECTORY '/tmp/data/my_exported_table_name' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM my_exported_table_name"

# set the owner of the target HDFS directory to whatever UID you'll be using to run the subsequent commands (root in this case)
sudo -u hdfs hdfs dfs -chown -R root:hdfs /tmp/data/my_exported_table_name

# write the CSV header record to a separate file (make sure that its name is higher in the sort order than for the data file in the target HDFS directory)
# also, obviously, make sure that the number and the order of fields is the same as in the data file
echo 'field_name_1,field_name_2,field_name_3,field_name_4,field_name_5' | hadoop fs -put - /tmp/data/my_exported_table_name/.header.csv

# concatenate all (2) files in the target HDFS directory into the final CSV data file with a header
# (this is where the sort order of the file names is important)
hadoop fs -cat /tmp/data/my_exported_table_name/* | hadoop fs -put - /tmp/data/my_exported_table_name/my_exported_table_name.csv

# give the permissions for the exported data to other users as necessary
sudo -u hdfs hdfs dfs -chmod -R 777 /tmp/data/my_exported_table_name
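
A quick hedged sanity check (path taken from the commands above); the header record should now be the first line of the exported file:

hadoop fs -cat /tmp/data/my_exported_table_name/my_exported_table_name.csv | head -n 3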
