Hive performance

Question

hadoop hive

6

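I am using Hive and I am new to it. I am running into some problems with the performance of Hive queries.

The number of mappers assigned to my job is very low, even though hundreds of mappers are available. I tried setting mapred.map.tasks = 200, but only 20 to 30 mappers are used. I understand that the number of mappers depends on the input splits. Is there any other option to increase the number of mappers? If not, why was the parameter (mapred.map.tasks) introduced?
Is there any resource that would help me understand how a Hive query maps onto MapReduce tasks, i.e. where the different parts of the query are executed?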

- bcarthic

1

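How is your input data organized? In some cases, Hive is not free to split the input across the (ideal) number of mappers. For example, if you are loading .gz files, the standard behavior is 1 .gz file -> 1 map, regardless of the number of available nodes. - Mike Repass

I am querying a Hive table, but the table is very large, about 10 TB in size. - bcarthic

The size of the table doesn't matter; what @MikeRepass is referring to is the layout of the data files. Is your table a single compressed file, or is it made up of several files? Some compression codecs and file formats support splitting. - Stéphane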

3 Answers

1

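I see this question was asked a long time ago, and some of these suggestions may not have been available when it was asked, but I will try to answer anyway.

To optimize Hive performance:

Tune the number of mappers and reducers used by your Hive request; this can be done by adjusting the input size per mapper, mapreduce.input.fileinputformat.split.maxsize, and the input size per reducer, hive.exec.reducers.bytes.per.reducer.

Keep in mind that "more is better" is not always true, so you need to tune these numbers to your own needs.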
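
As a rough sketch of the tuning described above, the two properties can be set per session before running a query. The byte values below are illustrative assumptions, not recommendations, and the table name `myTable` and column `someColumn` are borrowed from the example further down.

```sql
-- Cap each input split at 64 MB (illustrative value).
-- A smaller cap produces more splits and therefore more mappers.
-- On older Hadoop versions the same setting is named mapred.max.split.size.
SET mapreduce.input.fileinputformat.split.maxsize=67108864;

-- Aim for roughly 256 MB of input per reducer (illustrative value).
-- A smaller value produces more reducers.
SET hive.exec.reducers.bytes.per.reducer=268435456;

-- Any query run afterwards in this session picks up the settings.
SELECT someColumn, COUNT(*) AS cnt
FROM myTable
GROUP BY someColumn;
```

Note that a smaller split size only helps when the input files are splittable; as mentioned in the comments on the question, a .gz file is still read by a single mapper.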

Optimize the joins: convert joins to map-joins where possible, if one of the tables is small (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization).
Partition your table on columns that are often used in WHERE conditions.
For example, if you frequently run
SELECT * from myTable WHERE someColumn = 'someValue'
it is recommended to partition your table on the column 'someColumn'.
This lets your query read only the files of the matching partition (someColumn=someValue) instead of scanning all of the table's files.
Compressing the intermediate results may improve performance in some cases (depending on your hardware configuration, network and CPU / memory). This can be done by setting the property: hive.intermediate.compression.codec
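
The three suggestions above could look roughly like this in a Hive session. This is a sketch under assumptions: the table `myTable_partitioned` and the columns `id` and `payload` are hypothetical, the small-table threshold is an illustrative value, and whether any of this pays off depends on your data and cluster.

```sql
-- 1) Map-joins: let Hive convert a join automatically when one side is small.
SET hive.auto.convert.join=true;
-- Size threshold in bytes below which a table counts as small (illustrative).
SET hive.mapjoin.smalltable.filesize=25000000;

-- 2) Partitioning: a hypothetical partitioned copy of myTable.
CREATE TABLE myTable_partitioned (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (someColumn STRING);

-- Allow Hive to create the partitions from the values found in the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE myTable_partitioned PARTITION (someColumn)
SELECT id, payload, someColumn FROM myTable;

-- This query now reads only the files under someColumn=someValue.
SELECT * FROM myTable_partitioned WHERE someColumn = 'someValue';

-- 3) Intermediate compression: enable it, then pick the codec.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```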

Choose the right compression codec, for example Snappy:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Not available at the time the question was asked:

Use an optimized file format to store your table: instead of Text File or Sequence File, you could use ORC (Hive 0.11+), for example (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC).
Use another engine to execute your queries: instead of MapReduce, you could use Tez or even Spark. To use Tez, for example:
```
<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
```
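
As a sketch of how the two suggestions above might be tried from a Hive session rather than from the site-wide configuration: the table name `myTable_orc` is hypothetical, and the engine switch assumes Tez is already installed on the cluster.

```sql
-- Create an ORC-backed copy of an existing table (hypothetical name myTable_orc).
CREATE TABLE myTable_orc
STORED AS ORC
AS SELECT * FROM myTable;

-- Switch the execution engine for the current session only.
SET hive.execution.engine=tez;

-- Queries run after this point use the ORC copy and the Tez engine.
SELECT COUNT(*) FROM myTable_orc;
```

Setting the engine per session makes it easy to compare the same query under MapReduce and Tez before changing the property for everyone.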

For further optimization, you can refer here.

- user1314742

0

You can decrease 'mapreduce.input.fileinputformat.split.maxsize' to increase the number of mappers (more splits).

- Stéphane

Content provided by Stack Overflow.

- Joe K · Accepted Answer

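For more information about setting the number of map tasks, check out this link: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. Basically, mapred.map.tasks is just a hint; it usually doesn't really control anything.

If you want to see how a Hive query is executed, put explain before the query. For example: explain select foo from bar;. If you need more information, you can also use explain extended.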
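
A slightly larger sketch of the same idea, using an aggregate query so that the plan has both a map side and a reduce side. The table `bar` and column `foo` come from the example above; the aggregate itself is an assumption added for illustration.

```sql
-- Show the plan without running the query.
EXPLAIN
SELECT foo, COUNT(*) AS cnt
FROM bar
GROUP BY foo;

-- Same plan with additional detail.
EXPLAIN EXTENDED
SELECT foo, COUNT(*) AS cnt
FROM bar
GROUP BY foo;
```

On the MapReduce engine the output typically lists the stage dependencies followed by the plan for each stage, with a map operator tree and a reduce operator tree showing which part of the query runs where. The exact layout varies by Hive version and execution engine.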