简介
我正在使用 Hive 1.1.0 和 Cloudera 5.13.3,根据我在 Hue 中运行的解释计划,IN
运算符与等于运算符 (=
) 采用相同的优化方式。
示例
我的表按 LOAD_YEAR (SMALLINT)
和 LOAD_MONTH (TINYINT)
进行分区,有这两个分区:
load_year=2018/load_month=10
(19,828,71 行)
load_year=2018/load_month=11
(702,856 行)
以下是各种查询及其解释计划。
1. 等于 (=
) 运算符
查询:
SELECT ID
FROM TBL
WHERE LOAD_MONTH = 11Y
解释计划:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (load_month = 11) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
2. IN
运算符
查询(请注意数据中没有12月份):
SELECT ID
FROM TBL
WHERE LOAD_MONTH IN (11Y, 12Y)
解释计划:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (load_month = 11) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
3. 在AND和OR条件下使用等于符号(=
)
查询:
SELECT ID
FROM TBL
WHERE
(LOAD_YEAR = 2018S AND LOAD_MONTH = 11Y)
OR (LOAD_YEAR = 2019S AND LOAD_MONTH = 1Y)
解释计划:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (((load_year = 2018) and (load_month = 11)) or ((load_year = 2019) and (load_month = 1))) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
4. 算术运算
查询:
SELECT ID
FROM TBL
WHERE (LOAD_YEAR * 100 + LOAD_MONTH) IN (201811, 201901)
小提示:
100没有后缀,因此是一个INT
,(LOAD_YEAR * 100 + LOAD_MONTH)
也是一个INT
。这确保了结果的准确性。由于LOAD_YEAR
是一个SMALLINT
,而LOAD_MONTH
是一个TINYINT
,对两者进行算术计算时,结果使用SMALLINT
,存储的最大值为32,767(不足以表示yyyymm
,它需要6个数字,即至少达到999,999)。将100作为INT
,可以使用INT
类型进行计算,并允许数字高达2,147,483,647。
执行计划:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (201811) IN (201811, 201901) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
摘要
所有这些查询仅扫描第二个分区,从而避免了其他分区中的约2000万行。
between
运算符的使用方式也相同。 - leftjoin