I agree with @10465355's answer. Here is an extreme example.
## Data processing
There is a table named `table_a` whose columns are all strings. Its storage format is ORC, and it was generated by:

```sql
insert overwrite table table_a
select a,b,...,i
from table_other
group by a,b,...,i
```
After the HashAggregate operation, the data in `table_a` is already well organized, especially the first column `a`. The ORC file is 6.97 MB. (There is actually also one small file of 2.09 KB, which I ignore from here on.)
Next, we `repartition` `table_a`:

```scala
val querydf = spark.sql("""select *
                           from table_a distribute by rand()""").repartition(1)
querydf.createOrReplaceTempView("tmpTable")

spark.sql("""insert overwrite table table_a
             select a,b,...,i
             from tmpTable""")
```
When `numPartitions = 1`, `Random(hashing.byteswap32(index)).nextInt(numPartitions)` cannot trigger any random redistribution. So we add `distribute by rand()`, which makes the whole thing equivalent to `repartition(n)`, and we obtain a single file of 14.26 MB.
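The idea above can be sketched in a few lines of illustrative Python. This is not Spark's actual code: Python's `random.Random` stands in for the JVM `Random` that Spark seeds with `hashing.byteswap32(index)`, and `target_partition` is a made-up helper name.

```python
import random

def target_partition(partition_index: int, num_partitions: int) -> int:
    # Stand-in for Spark's Random(hashing.byteswap32(index)).nextInt(numPartitions):
    # a per-partition seeded RNG picks the destination partition for each row.
    rng = random.Random(partition_index)
    return rng.randrange(num_partitions)

# With numPartitions = 1 every row lands in partition 0, whatever the seed,
# so repartition(1) alone never randomizes row order.
assert all(target_partition(i, 1) == 0 for i in range(1000))

# With numPartitions > 1 the assignment actually varies; that randomization
# is what `distribute by rand()` reintroduces before the final repartition(1).
print({target_partition(i, 8) for i in range(1000)})
```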
## Result
We can get the ORC file structure with `hive --orcfiledump`.
Before `repartition`:
```
Stripes:
Stripe: offset: 3 data: 7288854 rows: 668265 tail: 354 index: 13637
Stream: column 0 section ROW_INDEX start: 3 length 50
Stream: column 1 section ROW_INDEX start: 53 length 1706
Stream: column 2 section ROW_INDEX start: 1759 length 672
Stream: column 3 section ROW_INDEX start: 2431 length 2297
Stream: column 4 section ROW_INDEX start: 4728 length 1638
Stream: column 5 section ROW_INDEX start: 6366 length 1270
Stream: column 6 section ROW_INDEX start: 7636 length 1887
Stream: column 7 section ROW_INDEX start: 9523 length 1823
Stream: column 8 section ROW_INDEX start: 11346 length 1120
Stream: column 9 section ROW_INDEX start: 12466 length 1174
Stream: column 1 section DATA start: 13640 length 209662
Stream: column 1 section LENGTH start: 223302 length 1158
Stream: column 1 section DICTIONARY_DATA start: 224460 length 231328
Stream: column 2 section DATA start: 455788 length 29861
Stream: column 2 section LENGTH start: 485649 length 5
Stream: column 2 section DICTIONARY_DATA start: 485654 length 33
Stream: column 3 section DATA start: 485687 length 424936
Stream: column 3 section LENGTH start: 910623 length 4069
Stream: column 3 section DICTIONARY_DATA start: 914692 length 41298
Stream: column 4 section DATA start: 955990 length 443602
Stream: column 4 section LENGTH start: 1399592 length 4122
Stream: column 4 section DICTIONARY_DATA start: 1403714 length 56217
Stream: column 5 section DATA start: 1459931 length 475983
Stream: column 5 section LENGTH start: 1935914 length 2650
Stream: column 5 section DICTIONARY_DATA start: 1938564 length 17798
Stream: column 6 section DATA start: 1956362 length 480891
Stream: column 6 section LENGTH start: 2437253 length 4230
Stream: column 6 section DICTIONARY_DATA start: 2441483 length 27873
Stream: column 7 section DATA start: 2469356 length 2716359
Stream: column 7 section LENGTH start: 5185715 length 304679
Stream: column 8 section DATA start: 5490394 length 438723
Stream: column 8 section LENGTH start: 5929117 length 58072
Stream: column 8 section DICTIONARY_DATA start: 5987189 length 424961
Stream: column 9 section DATA start: 6412150 length 630248
Stream: column 9 section LENGTH start: 7042398 length 1455
Stream: column 9 section DICTIONARY_DATA start: 7043853 length 258641
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[48184]
Encoding column 2: DICTIONARY_V2[3]
Encoding column 3: DICTIONARY_V2[4252]
Encoding column 4: DICTIONARY_V2[4398]
Encoding column 5: DICTIONARY_V2[4404]
Encoding column 6: DICTIONARY_V2[5553]
Encoding column 7: DIRECT_V2
Encoding column 8: DICTIONARY_V2[105667]
Encoding column 9: DICTIONARY_V2[60943]
```
After `repartition`:
```
Stripes:
Stripe: offset: 3 data: 14940022 rows: 668284 tail: 344 index: 12312
Stream: column 0 section ROW_INDEX start: 3 length 50
Stream: column 1 section ROW_INDEX start: 53 length 1755
Stream: column 2 section ROW_INDEX start: 1808 length 678
Stream: column 3 section ROW_INDEX start: 2486 length 1815
Stream: column 4 section ROW_INDEX start: 4301 length 1297
Stream: column 5 section ROW_INDEX start: 5598 length 1217
Stream: column 6 section ROW_INDEX start: 6815 length 1841
Stream: column 7 section ROW_INDEX start: 8656 length 1330
Stream: column 8 section ROW_INDEX start: 9986 length 1289
Stream: column 9 section ROW_INDEX start: 11275 length 1040
Stream: column 1 section DATA start: 12315 length 4260547
Stream: column 1 section LENGTH start: 4272862 length 15955
Stream: column 2 section DATA start: 4288817 length 102153
Stream: column 2 section LENGTH start: 4390970 length 5
Stream: column 2 section DICTIONARY_DATA start: 4390975 length 33
Stream: column 3 section DATA start: 4391008 length 1033345
Stream: column 3 section LENGTH start: 5424353 length 4069
Stream: column 3 section DICTIONARY_DATA start: 5428422 length 41298
Stream: column 4 section DATA start: 5469720 length 1044769
Stream: column 4 section LENGTH start: 6514489 length 4122
Stream: column 4 section DICTIONARY_DATA start: 6518611 length 56217
Stream: column 5 section DATA start: 6574828 length 1142805
Stream: column 5 section LENGTH start: 7717633 length 2650
Stream: column 5 section DICTIONARY_DATA start: 7720283 length 17798
Stream: column 6 section DATA start: 7738081 length 1147888
Stream: column 6 section LENGTH start: 8885969 length 4230
Stream: column 6 section DICTIONARY_DATA start: 8890199 length 27873
Stream: column 7 section DATA start: 8918072 length 1705640
Stream: column 7 section LENGTH start: 10623712 length 208184
Stream: column 7 section DICTIONARY_DATA start: 10831896 length 1525605
Stream: column 8 section DATA start: 12357501 length 513225
Stream: column 8 section LENGTH start: 12870726 length 58100
Stream: column 8 section DICTIONARY_DATA start: 12928826 length 424905
Stream: column 9 section DATA start: 13353731 length 1338510
Stream: column 9 section LENGTH start: 14692241 length 1455
Stream: column 9 section DICTIONARY_DATA start: 14693696 length 258641
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DICTIONARY_V2[3]
Encoding column 3: DICTIONARY_V2[4252]
Encoding column 4: DICTIONARY_V2[4398]
Encoding column 5: DICTIONARY_V2[4404]
Encoding column 6: DICTIONARY_V2[5553]
Encoding column 7: DICTIONARY_V2[378283]
Encoding column 8: DICTIONARY_V2[105678]
Encoding column 9: DICTIONARY_V2[60943]
```
ORC uses run-length encoding and dictionary encoding to compress data. Here is the meaning of the `DICTIONARY_V2` encoding. Ref: ORCv1
| ENCODING | STREAM KIND | OPTIONAL | CONTENTS |
| ------------- | --------------- | -------- | ----------------------- |
| DICTIONARY_V2 | PRESENT | Yes | Boolean RLE |
| | DATA | No | Unsigned Integer RLE v2 |
| | DICTIONARY_DATA | No | String contents |
| | LENGTH | No | Unsigned Integer RLE v2 |
In dictionary encoding, if the values are ["Nevada", "California", "Nevada", "California", "Florida"], then DICTIONARY_DATA would be "CaliforniaFloridaNevada", LENGTH would be [10, 7, 6], and DATA would be [2, 0, 2, 0, 1].
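That stream layout can be reproduced with a short illustrative Python sketch (not the real ORC writer; `dictionary_encode` is a made-up helper name):

```python
def dictionary_encode(values):
    """Sketch of the DICTIONARY_V2 stream contents: the distinct values are
    sorted, DICTIONARY_DATA concatenates them, LENGTH holds each entry's
    character length, and DATA holds per-row indexes into the dictionary."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    dictionary_data = "".join(dictionary)
    lengths = [len(v) for v in dictionary]
    data = [index[v] for v in values]
    return dictionary_data, lengths, data

print(dictionary_encode(["Nevada", "California", "Nevada", "California", "Florida"]))
# -> ('CaliforniaFloridaNevada', [10, 7, 6], [2, 0, 2, 0, 1])
```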
Unsigned Integer RLE v2 is also described in the same reference: ORCv1
In Hive 0.12, ORC introduced Run-Length Encoding version 2 (RLEv2), which has improved compression and fixed bit-width encodings for faster expansion. RLEv2 uses four sub-encodings based on the data:

- Short Repeat - used for short sequences with repeated values
- Direct - used for random sequences with a fixed bit width
- Patched Base - used for random sequences with a variable bit width
- Delta - used for monotonically increasing or decreasing sequences
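To see why the Delta sub-encoding rewards ordered data, here is a small illustrative computation (my own sketch, not the real ORC writer): a monotonic run needs only a base value plus tiny deltas, while the same values stored directly need their full bit width.

```python
def bits_needed(n: int) -> int:
    # Minimum bit width for a non-negative integer (illustrative helper).
    return max(n, 1).bit_length()

sorted_vals = list(range(1000, 2000))  # a monotonically increasing sequence
deltas = [b - a for a, b in zip(sorted_vals, sorted_vals[1:])]

# Delta: after storing the base value (1000), every delta fits in 1 bit.
print(max(bits_needed(d) for d in deltas))       # 1
# Direct: the raw values need 11 bits each (1999 < 2**11).
print(max(bits_needed(v) for v in sorted_vals))  # 11
```

Shuffling the rows destroys exactly this kind of local order, pushing columns toward the wider Direct/Patched Base forms.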
Let's focus on the first column. Before `repartition`:

```
Stream: column 1 section DATA start: 13640 length 209662
Stream: column 1 section LENGTH start: 223302 length 1158
Stream: column 1 section DICTIONARY_DATA start: 224460 length 231328
Encoding column 1: DICTIONARY_V2[48184]
```

After `repartition`:

```
Stream: column 1 section DATA start: 12315 length 4260547
Stream: column 1 section LENGTH start: 4272862 length 15955
Encoding column 1: DIRECT_V2
```
Although I don't know exactly how ORC chooses the encoding, ORC decided that, after randomization, using DIRECT_V2 for column 1 would save more space than using DICTIONARY_V2. In fact, after the repartition, the space taken by this column grew almost 10x: (4260547 + 15955) / (209662 + 1158 + 231328).
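The quoted ratio can be checked directly from the orcfiledump stream lengths (in bytes) shown above:

```python
# Column 1 after repartition: DIRECT_V2 streams (DATA + LENGTH).
direct_v2_bytes = 4260547 + 15955
# Column 1 before repartition: DICTIONARY_V2 streams
# (DATA + LENGTH + DICTIONARY_DATA).
dictionary_v2_bytes = 209662 + 1158 + 231328

print(round(direct_v2_bytes / dictionary_v2_bytes, 2))  # 9.67
```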
The encoding of most of the other columns did not change, but their sizes still increased.
## Comparing `repartition` with `coalesce`
- The former's file sizes are even, which avoids data skew.
- The former's data size becomes larger.
- *(Potentially)* When filtering the shuffled data, ORC's row-group index cannot be used.
- Both still need to be shuffled again when joining. I tested with the data above and found no significant difference in shuffle and sort time.
`coalesce(2000)` generated 2000 files. - Powers