Hive Explode / Lateral View with multiple arrays

Question

22

I have a Hive table with the following schema:

COOKIE  | PRODUCT_ID | CAT_ID |    QTY    
1234123   [1,2,3]    [r,t,null]  [2,1,null]

How can I normalize the arrays to get the following result?

COOKIE  | PRODUCT_ID | CAT_ID |    QTY
1234123   [1]          [r]         [2]
1234123   [2]          [t]         [1]
1234123   [3]          null        null

I tried the following:

select concat_ws('|',visid_high,visid_low) as cookie
,pid
,catid 
,qty
from table
lateral view explode(productid) ptable as pid
lateral view explode(catalogId) ptable2 as catid 
lateral view explode(qty) ptable3 as qty

However, the result comes out as a Cartesian product.

- user2726995
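
The row blow-up described in the question can be reproduced outside Hive. The following plain-Python sketch is an illustration only, not Hive code, and its variable names are made up for the example. It treats each chained LATERAL VIEW explode as one more nested loop, so every element of one array is paired with every element of the others:

```python
from itertools import product

# Sample row from the question; None stands in for SQL NULL.
cookie = 1234123
product_id = [1, 2, 3]
cat_id = ["r", "t", None]
qty = [2, 1, None]

# Each chained explode acts like another nested loop over its array,
# so the output contains every (pid, catid, qty) combination.
rows = [(cookie, p, c, q) for p, c, q in product(product_id, cat_id, qty)]

print(len(rows))  # 3 * 3 * 3 = 27 rows instead of the 3 that were wanted
```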

5 Answers

16

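You can solve this with the numeric_range and array_index UDFs from Brickhouse (http://github.com/klout/brickhouse). There is an informative blog post that describes the approach in detail at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/.

Using these UDFs, the query would look something like: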

select cookie,
   array_index( product_id_arr, n ) as product_id,
   array_index( catalog_id_arr, n ) as catalog_id,
   array_index( qty_id_arr, n ) as qty
from table
lateral view numeric_range( size( product_id_arr )) n1 as n;

- Jerome Banks

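@Jerome... would this work if the arrays have different sizes? - E B

I'm not sure that different array sizes make sense. You would then need to check whether n is still within the bounds of the current array. Something like SELECT cookie, IF(n < size(array1), array_index(array1, n), null), IF(n < size(array2), array_index(array2, n) ..... - Jerome Banks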
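
The bounds check from the comment above can be sketched in plain Python. This is an illustration, not Hive or Brickhouse code; `element_or_none` and `explode_aligned` are hypothetical helpers. It assumes 0-based positions, returns an element only while n is smaller than the array's size, and drives the loop with the longest array so that shorter arrays are padded with None:

```python
def element_or_none(arr, n):
    """Return arr[n] when n is a valid 0-based index, else None."""
    return arr[n] if n < len(arr) else None


def explode_aligned(cookie, *arrays):
    """Emit one row per position, padding shorter arrays with None."""
    longest = max((len(a) for a in arrays), default=0)
    return [
        (cookie, *(element_or_none(a, n) for a in arrays))
        for n in range(longest)
    ]


# Arrays of different sizes: 3, 2 and 1 elements.
rows = explode_aligned(1234123, [1, 2, 3], ["r", "t"], [2])
for row in rows:
    print(row)
```

Driving the loop with the longest array is one possible choice. Using size(product_id_arr) as in the query above would instead drop positions that exist only in the other arrays.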

15

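You can do this with posexplode, which gives each element of an n-element array an integer from 0 to n-1 indicating its position in the array. Then use that integer (called pos, for position) to fetch the matching values from the other arrays with bracket notation, like this: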

select 
  cookie, 
  n.pos as position, 
  n.prd_id as product_id,
  cat_id[pos] as catalog_id,
  qty[pos] as qty
from table
lateral view posexplode(product_id_arr) n as pos, prd_id;

This avoids imported UDFs and also avoids joining the various arrays together, which gives much better performance.

- dataMD
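
As a rough picture of what the posexplode query above computes, here is a plain-Python sketch. It is an illustration only, not Hive code; the `posexplode` function below is a stand-in written for this example. It assumes all arrays have the same length, since Python raises IndexError on an out-of-range index:

```python
def posexplode(arr):
    """Yield (pos, value) pairs for one array, with 0-based positions."""
    return enumerate(arr)


# Sample row from the question; None stands in for SQL NULL.
cookie = 1234123
product_id_arr = [1, 2, 3]
cat_id = ["r", "t", None]
qty = [2, 1, None]

# Explode only product_id_arr, then look up the other arrays by position.
rows = [
    (cookie, pos, prd_id, cat_id[pos], qty[pos])
    for pos, prd_id in posexplode(product_id_arr)
]

for row in rows:
    print(row)
```

Only one array is exploded, so the number of output rows equals the length of that array.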

1

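This answer deserves more votes! The alternative of using multiple posexplode calls may work for smaller tables, but for larger tables and a larger number of columns to pass to posexplode, this is the right approach. - runr

This feels like it should be the top answer. Faster, cleaner, and scalable. - John F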

1

If you are using Spark 2.4 in pyspark, use posexplode together with arrays_zip.

from pyspark.sql.functions import arrays_zip, posexplode

df = (df
    .withColumn('zipped', arrays_zip('col1', 'col2'))
    .select('id', posexplode('zipped')))

- ehacinom

0

I tried to work out your scenario... please try this code -

create table info(cookie string,productid int,catid string,qty string);

insert into table info
select cookie,productid[myprod],categoryid[mycat],qty[myqty] from table
lateral view posexplode(productid) pro as myprod,pro
lateral view posexplode(categoryid) cate as mycat,cate
lateral view posexplode(qty) q as myqty,q
where myprod=mycat and mycat=myqty;

Note - in the statement above, if you replace select cookie,productid[myprod],categoryid[mycat],qty[myqty] from table with select cookie,myprod,mycat,myqty from table, the output will be the indexes of the elements in the productid, categoryid and qty arrays. Hope this helps.

- Lakshman Purihella


- Ahmed Abdellatif · Accepted Answer

I found a very good solution that does not use any UDFs; posexplode does the job well:

SELECT COOKIE,
ePRODUCT_ID,
eCAT_ID,
eQTY
FROM TABLE 
LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_ID AS seqp, ePRODUCT_ID
LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID
LATERAL VIEW posexplode(QTY) eQTY AS seqq, eQTY
WHERE seqp = seqc AND seqc = seqq;
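
Logically, the query above combines every (position, value) pair of the three arrays and then keeps only the combinations whose positions match. The plain-Python sketch below illustrates that reading. It is not Hive code, the sample data comes from the question, and Hive's actual execution plan may differ:

```python
from itertools import product

# Sample arrays from the question; None stands in for SQL NULL.
product_id = [1, 2, 3]
cat_id = ["r", "t", None]
qty = [2, 1, None]

# Three posexplode views: every (position, value) pair of each array,
# combined with every pair of the other two arrays.
combined = list(product(enumerate(product_id), enumerate(cat_id), enumerate(qty)))

# WHERE seqp = seqc AND seqc = seqq keeps only the position-aligned rows.
aligned = [
    (p, c, q)
    for (seqp, p), (seqc, c), (seqq, q) in combined
    if seqp == seqc == seqq
]

print(len(combined), len(aligned))  # 27 combinations, 3 rows kept
```

Under this reading, the number of intermediate combinations grows with the product of the array lengths, which is consistent with the comment above that multiple posexplode calls suit smaller tables.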