How to return an array<struct> using json_tuple in Hive

3

I have a Hive table with a JSON column. The table is stored in ORC format and has a single column containing a JSON string.

  • json_column
{
   "type":"REGULAR",
   "period":[
      "ONCE_PER_FOUR_WEEK",
      "ONCE_PER_SIX_WEEK",
      "ONCE_PER_ONE_MONTH",
      "ONCE_PER_TWO_MONTH",
      "ONCE_PER_THREE_MONTH"
   ],
   "count":[
      "4",
      "8",
      "12"
   ],
   "day":[
      "SATURDAY",
      "SUNDAY"
   ],
   "content":[
      {
         "count":"2",
         "value":5,
         "unit":"PERCENT"
      },
      {
         "count":"3",
         "value":10,
         "unit":"PERCENT"
      }
   ]
}

I want to split this column into five columns:

    type       string,
    period     array<string>,
    count      array<string>,
    day        array<string>,
    content    array<struct<count :string, value :int, unit :string>>

First, I used json_tuple to split this column into five columns.

SELECT b.type                                            as type,
       b.period                                          as period,
       b.count                                           as count,
       b.day                                             as day,
       b.content                                         as content
FROM sample_table a
         LATERAL VIEW JSON_TUPLE(a.content, 'type', 'period', 'count', 'day',
                                 'content') b
         AS type, period, count, day, content

I need the content column to be an array of structs, but json_tuple returns it as a string:

[{"count":"2","value":5,"unit":"PERCENT"},{"count":"3","value":10,"unit":"PERCENT"}]

How can I convert it from string to array<struct<count :string, value :int, unit :string>>? Any ideas?

1 Answer

3

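JSON_TUPLE and GET_JSON_OBJECT return strings. Without a custom UDF, you can still convert the JSON string by parsing it yourself: strip the brackets, split the string, explode the pieces, and reassemble the structs and the array.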

Example:

with sample_table as (
select '{
   "type":"REGULAR",
   "period":[
      "ONCE_PER_FOUR_WEEK",
      "ONCE_PER_SIX_WEEK",
      "ONCE_PER_ONE_MONTH",
      "ONCE_PER_TWO_MONTH",
      "ONCE_PER_THREE_MONTH"
   ],
   "count":[
      "4",
      "8",
      "12"
   ],
   "day":[
      "SATURDAY",
      "SUNDAY"
   ],
   "content":[
      {
         "count":"2",
         "value":5,
         "unit":"PERCENT"
      },
      {
         "count":"3",
         "value":10,
         "unit":"PERCENT"
      }
   ]
}' as content
)

SELECT b.type                                                 as type,
       --to convert to array<string>
       --remove [" and "], split by ","
       split(regexp_replace(b.period,'^\\["|"\\]$',''),'","') as period,
       split(regexp_replace(b.count,'^\\["|"\\]$',''),'","')  as count,
       split(regexp_replace(b.day,'^\\["|"\\]$',''),'","')    as day,
       --convert to struct and collect array of structs
       collect_list(named_struct('count', x.count, 'value', int(x.value), 'unit', x.unit)) as content    
FROM sample_table a
         LATERAL VIEW JSON_TUPLE(a.content, 'type', 'period', 'count', 'day', 'content') b AS type, period, count, day, content
         LATERAL VIEW explode(split(regexp_replace(b.content,'^\\[|\\]$',''), --remove []
                          '(?<=\\}),(?=\\{)' --split by comma only after } and before {
                         )) e as str_struct
         LATERAL VIEW JSON_TUPLE(e.str_struct,'count','value', 'unit') x as count, value, unit
group by b.type,
       b.period,
       b.count,
       b.day

Result:

type     period                                                                                                     count           day                     content
REGULAR ["ONCE_PER_FOUR_WEEK","ONCE_PER_SIX_WEEK","ONCE_PER_ONE_MONTH","ONCE_PER_TWO_MONTH","ONCE_PER_THREE_MONTH"] ["4","8","12"]  ["SATURDAY","SUNDAY"]   [{"count":"2","value":5,"unit":"PERCENT"},{"count":"3","value":10,"unit":"PERCENT"}]
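
To see the splitting step on its own, the query below runs only the bracket removal and the lookaround split on a literal string. It is a minimal sketch, not part of the original answer. The CTE name content_only is made up for illustration, and it assumes json_tuple returns the array compactly, with no whitespace between `}`, `,` and `{`.

```sql
-- Sketch: isolate the "split an array of JSON objects" step.
-- content_only is a hypothetical name; the literal is the string json_tuple returned in the question.
WITH content_only AS (
    SELECT '[{"count":"2","value":5,"unit":"PERCENT"},{"count":"3","value":10,"unit":"PERCENT"}]' AS content
)
SELECT e.str_struct
FROM content_only t
         LATERAL VIEW explode(
             split(regexp_replace(t.content, '^\\[|\\]$', ''), -- drop the outer [ and ]
                   '(?<=\\}),(?=\\{)')                         -- split only on a comma between } and {
         ) e AS str_struct;
```

This should produce one row per array element, each a standalone JSON object string that JSON_TUPLE can parse. The split pattern assumes flat objects: if an element itself contained a nested array of objects, the same `},{` boundary would appear inside it and the element would be cut in the wrong place.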

1
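Thanks! It works perfectly and saved me a lot of time. - jeewonb
@leftjoin, what if the same sample data also had another named array (like the "content" node) with a different number of elements? For example, "content" has 2 elements, but a new "content2" node (array) has only 1. I tried handling the new node ("content2") with a separate LATERAL VIEW, but each array ("content" and "content2") came back with 2 elements, instead of 2 in the first and 1 in the second. Any suggestions? - deeplay
Never mind... collect_list(distinct ...) did the trick :) - deeplay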
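
The duplication described in the comment above comes from exploding two arrays in the same query: each LATERAL VIEW multiplies the rows, so a 2-element array and a 1-element array give 2 x 1 = 2 rows, and both collect_list calls see 2 values. One possible way around it, without relying on deduplication, is to explode and re-collect each array in its own subquery and join the results. The following is a rough sketch under assumptions: the "content2" values are invented, and joining on type only works because the sample has a single row. A real table would need a unique row key, which the question does not show.

```sql
-- Sketch only: sample data with a hypothetical second array "content2".
WITH sample_table AS (
    SELECT '{"type":"REGULAR","content":[{"count":"2","value":5,"unit":"PERCENT"},{"count":"3","value":10,"unit":"PERCENT"}],"content2":[{"count":"4","value":15,"unit":"PERCENT"}]}' AS content
),
-- Extract both arrays as JSON strings, one row per source row.
parsed AS (
    SELECT b.type, b.content, b.content2
    FROM sample_table a
             LATERAL VIEW json_tuple(a.content, 'type', 'content', 'content2') b AS type, content, content2
),
-- Explode and re-collect "content" alone, so "content2" cannot multiply its rows.
c1 AS (
    SELECT p.type,
           collect_list(named_struct('count', x.count, 'value', int(x.value), 'unit', x.unit)) AS content
    FROM parsed p
             LATERAL VIEW explode(split(regexp_replace(p.content, '^\\[|\\]$', ''), '(?<=\\}),(?=\\{)')) e AS str_struct
             LATERAL VIEW json_tuple(e.str_struct, 'count', 'value', 'unit') x AS count, value, unit
    GROUP BY p.type
),
-- Same steps for "content2".
c2 AS (
    SELECT p.type,
           collect_list(named_struct('count', x.count, 'value', int(x.value), 'unit', x.unit)) AS content2
    FROM parsed p
             LATERAL VIEW explode(split(regexp_replace(p.content2, '^\\[|\\]$', ''), '(?<=\\}),(?=\\{)')) e AS str_struct
             LATERAL VIEW json_tuple(e.str_struct, 'count', 'value', 'unit') x AS count, value, unit
    GROUP BY p.type
)
-- Assumption: type stands in for a real row key in this one-row sample.
SELECT c1.type, c1.content, c2.content2
FROM c1
         JOIN c2 ON c1.type = c2.type;
```

Unlike deduplicating after the fact, this keeps elements that legitimately repeat within one array, at the cost of a join.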

Content provided by Stack Overflow.