Accessing a JSON array nested in a structure using Spark

Question


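I want to perform arithmetic on a deeply nested structure that contains arrays, which requires access to various fields and subfields. Some of the data is actually stored in the field names themselves (the structure I have to read was created that way, and I cannot change it). In particular, a set of numbers appears as field names. I have to use those numbers, and they change from one JSON file to the next, so I must infer the field names dynamically and then use them together with the values in their subfields.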

I looked at this: Access names of fields in struct Spark SQL. Unfortunately, I do not know in advance what the field names of my structure will be, so I cannot use it.

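I also tried this, which looked promising: how to extract the column name and data type from nested struct type in spark. Unfortunately, whatever magic happens inside the "flatten" function, I could not adapt it to work on field names rather than the fields themselves.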

Here is a sample JSON dataset. It represents consumption baskets:

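Both baskets, "comp A" and "comp B", have a number of prices as subfields: compA.'55.80' is one price, compA.'132.88' is another, and so on.
I want to associate each unit price with the quantities available in its subfields: compA.'55.80'[0].comment.qty (500) and compA.'55.80'[1].comment.qty (600) should both be associated with 55.80. compA.'132.88'[0].comment.qty (700) should be associated with 132.88. And so on.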

{"type":"test","name":"john doe","products":{
    "baskets":{
        "comp A":{
            "55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
            ,"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}]
            ,"0.03":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
        }
        ,"comp B":{
            "55.70":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
            ,"132.98":[{"type":"fun","comment":{"qty":300,"text":"hello"}},{"type":"work","comment":{"qty":900,"text":"hello"}}]
            ,"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]
        }
    }
}}

To run some operations on them, I would like to get all of these numbers into a single DataFrame:

+--------+--------+----------+
| basket | price  | quantity |
+--------+--------+----------+
| comp A | 55.80  | 500      |
| comp A | 55.80  | 600      |
| comp A | 132.88 | 700      |
| comp A | 0.03   | 500      |
| comp A | 0.03   | 600      |
| comp B | 55.70  | 500      |
| comp B | 55.70  | 600      |
| comp B | 132.98 | 300      |
| comp B | 132.98 | 900      |
| comp B | 0.01   | 400      |
+--------+--------+----------+

The original dataset is accessed as follows:

scala> myDs
res135: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [products: struct<baskets: struct<compA: struct<55.80: array<struct .....

- sg1234

Here is a related question that may give you some guidance: https://stackoverflow.com/questions/52525013/how-to-query-nested-json-with-internal-arrays-in-spark-on-basis-of-equality-chec/52534617#52534617. You will need the explode function. - thebluephantom

Any progress on this? - thebluephantom

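I do know a lot about the schema: products.baskets.compA contains an unknown list of fields, all of which carry a price in their field name. Under each of those fields there is an array. I tried the fieldNames function by adapting the second link given in the description, which is a good starting point for dynamic schemas... - sg1234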
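The field-name idea mentioned in this comment can be sketched as follows. This is not code from the thread: the object `BasketFieldNames`, the helpers `structFieldNames` and `flatten`, and the `sample` value are hypothetical names introduced for illustration. The sketch assumes Spark 2.2 or later, and assumes the JSON reader infers every basket and every price as a struct field, as the schema printed in the question suggests. It reads the field names from the inferred schema, builds one small DataFrame per (basket, price) pair, and unions them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, lit}
import org.apache.spark.sql.types.StructType

object BasketFieldNames {
  // Sample record from the question, kept on one line per JSON document.
  val sample: String =
    """{"type":"test","name":"john doe","products":{"baskets":{""" +
    """"comp A":{"55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}],""" +
    """"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}],""" +
    """"0.03":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]},""" +
    """"comp B":{"55.70":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}],""" +
    """"132.98":[{"type":"fun","comment":{"qty":300,"text":"hello"}},{"type":"work","comment":{"qty":900,"text":"hello"}}],""" +
    """"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]}}}}"""

  // Hypothetical helper: names of the fields of the struct reached by following `path`.
  def structFieldNames(schema: StructType, path: Seq[String]): Seq[String] =
    path.foldLeft(schema)((s, name) => s(name).dataType.asInstanceOf[StructType]).fieldNames.toSeq

  // Hypothetical helper: one (basket, price, quantity) row per array entry.
  // Assumes at least one basket with at least one price exists (reduce fails on an empty list).
  def flatten(df: DataFrame): DataFrame = {
    val basketsPath = Seq("products", "baskets")
    val perPrice = for {
      basket <- structFieldNames(df.schema, basketsPath)
      price  <- structFieldNames(df.schema, basketsPath :+ basket)
    } yield {
      // getField takes the name literally, so dots and spaces in field names are not parsed.
      val entries = col("products").getField("baskets").getField(basket).getField(price)
      df.select(
          lit(basket).as("basket"),
          lit(price).cast("decimal(10,2)").as("price"), // the field name becomes data
          explode(entries).as("entry"))
        .select(col("basket"), col("price"), col("entry.comment.qty").as("quantity"))
    }
    perPrice.reduce(_ union _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("basket-field-names").getOrCreate()
    import spark.implicits._
    val df = spark.read.json(Seq(sample).toDS())
    flatten(df).show()
    spark.stop()
  }
}
```

On the sample data this should produce the ten rows asked for in the question, although row order is not guaranteed. Two caveats: a multi-line JSON file would need `spark.read.option("multiLine", true).json(path)`, and building one DataFrame per price may become slow when the inferred schema contains a very large number of distinct prices.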

From what I can see, this really is a first. Not sure how to approach it. - thebluephantom

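I have some leads, but I cannot solve this. Nothing can be inferred from field names that vary. I would suggest changing the format. I am very interested to see who can solve it. Good luck. - thebluephantom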


1 Answer


- thebluephantom · Accepted Answer


Treating column names as data is not an approach to follow. It simply does not work.

- thebluephantom

I have not given up entirely (I have no control over the data format, and the data already exists...). I will post updates as I make progress. - sg1234

Good for you, but it is really hard going. - thebluephantom


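This answer says it is impossible but does not explain why. In fact, I am fairly sure it is possible => I hope someone (maybe me) finds the real answer and posts it here to document it and help others. - sg1234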
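One possible direction, offered here only as a sketch and not taken from the thread, is to skip schema inference and hand Spark an explicit schema in which the baskets and the prices are declared as maps. The dynamic field names then arrive as map keys, which are ordinary data that `explode` can turn into rows. The object `BasketsAsMaps`, its `flatten` helper, and the abridged `sample` record are hypothetical; the sketch assumes Spark 2.2 or later and assumes every array entry has the `type`/`comment` shape shown in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types._

object BasketsAsMaps {
  // Abridged version of the question's sample record (two baskets, one price each).
  val sample: String =
    """{"type":"test","name":"john doe","products":{"baskets":{""" +
    """"comp A":{"55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]},""" +
    """"comp B":{"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]}}}}"""

  // Shape of one array element, assumed from the sample data.
  val entryType: StructType = new StructType()
    .add("type", StringType)
    .add("comment", new StructType().add("qty", LongType).add("text", StringType))

  // basket name -> (price as string -> entries): unknown keys are declared as maps, not structs.
  val schema: StructType = new StructType()
    .add("type", StringType)
    .add("name", StringType)
    .add("products", new StructType()
      .add("baskets", MapType(StringType, MapType(StringType, ArrayType(entryType)))))

  // Hypothetical helper: explode both map levels, then the array.
  def flatten(df: DataFrame): DataFrame =
    df.select(explode(col("products.baskets")).as(Seq("basket", "prices")))
      .select(col("basket"), explode(col("prices")).as(Seq("price", "entries")))
      .select(col("basket"), col("price").cast("decimal(10,2)").as("price"),
        explode(col("entries")).as("entry"))
      .select(col("basket"), col("price"), col("entry.comment.qty").as("quantity"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("baskets-as-maps").getOrCreate()
    import spark.implicits._
    val df = spark.read.schema(schema).json(Seq(sample).toDS())
    flatten(df).show()
    spark.stop()
  }
}
```

The design choice is that nothing has to be inferred from the field names: the price keys only need to be castable to a number. The trade-off is that the schema must be written by hand and has to match the real files.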

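Then I look forward to seeing an answer one day. I note there are many experts on this platform who are better than me, but none has responded so far. - thebluephantom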

Did you solve it? If so, please post the solution. - thebluephantom

How are things going? - thebluephantom