PySpark:从字符串类型列中读取嵌套的JSON并创建列

8

我有一个PySpark的数据框,其中包含3列 - json、date和object_id:

-----------------------------------------------------------------------------------------
|json                                                              |date      |object_id|
-----------------------------------------------------------------------------------------
|{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}|2020-08-01|xyz123   |
|{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}|2020-08-02|xyz123   |
|{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}            |2020-08-03|xyz123   |
-----------------------------------------------------------------------------------------

现在我有一个变量列表:[a.c.60, a.n.60, a.d, g.h]。我需要从上述DataFrame的JSON列中提取这些变量,并将这些变量作为列添加到DataFrame中,以及它们各自的值。
因此,最终DataFrame应该如下所示:
-------------------------------------------------------------------------------------------------------
|json                                                    |date      |object_id|a.c.60|a.n.60|a.d |g.h|
-------------------------------------------------------------------------------------------------------
|{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,...|2020-08-01|xyz123   |0     |null  |0.01|null|
|{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,...|2020-08-02|xyz123   |null  |0     |0.01|null|
|{'g':{'h':0,'j':{'k':0.005,'':0,'100':0},'d':0.01}}     |2020-08-03|xyz123   |null  |null  |0.02|0   |
-------------------------------------------------------------------------------------------------------

请帮忙获取这个结果数据框。我面临的主要问题是由于传入的JSON数据没有固定的结构。JSON数据可以是任何嵌套形式,但是我需要提取给定的四个变量。我已经在Pandas中通过展开JSON字符串并提取4个变量来实现了这一点,但在Spark中变得困难起来。
1个回答

15

有 2 种方法可以做到:

  1. 使用get_json_object函数,像这样:
import pyspark.sql.functions as F

df = spark.createDataFrame(['{"a":{"b":0,"c":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
                            '{"a":{"m":0,"n":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
                            '{"g":{"h":0,"j":{"50":0.005,"80":0,"100":0},"d":0.02}}'],
                           StringType())

df3 = df.select(F.get_json_object(F.col("value"), "$.a.c.60").alias("a_c_60"),
                F.get_json_object(F.col("value"), "$.a.n.60").alias("a_n_60"),
                F.get_json_object(F.col("value"), "$.a.d").alias("a_d"),
                F.get_json_object(F.col("value"), "$.g.h").alias("g_h"))

会给:

>>> df3.show()
+------+------+----+----+
|a_c_60|a_n_60| a_d| g_h|
+------+------+----+----+
|     0|  null|0.01|null|
|  null|     0|0.01|null|
|  null|  null|null|   0|
+------+------+----+----+
  1. 通过显式声明模式(仅包括必要字段),使用from_json函数将JSON转换为结构体,并从结构体中提取单个值 - 这可能比使用JSON Path更高效:
from pyspark.sql.types import *
import pyspark.sql.functions as F

aSchema = StructType([
    StructField("c", StructType([
        StructField("60", DoubleType(), True)
    ]), True),
    StructField("n", StructType([
        StructField("60", DoubleType(), True)
    ]), True),
    StructField("d", DoubleType(), True),
])
gSchema = StructType([
    StructField("h", DoubleType(), True)
])

schema = StructType([
    StructField("a", aSchema, True),
    StructField("g", gSchema, True)
])

df = spark.createDataFrame(['{"a":{"b":0,"c":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
                            '{"a":{"m":0,"n":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
                            '{"g":{"h":0,"j":{"50":0.005,"80":0,"100":0},"d":0.02}}'],
                           StringType())

df2 = df.select(F.from_json("value", schema=schema).alias('data')).select('data.*')
df2.select(df2.a.c['60'], df2.a.n['60'], df2.a.d, df2.g.h).show()

会给予

+------+------+----+----+
|a.c.60|a.n.60| a.d| g.h|
+------+------+----+----+
|   0.0|  null|0.01|null|
|  null|   0.0|0.01|null|
|  null|  null|null| 0.0|
+------+------+----+----+

谢谢Alex Ott,第一种方法对我有用。虽然我没有尝试第二种方法,但非常感谢你的帮助 :) - arin1405
嗨@Alex Ott如果数据框架是这样的(字典列表),那么代码将是什么: df = spark.createDataFrame(['[{"a":{"b":0,"c":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}]', '[{"a":{"m":0,"n":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}]', '[{"g":{"h":0,"j":{"50":0.005,"80":0,"100":0},"d":0.02}}]'], StringType()) - Suraj Tripathi

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接