How to create a schema for nested JSON columns in PySpark?

Question


json apache-spark pyspark schema pyspark-schema


I have a Parquet file with several columns. Two of them hold JSON/struct data, but their column type is string. Each array can contain any number of array_element entries.

{
  "addressline": [

    {
      "array_element": "F748DK’8U1P9’2ZLKXE"
    },
    {
      "array_element": "’O’P0BQ04M-"
    },
    {
      "array_element": "’fvrvrWEM-"
    }

  ],
  "telephone": [
    {
      "array_element": {
        "locationtype": "8.PLT",
        "countrycode": null,
        "phonenumber": "000000000",
        "phonetechtype": "1.PTT",
        "countryaccesscode": null,
        "phoneremark": null
      }
    }
  ]
}

How can I create a schema in PySpark to handle these columns?

- naga satish

1 Answer

- ZygD · Accepted Answer

Treating the example you provided as a string, I created this dataframe:

from pyspark.sql import functions as F, types as T
df = spark.createDataFrame([('{"addressline":[{"array_element":"F748DK’8U1P9’2ZLKXE"},{"array_element":"’O’P0BQ04M-"},{"array_element":"’fvrvrWEM-"}],"telephone":[{"array_element":{"locationtype":"8.PLT","countrycode":null,"phonenumber":"000000000","phonetechtype":"1.PTT","countryaccesscode":null,"phoneremark":null}}]}',)], ['c1'])

This is the schema to apply to that column:

schema = T.StructType([
    T.StructField('addressline', T.ArrayType(T.StructType([
        T.StructField('array_element', T.StringType())
    ]))),
    T.StructField('telephone', T.ArrayType(T.StructType([
        T.StructField('array_element', T.StructType([
            T.StructField('locationtype', T.StringType()),
            T.StructField('countrycode', T.StringType()),
            T.StructField('phonenumber', T.StringType()),
            T.StructField('phonetechtype', T.StringType()),
            T.StructField('countryaccesscode', T.StringType()),
            T.StructField('phoneremark', T.StringType()),
        ]))
    ])))
])

The result of passing the schema to the from_json function:

df = df.withColumn('c1', F.from_json('c1', schema))

df.show()
# +-------------------------------------------------------------------------------------------------------+
# |c1                                                                                                     |
# +-------------------------------------------------------------------------------------------------------+
# |{[{F748DK’8U1P9’2ZLKXE}, {’O’P0BQ04M-}, {’fvrvrWEM-}], [{{8.PLT, null, 000000000, 1.PTT, null, null}}]}|
# +-------------------------------------------------------------------------------------------------------+

df.printSchema()
# root
#  |-- c1: struct (nullable = true)
#  |    |-- addressline: array (nullable = true)
#  |    |    |-- element: struct (containsNull = true)
#  |    |    |    |-- array_element: string (nullable = true)
#  |    |-- telephone: array (nullable = true)
#  |    |    |-- element: struct (containsNull = true)
#  |    |    |    |-- array_element: struct (nullable = true)
#  |    |    |    |    |-- locationtype: string (nullable = true)
#  |    |    |    |    |-- countrycode: string (nullable = true)
#  |    |    |    |    |-- phonenumber: string (nullable = true)
#  |    |    |    |    |-- phonetechtype: string (nullable = true)
#  |    |    |    |    |-- countryaccesscode: string (nullable = true)
#  |    |    |    |    |-- phoneremark: string (nullable = true)