如何在PySpark中创建包含结构列的数据框，而无需指定模式？

Question

如何在PySpark中创建包含结构列的数据框，而无需指定模式？

apache-sparkpysparkstructapache-spark-sqlpyspark-schema

3

我正在学习PySpark，能够快速创建示例数据框架以尝试PySpark API的功能非常方便。

以下代码（其中spark是一个Spark会话）：

import pyspark.sql.types as T
df = [{'id': 1, 'data': {'x': 'mplah', 'y': [10,20,30]}},
      {'id': 2, 'data': {'x': 'mplah2', 'y': [100,200,300]}},
]
df = spark.createDataFrame(df)
df.printSchema()

提供一张地图（并没有正确解释数组）：

root
 |-- data: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- id: long (nullable = true)

我需要一个结构体。如果我提供一个模式，我可以强制使用结构体：

import pyspark.sql.types as T
df = [{'id': 1, 'data': {'x': 'mplah', 'y': [10,20,30]}},
      {'id': 2, 'data': {'x': 'mplah2', 'y': [100,200,300]}},
]
schema = T.StructType([
    T.StructField('id', LongType()),
    T.StructField('data', StructType([
        StructField('x', T.StringType()),
        StructField('y', T.ArrayType(T.LongType())),
    ]) )
])
df = spark.createDataFrame(df, schema=schema)
df.printSchema()

这确实给出了：

root
 |-- id: long (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: array (nullable = true)
 |    |    |-- element: long (containsNull = true)

但是这样键入太多了。

有没有其他快速创建dataframe的方法，使数据列成为结构体而不需要指定模式？

- karpan

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- ZygD · Accepted Answer

在创建示例dataframe时，您可以使用Python的元组，这将被转换为Spark的结构体。但是，这种方式不能指定结构体字段名称。

df = spark.createDataFrame(
    [(1, ('mplah', [10,20,30])),
     (2, ('mplah2', [100,200,300]))],
    ['id', 'data']
)
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- _1: string (nullable = true)
#  |    |-- _2: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)

使用这种方法，您可能想要添加模式:

df = spark.createDataFrame(
    [(1, ('mplah', [10,20,30])),
     (2, ('mplah2', [100,200,300]))],
    'id: bigint, data: struct<x:string,y:array<bigint>>'
)
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)

然而，我常常喜欢采用使用struct的方法。这种方法不提供详细的模式，而是从列名中获取结构字段名。

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'mplah', [10,20,30]),
     (2, 'mplah2', [100,200,300])],
    ['id', 'x', 'y']
)
df = df.select('id', F.struct('x', 'y').alias('data'))

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = false)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)