Flatten nested arrays in a Spark DataFrame


I am reading some JSON data:

{"a": [{"b": {"c": 1, "d": 2}}]}

That is, the array items are unnecessarily nested. Because this happens inside an array, the answers given in "How to flatten a struct in a Spark dataframe?" do not apply directly.

When parsed, the DataFrame looks like this:

root
|-- a: array
|    |-- element: struct
|    |    |-- b: struct
|    |    |    |-- c: integer
|    |    |    |-- d: integer

I would like to transform the DataFrame into the following format:

root
|-- a: array
|    |-- element: struct
|    |    |-- b_c: integer
|    |    |-- b_d: integer

How can I split out and alias the columns inside the array to effectively unnest it?

3 Answers

You can use transform:
df2 = df.selectExpr("transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a")
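
As a quick sanity check, here is a sketch of that expression applied to the sample record from the question; it assumes an existing SparkSession named spark, and transform requires Spark 2.4 or later:

# Parse the single sample record shown in the question.
sample = '{"a": [{"b": {"c": 1, "d": 2}}]}'
df = spark.read.json(spark.sparkContext.parallelize([sample]))

df2 = df.selectExpr("transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a")
df2.printSchema()
# The element struct now exposes b_c and b_d directly; JSON inference
# reads the numbers as long rather than integer (nullable flags omitted):
# root
# |-- a: array
# |    |-- element: struct
# |    |    |-- b_c: long
# |    |    |-- b_d: long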

A simplified approach:
from pyspark.sql.functions import col

def flatten_df(nested_df):
    # Iteratively walk the schema: each stack entry pairs the tuple of parent
    # field names with the DataFrame projected down to that level.
    stack = [((), nested_df)]
    columns = []

    while len(stack) > 0:
        parents, df = stack.pop()

        # Non-struct columns: select by dotted path, alias with underscores.
        flat_cols = [
            col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
            for c in df.dtypes
            if c[1][:6] != "struct"
        ]

        # Struct columns are expanded and pushed back onto the stack.
        nested_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:6] == "struct"
        ]

        columns.extend(flat_cols)

        for nested_col in nested_cols:
            projected_df = df.select(nested_col + ".*")
            stack.append((parents + (nested_col,), projected_df))

    return nested_df.select(columns)

Reference: https://learn.microsoft.com/zh-cn/azure/synapse-analytics/how-to-analyze-complex-schema
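
A usage sketch with a hypothetical struct column is below; note that this helper only descends into struct columns, so an array of structs as in the question is left as-is and still needs transform (first answer) or the recursive version in the next answer:

# Hypothetical example: a single struct column s with fields c and d.
df = spark.createDataFrame([((1, 2),)], "s struct<c: int, d: int>")
flatten_df(df).printSchema()
# root
# |-- s_c: integer
# |-- s_d: integer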



Using the approach proposed in the accepted answer, I wrote a function that flattens a DataFrame recursively (recursing into nested arrays as well):

from pyspark.sql.types import ArrayType, StructType

def flatten(df, sentinel="x"):
    # Build one SQL expression that rebuilds the row as a struct, renaming
    # nested fields to parent_child and wrapping arrays of structs in
    # transform() so their elements are flattened in place.
    def _gen_flatten_expr(schema, indent, parents, last, transform=False):
        def handle(field, last):
            path = parents + (field.name,)
            alias = (
                " as "
                + "_".join(path[1:] if transform else path)
                + ("," if not last else "")
            )
            if isinstance(field.dataType, StructType):
                # Nested struct: recurse with the extended dotted path.
                yield from _gen_flatten_expr(
                    field.dataType, indent, path, last, transform
                )
            elif (
                isinstance(field.dataType, ArrayType) and
                isinstance(field.dataType.elementType, StructType)
            ):
                # Array of structs: emit a transform() whose lambda rebuilds
                # each element as a flattened struct.
                yield indent, "transform("
                yield indent + 1, ".".join(path) + ","
                yield indent + 1, sentinel + " -> struct("
                yield from _gen_flatten_expr(
                    field.dataType.elementType,
                    indent + 2,
                    (sentinel,),
                    True,
                    True
                )
                yield indent + 1, ")"
                yield indent, ")" + alias
            else:
                # Leaf column: dotted path with an underscore alias.
                yield (indent, ".".join(path) + alias)

        try:
            *fields, last_field = schema.fields
        except ValueError:
            # Empty struct: nothing to emit.
            pass
        else:
            for field in fields:
                yield from handle(field, False)
            yield from handle(last_field, last)

    # Render the generated (indent, text) pairs as a pretty-printed expression,
    # evaluate it, and unpack the wrapper struct again.
    lines = []
    for indent, line in _gen_flatten_expr(df.schema, 0, (), True):
        spaces = " " * 4 * indent
        lines.append(spaces + line)

    expr = "struct(" + "\n".join(lines) + ") as " + sentinel
    return df.selectExpr(expr).select(sentinel + ".*")
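
Applied to the question's schema (a sketch built with a hypothetical one-row DataFrame, assuming Spark 2.4+ for transform), this produces the desired layout:

df = spark.createDataFrame(
    [([((1, 2),)],)],
    "a array<struct<b: struct<c: int, d: int>>>"
)
flatten(df).printSchema()
# root
# |-- a: array
# |    |-- element: struct
# |    |    |-- b_c: integer
# |    |    |-- b_d: integer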

This did absolutely nothing for me. - Garglesoap
@Garglesoap, could you boil your problem down to a short example you can share here? - malthe
Sorry, I was frustrated. I found a solution like this: newdf = result.withColumn("sentiment", explode("sentiment")).select("*", col("sentiment.*")).drop("document","sentence","tokens","word_embeddings","sentence_embeddings","sentiment") - Garglesoap
