Finding nested columns in a Pandas dataframe

Question

9

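I have a large dataset with many columns stored in (compressed) JSON format. I want to convert it to Parquet for later processing. Some of the columns have a nested structure. For now, I want to ignore that structure and simply write those columns out as (JSON) strings.

So, for the columns I have already identified, I am doing the following: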

df[column] = df[column].astype(str)

However, I am not sure which columns are nested and which are not. When I write with the Parquet writer, I get the following message:

<stack trace redacted> 

  File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>

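This means I failed to convert one of the nested object columns to a string. But which column is at fault? How do I find out?

When I print the .dtypes of the Pandas dataframe, I cannot tell strings and nested values apart, because both show up as object.

Edit: the error does hint at the nested column by showing the struct details, but tracking it down this way takes a lot of debugging time. It also only prints the first error, so if you have several nested columns this gets very tedious.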

- Daniel Kats

2

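When you say nested columns, do you mean any column that contains Python objects (list, dict, etc.)? And you want to convert them to strings? - dsmilo

It seems that some columns in your dataframe contain C objects that pyarrow.parquet.write_table cannot handle. "Nested column" is Parquet terminology and does not mean much for a "pandas dataframe". Please define these terms clearly. - gdlmx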

1

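Maybe you can use df.applymap(type) to get the type of every cell in the dataframe. df.applymap(type).eq(dict).any() returns, for each column, True if at least one cell in that column is a dict. We can then use that boolean result to select the corresponding columns. - ansev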
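The idea in the comment above can be sketched as follows. The sample dataframe, the column names, and the NESTED_TYPES tuple are assumptions made up for illustration. The sketch uses Series.map per column rather than applymap, since applymap was deprecated in favour of DataFrame.map in pandas 2.1.

```python
import pandas as pd

# Hypothetical sample data: two nested columns, two flat ones.
df = pd.DataFrame({
    "geo": [{"type": "Point"}, None],
    "tags": [["a", "b"], ["c"]],
    "name": ["x", "y"],
    "n": [1, 2],
})

# Assumption: these are the Python types we treat as "nested".
NESTED_TYPES = (dict, list, tuple, set)

# One boolean per column: True if at least one cell holds a nested object.
is_nested = df.apply(
    lambda col: col.map(lambda v: isinstance(v, NESTED_TYPES)).any()
)

# Keep only the column names where the flag is True.
nested_cols = is_nested[is_nested].index.tolist()
print(nested_cols)
```

Unlike a check on the first row only, this looks at every cell, so a column whose first value is missing is still detected.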

1

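@ansev I had to deal with a streaming dataset from the Outlook API where the data kept changing, sometimes with nested columns and sometimes without. Your approach is very similar to mine. - Umar.H

4 Answers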

1

Using a general-purpose utility function in pandas such as infer_dtype(), you can determine whether a column is nested.

from pandas.api.types import infer_dtype

for col in df.columns:
    if infer_dtype(df[col]) == 'mixed':
        # 'mixed' is the catchall for anything that is not otherwise specialized
        df[col] = df[col].astype('str')

If you are targeting specific data types, see Dtype Introspection.
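To get a feel for what the 'mixed' check catches, here is a minimal sketch that prints the label infer_dtype assigns to a few hand-made columns. The sample Series are assumptions for illustration, not data from the question.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical columns covering nested and non-nested cases.
samples = {
    "dicts": pd.Series([{"a": 1}, {"b": 2}]),
    "lists": pd.Series([[1, 2], [3]]),
    "strings": pd.Series(["x", "y"]),
    "ints": pd.Series([1, 2]),
    "ints_and_strings": pd.Series([1, "x"]),
}

# Map each column name to the label pandas infers for its values.
labels = {name: infer_dtype(s) for name, s in samples.items()}

for name, label in labels.items():
    print(f"{name}: {label}")
```

Columns of dicts or lists come back as 'mixed', while plain string and integer columns get their own labels. Because 'mixed' is a catch-all, it may also flag object columns that are not nested in the Parquet sense, so it is worth inspecting the flagged columns before converting them.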

- Saurabh P Bhandari

1

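I ran into a similar problem when working with PySpark and a streaming dataset: some columns were nested and others were not.

Suppose your dataframe looks something like this: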

import pandas as pd

df = pd.DataFrame({'A' : [{1 : [1,5], 2 : [15,25], 3 : ['A','B']}],
                   'B' : [[[15,25,61],[44,22,87],['A','B',44]]],
                   'C' : [((15,25,87),(22,91))],
                   'D' : 15,
                   'E' : 'A'
                  })


print(df)

                                         A  \
0  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B                         C   D  E  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]  ((15, 25, 87), (22, 91))  15  A

We can stack your dataframe and use apply with type to get the type of each column, then convert the result to a dictionary.

df.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
out:
{'A': dict, 'B': list, 'C': tuple, 'D': int, 'E': str}

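Using this approach, we can write a function that returns a tuple of the nested and unnested columns. Note that it only inspects the first row.

Function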

def find_types(dataframe):

    col_dict = dataframe.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
    unnested_columns = [k for (k,v) in col_dict.items() if v not in (dict,set,list,tuple)]
    nested_columns = list(set(col_dict.keys()) - set(unnested_columns))
    return nested_columns,unnested_columns

In action:

nested, unnested = find_types(df)

df[unnested]

    D  E
0  15  A

print(df[nested])

                          C                                        A  \
0  ((15, 25, 87), (22, 91))  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]

- Umar.H

0

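If you only want to find out which columns are the culprits, just write a loop that writes the columns one at a time and records which ones fail...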

bad_cols = []
for i in range(df.shape[1]):
    try:
        df.iloc[:, [i]].to_parquet(...)
    except KeyboardInterrupt:
        raise
    except Exception:  # you may want to catch ArrowInvalid exceptions instead
        bad_cols.append(i)
print(bad_cols)
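A slightly more reusable variant of the loop above might look like the sketch below. The helper names find_failing_columns and write_parquet_in_memory are hypothetical, and writing to an in-memory io.BytesIO buffer is an assumption chosen so that no files are created. If no Parquet engine is installed, every column will be reported with an ImportError, so the exception name is recorded alongside the column name.

```python
import io

import pandas as pd


def find_failing_columns(df, write_one):
    """Write each column on its own and collect the ones that raise.

    `write_one` is any callable that takes a single-column DataFrame.
    Returns a list of (column name, exception class name) pairs.
    """
    bad_cols = []
    for i, name in enumerate(df.columns):
        try:
            # iloc keeps this safe even if column names are duplicated.
            write_one(df.iloc[:, [i]])
        except Exception as exc:  # KeyboardInterrupt is not caught here
            bad_cols.append((name, type(exc).__name__))
    return bad_cols


def write_parquet_in_memory(frame):
    # Assumption: discard the output, we only care whether writing fails.
    frame.to_parquet(io.BytesIO())


# Hypothetical data: "mixed" holds both an int and a str in one column.
sample = pd.DataFrame({"ok": [1, 2], "mixed": [1, "x"]})
print(find_failing_columns(sample, write_parquet_in_memory))
```

Unlike the error message from a full write, this reports every failing column in one pass rather than only the first one.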

- Aaron

- gdlmx · Accepted Answer

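Converting nested structures to strings

If I understand your question correctly, you want to serialize the nested Python objects (lists, dicts) in df to JSON strings and leave the other elements unchanged. It is best to write your own conversion method: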

import json

def json_serializer(obj):
    # isinstance needs a tuple of types; add any other types you consider nested
    if isinstance(obj, (list, dict)):
        return json.dumps(obj)
    return obj

df = df.applymap(json_serializer)

If the dataframe is large, using astype(str) will be faster.

nested_cols = []
for c in df:
    # isinstance needs a tuple of types, not a list
    if any(isinstance(obj, (list, dict)) for obj in df[c]):
        nested_cols.append(c)

for c in nested_cols:
    df[c] = df[c].astype(str) # this converts every element in the column, regardless of its type

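This approach has a performance advantage thanks to short-circuit evaluation in any(...): it returns as soon as it hits the first nested object in a column and does not waste time checking the rest. If any of the "Dtype Introspection" methods suits your data, using one will be faster still.

Check the latest version of pyarrow

I assume those nested structures need to be converted to strings only because they cause errors in pyarrow.parquet.write_table. Maybe you do not need to convert them at all, because the issue with handling nested columns in pyarrow was reportedly resolved recently (March 29, 2020, ver 0.17.0). Support may still be problematic, however, and is under active discussion.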
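One trade-off between the two approaches is worth illustrating: str() (which is what astype(str) applies to each element) produces the Python repr of an object, whereas json.dumps produces text that can be parsed back as JSON. The sample object and the helper is_valid_json below are assumptions for illustration.

```python
import json

# Hypothetical nested value, shaped like the struct in the error message.
obj = {"type": "Point", "coordinates": [1.0, 2.0]}

as_repr = str(obj)         # Python repr, uses single quotes
as_json = json.dumps(obj)  # valid JSON, uses double quotes


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(as_repr, is_valid_json(as_repr))  # repr does not parse as JSON
print(as_json, is_valid_json(as_json))  # dumps output round-trips
```

If the strings need to be read back as JSON later, the json.dumps route is the safer choice, at the cost of the speed advantage described above.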