Converting a table directly into a tree structure with pandas

Question

python python-3.x pandas hdf5

3

I want to convert this CSV file:

into an HDF5 file with the following structure:

I am using pandas. Is there a simple way to do this?

- Artur Müller Romanov

Have you looked at this in pandas? - gyx-hh

1

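I am reading through it, but I can't seem to find what I'm looking for. - Artur Müller Romanov

@ArturMüllerRomanov, it looks like all you need is a nested dictionary. Why store it in HDF5? HDF5 is typically used for large data or for portability. - jpp

@jpp Do you think that is the wrong approach? My task is to count all datasets in another HDF5 file that belong to a, b, or c. However, that information is provided in the CSV file above. So I want to convert the CSV file to HDF5 and merge the two HDF5 files. - Artur Müller Romanov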

1

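In my experience, the purpose of HDF5 is storage (for out-of-memory computation and portability). The computation itself should, where possible, be performed in memory with pandas, numpy, and so on. I don't know how large your data is, so I can't say which approach suits you. - jpp

2 Answers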

3

Thanks, I will take a look at defaultdict. My solution is probably more hacky, but in case anyone needs something customizable:

import pandas as pd

df = pd.DataFrame([['A', 'a', 'a1'],
                   ['A', 'a', 'a2'],
                   ['A', 'b', 'b1'],
                   ['A', 'b', 'b2'],
                   ['A', 'c', 'c1'],
                   ['A', 'c', 'c2']],
                  columns=['col1', 'col2', 'col3'])

cols = ['col1', 'col2', 'col3']
# children[col][value] -> set of values directly below `value` in the next column
# parent[col][value]   -> the value directly above `value` in the previous column
children = {p: {} for p in cols}
parent = {p: {} for p in cols}

for _, row in df.iterrows():
    # Link each pair of adjacent columns in this row
    for i in range(len(cols) - 1):
        _parent = row[cols[i]]
        _child = row[cols[i + 1]]

        parent[cols[i + 1]][_child] = _parent
        # Create the child set on first sight of _parent, then add to it
        children[cols[i]].setdefault(_parent, set()).add(_child)

Result:

    parent =
    {'col1': {},
     'col2': {'a': 'A', 'b': 'A', 'c': 'A'},
     'col3': {'a1': 'a', 'a2': 'a', 'b1': 'b', 'b2': 'b', 'c1': 'c', 'c2': 'c'}}

You can then move up and down the hierarchy.
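
As a rough illustration of moving up the hierarchy, the sketch below walks the `parent` mapping from the result above, starting at a leaf and ending at the root. The helper name `ancestors` is hypothetical and not part of the original answer, and the sketch assumes each value appears in only one column.

```python
# The `parent` mapping copied from the result above
parent = {
    'col1': {},
    'col2': {'a': 'A', 'b': 'A', 'c': 'A'},
    'col3': {'a1': 'a', 'a2': 'a', 'b1': 'b', 'b2': 'b', 'c1': 'c', 'c2': 'c'},
}
cols = ['col1', 'col2', 'col3']


def ancestors(value, col):
    """Hypothetical helper: return the parents of `value`, nearest first."""
    chain = []
    # Step leftwards one column at a time, looking up the parent at each level
    for c in reversed(cols[1:cols.index(col) + 1]):
        value = parent[c][value]
        chain.append(value)
    return chain


print(ancestors('b2', 'col3'))
```

Moving down works the same way with the `children` mapping, which stores a set of child values per parent.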

- Cello4ever

- jpp · Accepted Answer

You can use collections.defaultdict to build a nested dictionary:

from collections import defaultdict
import pandas as pd

# read csv file
# df = pd.read_csv('input.csv', header=None)

df = pd.DataFrame([['A', 'a', 'a1'],
                   ['A', 'a', 'a2'],
                   ['A', 'b', 'b1'],
                   ['A', 'b', 'b2'],
                   ['A', 'c', 'c1'],
                   ['A', 'c', 'c2']],
                  columns=['col1', 'col2', 'col3'])

d = defaultdict(lambda: defaultdict(list))

for row in df.itertuples():
    d[row[1]][row[2]].append(row[3])

Result:

defaultdict(<function __main__.<lambda>>,
            {'A': defaultdict(list,
                         {'a': ['a1', 'a2'],
                          'b': ['b1', 'b2'],
                          'c': ['c1', 'c2']})})
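
The nested `defaultdict` above is fixed at three columns. If the table might have a different number of columns, one possible generalization is sketched below using plain dicts. The function name `rows_to_tree` is hypothetical, and the sketch assumes every row has at least two values, with the last value treated as the leaf.

```python
def rows_to_tree(rows):
    """Hypothetical helper: nest leading values as dict keys, collect leaves in lists."""
    tree = {}
    for row in rows:
        *path, leaf = row
        node = tree
        # Walk (or create) one dict level for every column except the last two
        for key in path[:-1]:
            node = node.setdefault(key, {})
        # The second-to-last column maps to the list of leaf values
        node.setdefault(path[-1], []).append(leaf)
    return tree


rows = [('A', 'a', 'a1'), ('A', 'a', 'a2'),
        ('A', 'b', 'b1'), ('A', 'b', 'b2'),
        ('A', 'c', 'c1'), ('A', 'c', 'c2')]
tree = rows_to_tree(rows)
print(tree)
```

With the DataFrame from the answer, the rows could presumably be supplied as `df.itertuples(index=False)`, so that the index is not treated as the first level.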