Loop through pandas DataFrame columns and append to a dictionary of sets?

Question

3

I want to loop over a pandas DataFrame with 10 million rows and add its values to an existing dictionary of sets.

For example, given a dictionary like this:

x = {10: {1, 2, 3, 5}, 12: {6, 7, 8, 9, 10}}

and a DataFrame like this:

import pandas as pd

d = {'ID': [10, 10, 10, 12, 12, 12], 'Another_ID': [1, 4, 6, 6, 7, 13]}
df = pd.DataFrame(data=d)

ID   Another_ID

10   1
10   4
10   6
12   6
12   7
12   13

I want to go through it row by row and, for each ID, add any new values it has "not yet seen". The result I am after is:

x = {10: {1, 2, 3, 4, 5, 6}, 12: {6, 7, 8, 9, 10, 13}}

I have tried iterating with the following simple loop:

for i in df[['ID', 'Another_ID']].values():
    dict[i[0]].add(i[1])

I can add a value manually like this, but I cannot get it to work inside the loop!

  dict[10].add(6)

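If anyone knows how to loop over these two pandas columns and add the new values to the sets, please let me know! Keep in mind that with 10 million rows it has to be reasonably fast. Thanks!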

- BoomBoxBoy

3 Answers

1

One way is to use pandas' explode function:

# Series.append was removed in pandas 2.0; pd.concat does the same job here
out = pd.concat([pd.Series(x).map(list).explode(), df.set_index('ID')['Another_ID']]).groupby(level=0).agg(set).to_dict()
Out[361]: {10: {1, 2, 3, 4, 5, 6}, 12: {6, 7, 8, 9, 10, 13}}
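
The one-liner above packs several steps together. As a rough sketch, it can be unrolled as follows. The intermediate names (existing, incoming, combined) are made up for illustration, and pd.concat is used to join the two Series because Series.append is no longer available in pandas 2.0 and later.

```python
import pandas as pd

x = {10: {1, 2, 3, 5}, 12: {6, 7, 8, 9, 10}}
df = pd.DataFrame({'ID': [10, 10, 10, 12, 12, 12],
                   'Another_ID': [1, 4, 6, 6, 7, 13]})

# One row per (ID, value) pair that is already stored in x
existing = pd.Series(x).map(list).explode()

# The DataFrame values, indexed by ID so both Series share the same key
incoming = df.set_index('ID')['Another_ID']

# Stack both Series, then collapse each ID back into a single set
combined = pd.concat([existing, incoming])
out = combined.groupby(level=0).agg(set).to_dict()
```

Because every value is first exploded into its own row, this approach probably uses more memory than merging sets directly when x is already large.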

- BENY

1

You can collect the DataFrame's values into a defaultdict keyed by ID, then iterate over the existing dictionary to build the final output:

from collections import defaultdict

dd = defaultdict(list)

for ID, another_ID in zip(df.ID, df.Another_ID):
    dd[ID].append(another_ID)

dd

defaultdict(list, {10: [1, 4, 6], 12: [6, 7, 13]})

Final result:

{key: value.union(dd[key]) for key, value in x.items()}

{10: {1, 2, 3, 4, 5, 6}, 12: {6, 7, 8, 9, 10, 13}}
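
If the intermediate defaultdict is not needed, the same zip-based loop can update x directly. This is only a sketch under the question's sample data. Using dict.setdefault is an assumption about the desired behavior: it creates an empty set for any ID that is missing from x, whereas the comprehension above keeps only the keys already in x.

```python
import pandas as pd

x = {10: {1, 2, 3, 5}, 12: {6, 7, 8, 9, 10}}
df = pd.DataFrame({'ID': [10, 10, 10, 12, 12, 12],
                   'Another_ID': [1, 4, 6, 6, 7, 13]})

# tolist() converts each column to plain Python ints once, up front,
# which is usually cheaper than iterating DataFrame rows one by one
ids = df['ID'].tolist()
values = df['Another_ID'].tolist()

for key, value in zip(ids, values):
    # Adding to a set is a no-op when the value is already present
    x.setdefault(key, set()).add(value)
```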

- sammywemmy


- cs95 · Accepted Answer

You can use groupby and agg to convert df into the same format as x:

x2 = df.groupby('ID')['Another_ID'].agg(set).to_dict()
print(x2)
# {10: {1, 4, 6}, 12: {6, 7, 13}}

Now we can merge the two dictionaries with a simple dictionary comprehension:

x3 = {k: x.get(k, set()) | x2.get(k, set()) for k in x}
print(x3)
# {10: {1, 2, 3, 4, 5, 6}, 12: {6, 7, 8, 9, 10, 13}}

Alternatively, merge in place (this makes more sense if x is large and x2 is small):

for k in x2:
    x[k] = x2[k] | x.get(k, set())

print(x)
# {10: {1, 2, 3, 4, 5, 6}, 12: {6, 7, 8, 9, 10, 13}}

The | operator computes the union of its two set operands.
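
One detail worth illustrating: the comprehension iterates over the keys of x only, so an ID that appears only in x2 would be dropped, while the in-place loop would keep it. The sketch below uses a hypothetical extra key (ID 15, not part of the question's data) to show a variant that covers the keys of both dictionaries, plus an in-place version based on set.update.

```python
x = {10: {1, 2, 3, 5}, 12: {6, 7, 8, 9, 10}}
# ID 15 is a hypothetical key added only to illustrate the difference
x2 = {10: {1, 4, 6}, 12: {6, 7, 13}, 15: {2}}

# New dictionary covering every key found in either input;
# dict key views support | just like sets do
merged = {k: x.get(k, set()) | x2.get(k, set()) for k in x.keys() | x2.keys()}

# In-place alternative: update() mutates the existing sets rather than
# building new ones, and setdefault() handles keys missing from x
for k, new_values in x2.items():
    x.setdefault(k, set()).update(new_values)
```

After both steps, merged and x hold the same contents. The in-place loop avoids allocating a new set per key, which may matter when x is large.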