How to loop over a grouped Pandas dataframe?

Question

291

DataFrame:

  c_os_family_ss c_os_major_is l_customer_id_i
0      Windows 7                         90418
1      Windows 7                         90418
2      Windows 7                         90418

Code:

print df
for name, group in df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)):
    print name
    print group

I only want to loop over the aggregated data, but I get this error:

ValueError: too many values to unpack

@EdChum, here is the expected output:

                                                    c_os_family_ss  \
l_customer_id_i
131572           Windows 7,Windows 7,Windows 7,Windows 7,Window...
135467           Windows 7,Windows 7,Windows 7,Windows 7,Window...

                                                     c_os_major_is
l_customer_id_i
131572           ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...
135467           ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...

The output is not the problem; I want to be able to loop over every group.

- Tjorriemorrie

4 Answers

135

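Here is an example of iterating over a pd.DataFrame grouped by the column atable. In this example, the "create" statements for an SQL database are generated inside the for loop: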

import pandas as pd

df1 = pd.DataFrame({
    'atable':     ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
    'column':     ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
    'column_type':['varchar', 'varchar', 'int', 'varchar', 'varchar'],
    'is_null':    ['No', 'No', 'Yes', 'No', 'Yes'],
})

df1_grouped = df1.groupby('atable')

# iterate over each group
for group_name, df_group in df1_grouped:
    print('\nCREATE TABLE {}('.format(group_name))

    for row_index, row in df_group.iterrows():
        col = row['column']
        column_type = row['column_type']
        is_null = 'NOT NULL' if row['is_null'] == 'No' else ''
        print('\t{} {} {},'.format(col, column_type, is_null))

    print(");")
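The loop above prints a comma after every column, including the last one, which most SQL dialects reject. A minimal sketch of one way around that, assuming the same df1 as above: collect the column definitions in a list and join them. The names statements, column_defs and null_clause are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'atable':      ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
    'column':      ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
    'column_type': ['varchar', 'varchar', 'int', 'varchar', 'varchar'],
    'is_null':     ['No', 'No', 'Yes', 'No', 'Yes'],
})

statements = {}  # hypothetical container: table name -> CREATE statement
for group_name, df_group in df1.groupby('atable'):
    column_defs = []
    for row in df_group.itertuples(index=False):
        # Same rule as the loop above: is_null == 'No' means NOT NULL.
        null_clause = ' NOT NULL' if row.is_null == 'No' else ''
        column_defs.append('\t{} {}{}'.format(row.column, row.column_type, null_clause))
    # Joining avoids a trailing comma after the last column.
    statements[group_name] = 'CREATE TABLE {}(\n{}\n);'.format(
        group_name, ',\n'.join(column_defs))

for statement in statements.values():
    print(statement)
```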

- Andrei Sura

21

Thanks for showing how to iterate over an individual group with for row, data in group.iterrows()! - tatlar

2

Make sure you read this related post - https://dev59.com/h2Qn5IYBdhLWcg3w5qlg#55557758 - Andrei Sura

36

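If your dataframe has already been created, you can iterate over the index values.

# the aggregated result is a plain DataFrame indexed by l_customer_id_i
df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))
for name in df.index:
    print(name)
    print(df.loc[name])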
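An equivalent way to walk the aggregated result is iterrows(), which yields the index label together with the row. The sketch below assumes a small made-up sample shaped like the question's data; agg_df and seen are illustrative names, not part of the original answer.

```python
import pandas as pd

# Made-up sample in the shape of the question's dataframe.
df = pd.DataFrame({
    'c_os_family_ss': ['Windows 7', 'Windows 7', 'Linux'],
    'c_os_major_is': ['', '', ''],
    'l_customer_id_i': [90418, 90418, 131572],
})

agg_df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))

seen = []
for name, row in agg_df.iterrows():
    # name is the customer id, row is a Series holding the joined strings.
    seen.append((name, row['c_os_family_ss']))
    print(name, row['c_os_family_ss'])
```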

- khiner

1

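Looping over a groupby object

When you call groupby on a DataFrame, you create a pandas.core.groupby.generic.DataFrameGroupBy object (a SeriesGroupBy for a Series). It defines an __iter__() method, so it can be iterated over like any other object that defines that method, and it can be cast into a list, tuple, iterator and so on. Each iteration returns a tuple whose first element is the grouping key and whose second element is the dataframe created by the grouping; you can think of it as iterating over dict_items, where each item is a key-value tuple. Unless you select one or more columns on the groupby object, it returns all columns of the dataframe. The output of the following code illustrates this.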

import pandas as pd
from IPython.display import display

df = pd.DataFrame({
    'A': ['g1', 'g1', 'g2', 'g1'],
    'B': [1, 2, 3, 4],
    'C': ['a', 'b', 'c', 'd']
})

grouped = df.groupby('A')

list(grouped)         # OK
dict(iter(grouped))   # OK

for x in grouped:
    print(f"    Type of x: {type(x).__name__}\n  Length of x: {len(x)}")
    print(f"Value of x[0]: {x[0]}\n Type of x[1]: {type(x[1]).__name__}")
    display(x[1])
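The remark about column selection can be checked with a short sketch. It assumes the same df as above; all_columns, selected_columns and piece_types are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['g1', 'g1', 'g2', 'g1'],
    'B': [1, 2, 3, 4],
    'C': ['a', 'b', 'c', 'd'],
})
grouped = df.groupby('A')

# No selection: each yielded dataframe carries the columns of df.
all_columns = {key: list(group.columns) for key, group in grouped}

# Selecting with a list restricts each yielded dataframe to those columns.
selected_columns = {key: list(group.columns) for key, group in grouped[['B']]}

# Selecting a single label yields a Series per group instead of a DataFrame.
piece_types = {key: type(group).__name__ for key, group in grouped['B']}

print(all_columns)
print(selected_columns)
print(piece_types)
```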

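A very useful use case for looping over a groupby object is splitting a dataframe into separate files. For example, the following creates two csv files (g_0.csv and g_1.csv) from a dataframe.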

for i, (k, g) in enumerate(df.groupby('A')):
    g.to_csv(f"g_{i}.csv")
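A variation on the same idea, sketched under the assumption that the group keys are safe to use as file names: name each file after its key rather than a counter. The temporary directory and the names out_dir and written are only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'A': ['g1', 'g1', 'g2', 'g1'],
    'B': [1, 2, 3, 4],
    'C': ['a', 'b', 'c', 'd'],
})

out_dir = Path(tempfile.mkdtemp())  # throwaway directory for the sketch
written = []
for key, group in df.groupby('A'):
    path = out_dir / f"{key}.csv"
    group.to_csv(path, index=False)  # index=False leaves the row index out of the file
    written.append(path.name)

print(written)
```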

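Looping over a grouped dataframe

As mentioned above, a groupby object splits a dataframe into several dataframes by a key, so you can loop over each grouped dataframe like any other dataframe. See this answer for a comprehensive overview of ways to iterate over a dataframe. The most performant way is probably itertuples(). The following example builds a nested dictionary by looping over the grouped dataframe: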

out = {}
for k, g in grouped:            # loop over groupby
    out[k] = {}
    for row in g.itertuples():  # loop over dataframe
        out[k][row.B] = row.C
print(out)
# {'g1': {1: 'a', 2: 'b', 4: 'd'}, 'g2': {3: 'c'}}
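For this particular shape of result, the inner row loop can also be replaced by zipping the two columns. This is a sketch of an alternative, not a claim about which form is faster; it assumes the same df as above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['g1', 'g1', 'g2', 'g1'],
    'B': [1, 2, 3, 4],
    'C': ['a', 'b', 'c', 'd'],
})

# One inner dict per group, mapping column B to column C.
out = {k: dict(zip(g['B'], g['C'])) for k, g in df.groupby('A')}
print(out)
```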

- cottontail

- joris · Accepted Answer

df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)) already returns a dataframe, so you can no longer loop over the groups.

In general:
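To see where the ValueError comes from: iterating over a plain DataFrame yields its column labels, and unpacking a label such as 'c_os_family_ss' into name, group fails because the string has more than two characters. A minimal sketch with made-up sample data (agg_df, yielded and error_message are illustrative names):

```python
import pandas as pd

# Made-up sample in the shape of the question's dataframe.
df = pd.DataFrame({
    'c_os_family_ss': ['Windows 7', 'Windows 7', 'Linux'],
    'c_os_major_is': ['', '', ''],
    'l_customer_id_i': [90418, 90418, 131572],
})
agg_df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))

# Iterating a DataFrame gives column labels, not (key, group) pairs.
yielded = list(agg_df)
print(yielded)

error_message = None
try:
    for name, group in agg_df:  # tries to unpack a column label string
        pass
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```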

df.groupby(...) returns a GroupBy object (a DataFrameGroupBy or SeriesGroupBy), and with this, you can iterate through the groups (as explained in the docs here). You can do something like:
```
grouped = df.groupby('A')

for name, group in grouped:
    ...
```
When you apply a function on the groupby, in your example df.groupby(...).agg(...) (but this can also be transform, apply, mean, ...), you combine the results of applying the function to the different groups into one dataframe (the apply and combine steps of the 'split-apply-combine' paradigm of groupby). So the result is again a DataFrame (or a Series, depending on the applied function), not a GroupBy object.
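The three steps can be spelled out by hand to make the point concrete. The sketch below uses a small made-up dataframe and a sum as the applied function; pieces, manual and builtin are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['g1', 'g1', 'g2', 'g1'],
    'B': [1, 2, 3, 4],
})

pieces = {}
for name, group in df.groupby('A'):   # split: one dataframe per key
    pieces[name] = group['B'].sum()   # apply: reduce each group
manual = pd.Series(pieces)            # combine: one object for all groups

# The built-in aggregation performs all three steps in one call.
builtin = df.groupby('A')['B'].sum()

print(manual.to_dict())
print(type(builtin).__name__)
```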