How to iterate over rows in a DataFrame in Pandas

Question

How to iterate over rows in a DataFrame in Pandas

4017

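I have a pandas DataFrame, df:

How do I iterate over the rows of this DataFrame? For every row, I want to access its elements (the values in its cells) by column name. For example: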

for row in df.rows:
    print(row['c1'], row['c2'])

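I found a similar question, which suggests using either of these:

```
for date, row in df.T.iteritems():

for index, row in df.iterrows():
```

But I do not understand what the row object is and how I can work with it.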

- Roman

29

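df.iteritems() iterates over columns, not rows. So to iterate over rows you have to transpose (the "T"), which swaps rows and columns (a reflection over the diagonal). As a result, df.T.iteritems() effectively iterates over the rows of the original DataFrame. - Stefan Gruenwald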
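A small sketch of the transposition idea from this comment. It uses `items()` because `iteritems()` was removed in pandas 2.0; the column names `c1` and `c2` come from the question, and the data is made up for illustration.

```python
import pandas as pd

# Made-up data using the column names from the question.
df = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})

# items() yields (column label, column Series) pairs ...
column_labels = [label for label, _ in df.items()]

# ... so on the transposed frame it yields (row label, row Series) pairs.
rows = [(label, row["c1"], row["c2"]) for label, row in df.T.items()]

print(column_labels)
print(rows)
```

Transposing copies the data and can change dtypes when columns have different types, so this is mainly useful for understanding what the method iterates over.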

169

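Contrary to what cs95 says, there are perfectly good reasons to want to iterate over a DataFrame, so new users should not feel discouraged. One example is wanting to execute some code using the values of each row as input. Also, if your DataFrame is reasonably small (e.g. fewer than 1000 items), performance is really not an issue. - oulenz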

6

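In Python, the DataFrame seems to be the go-to table format. So whenever you want to read in a CSV, or you have a list of dicts whose values you want to manipulate, or you want to perform simple join, groupby or window operations, you use a DataFrame, even if your data is comparatively small. - oulenz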

37

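I agree with @oulenz. As far as I can tell, pandas is the go-to choice for reading a CSV file even if the dataset is small. It is simply easier to manipulate the data through its API. - F.S.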

9

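If you are a beginner to this thread and are not familiar with the pandas library, it is worth taking a step back and evaluating whether iteration is indeed the solution to your problem. In some cases, it is. In most cases, it isn't. It is important to help beginners understand the difference between "good code" and "code that just works", and when to use which, by introducing them to the concept of vectorization. - cs95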


34 Answers

Content provided by Stack Overflow.

- Hossein Kalbasi · Answer 1

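To both view and modify values, I would use iterrows(). In a for loop, using tuple unpacking (see the example: i, row), I use row only to view the value and use i with the loc method when I want to modify values. As stated in previous answers, you should not modify something you are iterating over.

for i, row in df.iterrows():
    # Read through `row`, but write back through df.loc using the index label.
    if row['A'] == 'Old_Value':
        df.loc[i, 'A'] = 'New_value'

In the loop, row is a copy and not a view of the row (the pandas documentation notes that this depends on the data types, so it should not be relied on for writing). Therefore, you should NOT write something like row['A'] = 'New_Value'; it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.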
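To make the copy-versus-view point concrete, here is a minimal sketch on a made-up two-column DataFrame. The mixed dtypes (strings and integers) are a deliberate assumption: in that case `iterrows()` has to build a fresh row object, so writes to `row` cannot reach the DataFrame.

```python
import pandas as pd

# Made-up data; mixed dtypes so that each row is built as a separate copy.
df = pd.DataFrame({"A": ["Old_Value", "keep", "Old_Value"], "B": [1, 2, 3]})

# Writing to `row` changes only the temporary Series that iterrows() produced.
for i, row in df.iterrows():
    row["A"] = "ignored"
values_after_row_write = df["A"].tolist()

# Writing through df.loc with the index label changes the DataFrame itself.
for i, row in df.iterrows():
    if row["A"] == "Old_Value":
        df.loc[i, "A"] = "New_Value"
values_after_loc_write = df["A"].tolist()

print(values_after_row_write)
print(values_after_loc_write)
```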

- François B. · Answer 2

The easiest way is to use the apply function.

def print_row(row):
    print(row['c1'], row['c2'])

# axis=1 passes each row (as a Series) to the function
df.apply(print_row, axis=1)

- shubham ranjan · Answer 3

There are many ways to iterate over the rows of a pandas DataFrame. One very simple and intuitive way is:

import pandas as pd

df = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]})
print(df)
for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i, 1])

    # For printing more than one columns
    print(df.iloc[i, [0, 2]])

- Ernesto Elsäßer · Answer 4

Probably the most elegant solution (but certainly not the most efficient):

for row in df.values:
    c2 = row[1]
    print(row)
    # ...

for c1, c2 in df.values:  # unpacking like this requires exactly two columns
    pass  # ...

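Notes:

The documentation explicitly recommends using .to_numpy() instead.
The resulting NumPy array will have a dtype that fits all columns, in the worst case object.
There are good reasons not to use a loop in the first place.

Still, I think this option should be included here, as a straightforward solution to a (one should think) trivial problem.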

- Gabriel Staples · Answer 5

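Key takeaways:

1. Use vectorization.
2. Speed profile your code! Don't assume something is faster because you think it is faster; speed profile it and prove it is faster. The results may surprise you.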
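As a rough illustration of the second takeaway, the sketch below times two candidate implementations with the standard-library `timeit` module. The data, the function names, and the repeat count are all made up for this example; it is not the benchmark harness used in this answer.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(-1000, 1000, size=10_000)})

def with_loop():
    # Python-level iteration over the column values.
    return [2 * a for a in df["A"]]

def with_vectorization():
    # The same calculation expressed on the whole column at once.
    return 2 * df["A"]

# Check that both candidates agree before comparing their speed.
same_result = with_loop() == with_vectorization().tolist()

loop_seconds = timeit.timeit(with_loop, number=20)
vectorized_seconds = timeit.timeit(with_vectorization, number=20)
print(f"loop: {loop_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s")
```

The absolute numbers depend on the machine and the pandas version, which is exactly why measuring beats guessing.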

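How to iterate over Pandas DataFrames without iterating

After several weeks of working on this answer, here's what I've come up with:

Here are 13 techniques for iterating over Pandas DataFrames. As you can see, the time they take varies dramatically. The fastest technique is about 1363x faster than the slowest technique! The key takeaway, as @cs95 says here, is don't iterate! Use vectorization ("array programming") instead. All this really means is that you should use the arrays directly in mathematical formulas rather than trying to manually iterate over the arrays. The underlying objects must support this, of course, but both Numpy and Pandas do.

There are many ways to use vectorization in Pandas, which you can see in the plot and in my example code below. When using the arrays directly, the underlying looping still takes place, but in (I think) very optimized underlying C code rather than through raw Python.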
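A minimal sketch of what "using the arrays directly in formulas" means, on a tiny made-up DataFrame. The coefficients 4 and 7 are borrowed from the test equation further down; everything else is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "C": [10, 20, 30]})

# Manual iteration: compute one value per row in Python.
looped = [4 * a + 7 * c for a, c in zip(df["A"], df["C"])]

# Vectorized: the same formula written once on whole columns.
vectorized = 4 * df["A"] + 7 * df["C"]

print(looped)
print(vectorized.tolist())
```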

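Results

13 techniques, numbered 1 to 13, were tested. The technique number and name is underneath each bar. The total calculation time is above each bar. Underneath that is the multiplier showing how much longer it took than the fastest technique, at the far right:

From pandas_dataframe_iteration_vs_vectorization_vs_list_comprehension_speed_tests.svg in my eRCaGuy_hello_world repo (produced by this code).

Summary

List comprehension and vectorization (possibly with boolean indexing) are all you really need.

Use list comprehension (good) and vectorization (best). Pure vectorization, I think, is always possible, but may take extra work in complicated calculations. Search this answer for "boolean indexing", "boolean array", and "boolean mask" (all three are the same thing) to see some of the more complicated cases where pure vectorization can be used.

Here are the 13 techniques, listed in order from fastest to slowest. I recommend never using the last (slowest) 3 to 4 techniques.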

8_pure_vectorization__with_df.loc[]_boolean_array_indexing_for_if_statment_corner_case
6_vectorization__with_apply_for_if_statement_corner_case
7_vectorization__with_list_comprehension_for_if_statment_corner_case
11_list_comprehension_w_zip_and_direct_variable_assignment_calculated_in_place
10_list_comprehension_w_zip_and_direct_variable_assignment_passed_to_func
12_list_comprehension_w_zip_and_row_tuple_passed_to_func
5_itertuples_in_for_loop
13_list_comprehension_w__to_numpy__and_direct_variable_assignment_passed_to_func
9_apply_function_with_lambda
1_raw_for_loop_using_regular_df_indexing
2_raw_for_loop_using_df.loc[]_indexing
4_iterrows_in_for_loop
3_raw_for_loop_using_df.iloc[]_indexing

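Rules of thumb:

Techniques 3, 4, and 2 should NEVER be used. They are super slow and have no advantages whatsoever. Keep in mind though: it's not the indexing technique, such as .loc[] or .iloc[], that makes these techniques bad, but rather the for loop they are in! I use .loc[] inside the fastest (pure vectorization) approach, for instance! So, here are the 3 slowest techniques, which should never be used:
1. 3_raw_for_loop_using_df.iloc[]_indexing
2. 4_iterrows_in_for_loop
3. 2_raw_for_loop_using_df.loc[]_indexing

Technique 1_raw_for_loop_using_regular_df_indexing should not be used either, but if you're going to use a raw for loop, it's faster than the others.

The .apply() function (9_apply_function_with_lambda) is OK, but generally I'd avoid it too. Technique 6_vectorization__with_apply_for_if_statement_corner_case did perform better than 7_vectorization__with_list_comprehension_for_if_statment_corner_case, however, which is interesting.

List comprehension is great! It's not the fastest, but it is easy to use and very fast! The nice thing about it is that it can be used with any function that is intended to work on individual values or array values. This means you can have really complicated if statements and other logic inside the function. So the tradeoff is that it gives you great versatility, with readable and reusable code through external calculation functions, while still giving you great speed!

Vectorization is the fastest and best, and what you should use whenever the equation is simple. You can optionally use something like .apply() or list comprehension on just the more complicated portions of the equation, while still easily using vectorization for the rest.

Pure vectorization is the absolute fastest and best, and what you should use if you are willing to put in the effort to make it work.
1. For simple cases, it's what you should use.
2. For complicated cases, if statements, etc., pure vectorization can be made to work too, through boolean indexing, but it can add extra work and decrease readability. So you can optionally use list comprehension (usually the best) or .apply() (generally slower, but not always) for just those edge cases, while still using vectorization for the rest of the calculation. For example, see techniques 7_vectorization__with_list_comprehension_for_if_statment_corner_case and 6_vectorization__with_apply_for_if_statement_corner_case.

Test data

Assume we have the following Pandas DataFrame. It has 2 million rows and 4 columns (A, B, C, and D), each with random values from -1000 to 1000: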

df =
           A    B    C    D
0       -365  842  284 -942
1        532  416 -102  888
2        397  321 -296 -616
3       -215  879  557  895
4        857  701 -157  480
...      ...  ...  ...  ...
1999995 -101 -233 -377 -939
1999996 -989  380  917  145
1999997 -879  333 -372 -970
1999998  738  982 -743  312
1999999 -306 -103  459  745

I produced this DataFrame like this:

import numpy as np
import pandas as pd

# Create an array (numpy list of lists) of fake data
MIN_VAL = -1000
MAX_VAL = 1000
# NUM_ROWS = 10_000_000
NUM_ROWS = 2_000_000  # default for final tests
# NUM_ROWS = 1_000_000
# NUM_ROWS = 100_000
# NUM_ROWS = 10_000  # default for rapid development & initial tests
NUM_COLS = 4
data = np.random.randint(MIN_VAL, MAX_VAL, size=(NUM_ROWS, NUM_COLS))

# Now convert it to a Pandas DataFrame with columns named "A", "B", "C", and "D"
df_original = pd.DataFrame(data, columns=["A", "B", "C", "D"])
print(f"df = \n{df_original}")

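Test equation/calculation

I wanted to show that all of these techniques work on non-trivial functions or equations, so I intentionally made the equation they calculate require:

if statements
data from multiple columns in the DataFrame
data from multiple rows in the DataFrame

The equation we calculate for each row is shown below. I made it up arbitrarily, but I think it contains enough complexity that you will be able to expand on what I've done to perform any equation you want in Pandas with full vectorization:

val_i = 2*A_(i-2) + 3*A_(i-1) + 4*A_i + 5*A_(i+1) + (6*B_i if B_i > 0, else 60*B_i) + 7*C_i - 8*D_i

In Python, the above equation can be written like this: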

# Calculate and return a new value, `val`, by performing the following equation:
val = (
    2 * A_i_minus_2
    + 3 * A_i_minus_1
    + 4 * A
    + 5 * A_i_plus_1
    # Python ternary operator; don't forget parentheses around the entire 
    # ternary expression!
    + ((6 * B) if B > 0 else (60 * B))
    + 7 * C
    - 8 * D
)

Or, you could write it like this:

# Calculate and return a new value, `val`, by performing the following equation:

if B > 0:
    B_new = 6 * B
else:
    B_new = 60 * B

val = (
    2 * A_i_minus_2
    + 3 * A_i_minus_1
    + 4 * A
    + 5 * A_i_plus_1
    + B_new
    + 7 * C
    - 8 * D
)

Either of those can be wrapped into a function. For example:

def calculate_val(
        A_i_minus_2,
        A_i_minus_1,
        A,
        A_i_plus_1,
        B,
        C,
        D):
    val = (
        2 * A_i_minus_2
        + 3 * A_i_minus_1
        + 4 * A
        + 5 * A_i_plus_1
        # Python ternary operator; don't forget parentheses around the 
        # entire ternary expression!
        + ((6 * B) if B > 0 else (60 * B))
        + 7 * C
        - 8 * D
    )
    return val

The techniques

The full code is available to download and run in my python/pandas_dataframe_iteration_vs_vectorization_vs_list_comprehension_speed_tests.py file in my eRCaGuy_hello_world repo.

Here is the code for all 13 techniques:

1_raw_for_loop_using_regular_df_indexing

val = [np.nan]*len(df)  # np.nan: the np.NAN alias was removed in NumPy 2.0
for i in range(len(df)):
    if i < 2 or i > len(df)-2:
        continue

    val[i] = calculate_val(
        df["A"][i-2],
        df["A"][i-1],
        df["A"][i],
        df["A"][i+1],
        df["B"][i],
        df["C"][i],
        df["D"][i],
    )
df["val"] = val  # put this column back into the dataframe

2_raw_for_loop_using_df.loc[]_indexing

val = [np.nan]*len(df)
for i in range(len(df)):
    if i < 2 or i > len(df)-2:
        continue

    val[i] = calculate_val(
        df.loc[i-2, "A"],
        df.loc[i-1, "A"],
        df.loc[i,   "A"],
        df.loc[i+1, "A"],
        df.loc[i,   "B"],
        df.loc[i,   "C"],
        df.loc[i,   "D"],
    )

df["val"] = val  # put this column back into the dataframe

3_raw_for_loop_using_df.iloc[]_indexing

# column indices
i_A = 0
i_B = 1
i_C = 2
i_D = 3

val = [np.nan]*len(df)
for i in range(len(df)):
    if i < 2 or i > len(df)-2:
        continue

    val[i] = calculate_val(
        df.iloc[i-2, i_A],
        df.iloc[i-1, i_A],
        df.iloc[i,   i_A],
        df.iloc[i+1, i_A],
        df.iloc[i,   i_B],
        df.iloc[i,   i_C],
        df.iloc[i,   i_D],
    )

df["val"] = val  # put this column back into the dataframe

4_iterrows_in_for_loop

val = [np.nan]*len(df)
for index, row in df.iterrows():
    if index < 2 or index > len(df)-2:
        continue

    val[index] = calculate_val(
        df["A"][index-2],
        df["A"][index-1],
        row["A"],
        df["A"][index+1],
        row["B"],
        row["C"],
        row["D"],
    )

df["val"] = val  # put this column back into the dataframe

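For all of the following examples, we must first prepare the DataFrame by adding columns with the previous and next values: A_(i-2), A_(i-1), and A_(i+1). These columns in the DataFrame will be named A_i_minus_2, A_i_minus_1, and A_i_plus_1, respectively.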

df_original["A_i_minus_2"] = df_original["A"].shift(2)  # val at index i-2
df_original["A_i_minus_1"] = df_original["A"].shift(1)  # val at index i-1
df_original["A_i_plus_1"] = df_original["A"].shift(-1)  # val at index i+1

# Note: to ensure that no partial calculations are ever done with rows which
# have NaN values due to the shifting, we can either drop such rows with
# `.dropna()`, or set all values in these rows to NaN. I'll choose the latter
# so that the stats that will be generated with the techniques below will end
# up matching the stats which were produced by the prior techniques above. ie:
# the number of rows will be identical to before. 
#
# df_original = df_original.dropna()
df_original.iloc[:2, :] = np.nan   # slicing operators: first two rows,
                                   # all columns
df_original.iloc[-1:, :] = np.nan  # slicing operators: last row, all columns

Running the vectorized code just above to produce these 3 new columns took a total of 0.044961 seconds.
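To see what the shifted columns contain, here is the same `shift()` pattern on a tiny made-up column. The five values are arbitrary; only the column names match the ones used in this answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50]})

df["A_i_minus_2"] = df["A"].shift(2)   # value of A two rows earlier
df["A_i_minus_1"] = df["A"].shift(1)   # value of A one row earlier
df["A_i_plus_1"] = df["A"].shift(-1)   # value of A one row later
print(df)

# Rows 0, 1 and 4 lack at least one neighbour, so they contain NaN.
complete_rows = df.dropna().index.tolist()
print(complete_rows)
```

Each row now carries its neighbours' values side by side, which is what lets the later techniques avoid looking at other rows during the calculation.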

Now on to the rest of the techniques:

5_itertuples_in_for_loop

val = [np.nan]*len(df)
for row in df.itertuples():
    val[row.Index] = calculate_val(
        row.A_i_minus_2,
        row.A_i_minus_1,
        row.A,
        row.A_i_plus_1,
        row.B,
        row.C,
        row.D,
    )

df["val"] = val  # put this column back into the dataframe

6_vectorization__with_apply_for_if_statement_corner_case

def calculate_new_column_b_value(b_value):
    # Python ternary operator
    b_value_new = (6 * b_value) if b_value > 0 else (60 * b_value)  
    return b_value_new

# In this particular example, since we have an embedded `if-else` statement
# for the `B` column, pure vectorization is less intuitive. So, first we'll
# calculate a new `B` column using
# **`apply()`**, then we'll use vectorization for the rest.
df["B_new"] = df["B"].apply(calculate_new_column_b_value)
# OR (same thing, but with a lambda function instead)
# df["B_new"] = df["B"].apply(lambda x: (6 * x) if x > 0 else (60 * x))

# Now we can use vectorization for the rest. "Vectorization" in this case
# means to simply use the column series variables in equations directly,
# without manually iterating over them. Pandas DataFrames will handle the
# underlying iteration automatically for you. You just focus on the math.
df["val"] = (
    2 * df["A_i_minus_2"]
    + 3 * df["A_i_minus_1"]
    + 4 * df["A"]
    + 5 * df["A_i_plus_1"]
    + df["B_new"]
    + 7 * df["C"]
    - 8 * df["D"]
)

7_vectorization__with_list_comprehension_for_if_statment_corner_case

# In this particular example, since we have an embedded `if-else` statement
# for the `B` column, pure vectorization is less intuitive. So, first we'll
# calculate a new `B` column using **list comprehension**, then we'll use
# vectorization for the rest.
df["B_new"] = [
    calculate_new_column_b_value(b_value) for b_value in df["B"]
]

# Now we can use vectorization for the rest. "Vectorization" in this case
# means to simply use the column series variables in equations directly,
# without manually iterating over them. Pandas DataFrames will handle the
# underlying iteration automatically for you. You just focus on the math.
df["val"] = (
    2 * df["A_i_minus_2"]
    + 3 * df["A_i_minus_1"]
    + 4 * df["A"]
    + 5 * df["A_i_plus_1"]
    + df["B_new"]
    + 7 * df["C"]
    - 8 * df["D"]
)

8_pure_vectorization__with_df.loc[]_boolean_array_indexing_for_if_statment_corner_case

This uses boolean indexing, AKA: a boolean mask, to accomplish the equivalent of the if statement in the equation. In this way, pure vectorization can be used for the entire equation, thereby maximizing performance and speed.

# If statement to evaluate:
#
#     if B > 0:
#         B_new = 6 * B
#     else:
#         B_new = 60 * B
#
# In this particular example, since we have an embedded `if-else` statement
# for the `B` column, we can use some boolean array indexing through
# `df.loc[]` for some pure vectorization magic.
#
# Explanation:
#
# Long:
#
# The format is: `df.loc[rows, columns]`, except in this case, the rows are
# specified by a "boolean array" (AKA: a boolean expression, list of
# booleans, or "boolean mask"), specifying all rows where `B` is > 0. Then,
# only in that `B` column for those rows, set the value accordingly. After
# we do this for where `B` is > 0, we do the same thing for where `B` 
# is <= 0, except with the other equation.
#
# Short:
#
# For all rows where the boolean expression applies, set the column value
# accordingly.
#
# GitHub CoPilot first showed me this `.loc[]` technique.
# See also the official documentation:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
#
# ===========================
# 1st: handle the > 0 case
# ===========================
df["B_new"] = df.loc[df["B"] > 0, "B"] * 6
#
# ===========================
# 2nd: handle the <= 0 case, merging the results into the
# previously-created "B_new" column
# ===========================
# - NB: this does NOT work; it overwrites and replaces the whole "B_new"
#   column instead:
#
#       df["B_new"] = df.loc[df["B"] <= 0, "B"] * 60
#
# This works:
df.loc[df["B"] <= 0, "B_new"] = df.loc[df["B"] <= 0, "B"] * 60

# Now use normal vectorization for the rest.
df["val"] = (
    2 * df["A_i_minus_2"]
    + 3 * df["A_i_minus_1"]
    + 4 * df["A"]
    + 5 * df["A_i_plus_1"]
    + df["B_new"]
    + 7 * df["C"]
    - 8 * df["D"]
)
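The two-step `.loc[]` assignment in technique 8 can also be written as a single expression with `numpy.where`. This is an alternative that the answer does not benchmark, so treat it as a sketch on made-up data and not as one of the 13 timed techniques.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [5, -2, 0, 7]})

# Two-step boolean-mask version, as in technique 8.
df["B_new"] = df.loc[df["B"] > 0, "B"] * 6
df.loc[df["B"] <= 0, "B_new"] = df.loc[df["B"] <= 0, "B"] * 60

# Single-expression alternative: pick 6*B where B > 0, else 60*B.
df["B_new_where"] = np.where(df["B"] > 0, 6 * df["B"], 60 * df["B"])
print(df)
```

Both columns hold the same numbers; the first one ends up as floats because it temporarily contains NaN between the two steps.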

9_apply_function_with_lambda

df["val"] = df.apply(
    lambda row: calculate_val(
        row["A_i_minus_2"],
        row["A_i_minus_1"],
        row["A"],
        row["A_i_plus_1"],
        row["B"],
        row["C"],
        row["D"]
    ),
    axis='columns' # same as `axis=1`: "apply function to each row", 
                   # rather than to each column
)

10_list_comprehension_w_zip_and_direct_variable_assignment_passed_to_func

df["val"] = [
    # Note: you *could* do the calculations directly here instead of using a
    # function call, so long as you don't have indented code blocks such as
    # sub-routines or multi-line if statements.
    #
    # I'm using a function call.
    calculate_val(
        A_i_minus_2,
        A_i_minus_1,
        A,
        A_i_plus_1,
        B,
        C,
        D
    ) for A_i_minus_2, A_i_minus_1, A, A_i_plus_1, B, C, D
    in zip(
        df["A_i_minus_2"],
        df["A_i_minus_1"],
        df["A"],
        df["A_i_plus_1"],
        df["B"],
        df["C"],
        df["D"]
    )
]

11_list_comprehension_w_zip_and_direct_variable_assignment_calculated_in_place

df["val"] = [
    2 * A_i_minus_2
    + 3 * A_i_minus_1
    + 4 * A
    + 5 * A_i_plus_1
    # Python ternary operator; don't forget parentheses around the entire
    # ternary expression!
    + ((6 * B) if B > 0 else (60 * B))
    + 7 * C
    - 8 * D
    for A_i_minus_2, A_i_minus_1, A, A_i_plus_1, B, C, D
    in zip(
        df["A_i_minus_2"],
        df["A_i_minus_1"],
        df["A"],
        df["A_i_plus_1"],
        df["B"],
        df["C"],
        df["D"]
    )
]

12_list_comprehension_w_zip_and_row_tuple_passed_to_func

df["val"] = [
    calculate_val(
        row[0],
        row[1],
        row[2],
        row[3],
        row[4],
        row[5],
        row[6],
    ) for row
    in zip(
        df["A_i_minus_2"],
        df["A_i_minus_1"],
        df["A"],
        df["A_i_plus_1"],
        df["B"],
        df["C"],
        df["D"]
    )
]

13_list_comprehension_w__to_numpy__and_direct_variable_assignment_passed_to_func

df["val"] = [
    # Note: you *could* do the calculations directly here instead of using a
    # function call, so long as you don't have indented code blocks such as
    # sub-routines or multi-line if statements.
    #
    # I'm using a function call.
    calculate_val(
        A_i_minus_2,
        A_i_minus_1,
        A,
        A_i_plus_1,
        B,
        C,
        D
    ) for A_i_minus_2, A_i_minus_1, A, A_i_plus_1, B, C, D
        # Note: this `[[...]]` double-bracket indexing is used to select a
        # subset of columns from the dataframe. The inner `[]` brackets
        # create a list from the column names within them, and the outer 
        # `[]` brackets accept this list to index into the dataframe and
        # select just this list of columns, in that order.
        # - See the official documentation on it here:
        #   https://pandas.pydata.org/docs/user_guide/indexing.html#basics
        #   - Search for the phrase "You can pass a list of columns to [] to
        #     select columns in that order."
        #   - I learned this from this comment here:
        #     https://dev59.com/h2Qn5IYBdhLWcg3w5qlg#TYmZZowBVUcc3sd71Izc
        # - One of the **list comprehension** examples in this answer here
        #   uses `.to_numpy()` like this:
        #   https://dev59.com/h2Qn5IYBdhLWcg3w5qlg#55557758
    in df[[
        "A_i_minus_2",
        "A_i_minus_1",
        "A",
        "A_i_plus_1",
        "B",
        "C",
        "D"
    ]].to_numpy()  # NB: `.values` works here too, but is deprecated. See:
                   # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html
]

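Here are the results again:

Using the pre-shifted rows in the 4 for loop techniques as well

I wanted to see if removing this if check and using the pre-shifted rows in the 4 for loop techniques would have much effect: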

if i < 2 or i > len(df)-2:
    continue

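...so I created this file with those modifications: pandas_dataframe_iteration_vs_vectorization_vs_list_comprehension_speed_tests_mod.py. Search the file for "MOD:" to find the 4 new, modified techniques.

There was only a slight improvement. Here are the results of these 17 techniques now, with the 4 new ones having the word _MOD_ near the beginning of their name, just after their number. This is over 500k rows this time, not 2M:

See also

This answer is also posted on my personal website: https://gabrielstaples.com/python_iterate_over_pandas_dataframe/

https://en.wikipedia.org/wiki/Array_programming - array programming, AKA: "vectorization":

In computer science, array programming refers to solutions that allow the application of operations to an entire set of values at once. Such solutions are commonly used in scientific and engineering settings.

Modern programming languages that support array programming (also known as vector or multidimensional languages) have been engineered specifically to generalize operations on scalars to apply transparently to vectors, matrices, and higher-dimensional arrays. These include APL, J, Fortran, MATLAB, Analytica, Octave, R, Cilk Plus, Julia, and Perl Data Language (PDL). In these languages, an operation that operates on entire arrays can be called a vectorized operation, regardless of whether it is executed on a vector processor, which implements vector instructions.

Are for-loops in pandas really bad? When should I care? (my answer)

Does pandas iterrows have performance issues? (this answer)

My comment underneath that answer: based on my results, I'd say these are the best approaches, in this order:
1. vectorization,
2. list comprehension,
3. .itertuples(),
4. .apply(),
5. raw for loop,
6. .iterrows().

I didn't test Cython.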

- James L. · Answer 6

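You can also do NumPy indexing for even greater speed-ups. It's not really iterating, but it works much better than iteration for certain applications.

subset = row['c1'][0:5]
all_values = row['c1'][:]

You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast.

np.asarray(all_values)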
imgs[:] = cv2.resize(imgs[:], (224,224) ) # Resize every image in an hdf5 file

- cottontail · Answer 7

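1. Iterate over `df.index` and access via `at[]`

A method that is quite readable is to iterate over the index (as suggested by @Grag2015). However, instead of the chained indexing used there, use at for efficiency: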

for ind in df.index:
    print(df.at[ind, 'col A'])

The advantage of this method over for i in range(len(df)) is that it works even if the index is not a RangeIndex. See the following example:

df = pd.DataFrame({'col A': list('ABCDE'), 'col B': range(5)}, index=list('abcde'))

for ind in df.index:
    print(df.at[ind, 'col A'], df.at[ind, 'col B'])    # <---- OK
    df.at[ind, 'col C'] = df.at[ind, 'col B'] * 2      # <---- can assign values
        
for ind in range(len(df)):
    print(df.at[ind, 'col A'], df.at[ind, 'col B'])    # <---- KeyError

If the integer location of a row is needed (e.g. to get the previous row's values), wrap the index in enumerate():

for i, ind in enumerate(df.index):
    prev_row_ind = df.index[i-1] if i > 0 else df.index[i]
    df.at[ind, 'col C'] = df.at[prev_row_ind, 'col B'] * 2

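2. Use `get_loc` with `itertuples()`

Even though it's much faster than iterrows(), a major drawback of itertuples() is that it mangles column labels if they contain a space (e.g. 'col C' becomes _1 etc.), which makes it hard to access values during iteration.

You can use df.columns.get_loc() to get the integer location of a column label and use it to index the namedtuples. Note that the first element of each namedtuple is the index label, so to properly access a column by integer position, you either have to add 1 to whatever get_loc returns or unpack the tuple at the beginning.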

df = pd.DataFrame({'col A': list('ABCDE'), 'col B': range(5)}, index=list('abcde'))

for row in df.itertuples(name=None):
    pos = df.columns.get_loc('col B') + 1              # <---- add 1 here
    print(row[pos])


for ind, *row in df.itertuples(name=None):
#   ^^^^^^^^^    <---- unpacked here
    pos = df.columns.get_loc('col B')                  # <---- already unpacked
    df.at[ind, 'col C'] = row[pos] * 2
    print(row[pos])
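The renaming described above can be inspected directly through the namedtuple's `_fields` attribute. The sketch below reuses the column labels from this answer on a shorter made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"col A": list("ABC"), "col B": range(3)}, index=list("abc"))

# Default namedtuples: labels that are not valid identifiers get positional names.
first = next(df.itertuples())
field_names = first._fields
print(field_names)

# Plain tuples (name=None): index label first, then the column values.
pos = df.columns.get_loc("col B") + 1   # +1 skips the index label
col_b_values = [row[pos] for row in df.itertuples(name=None)]
print(col_b_values)
```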

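3. Convert to a dictionary and iterate over `dict_items`

Another way to loop over a DataFrame is to convert it into a dictionary with orient='index' and iterate over the dict_items or dict_values.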

df = pd.DataFrame({'col A': list('ABCDE'), 'col B': range(5)})

for row in df.to_dict('index').values():
#                             ^^^^^^^^^         <--- iterate over dict_values
    print(row['col A'], row['col B'])


for index, row in df.to_dict('index').items():
#                                    ^^^^^^^^   <--- iterate over dict_items
    df.at[index, 'col A'] = row['col A'] + str(row['col B'])

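This does not mangle dtypes like iterrows, does not mangle column labels like itertuples, and is agnostic about the number of columns (zip(df['col A'], df['col B'], ...) would quickly turn cumbersome if there were many columns).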
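The dtype point can be checked on a small made-up DataFrame with one integer and one float column. The column names `n` and `x` are placeholders chosen for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5]})

# iterrows() builds one Series per row, so mixed int/float columns share a dtype.
_, first_row = next(df.iterrows())
iterrows_value = first_row["n"]

# to_dict('index') keeps each cell's own type.
dict_value = df.to_dict("index")[0]["n"]

print(type(iterrows_value), type(dict_value))
```

Here the integer 1 comes back as a float from `iterrows()` but stays an integer in the dictionary.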

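Finally, as @cs95 mentioned, avoid looping as much as possible. Especially if your data is numeric, there will be an optimized method for your task in the library if you dig a little.

That said, there are some cases where iteration is more efficient than vectorized operations. One common such task is dumping a pandas DataFrame into nested JSON. At least as of pandas 1.5.3, an itertuples() loop is much faster than any vectorized operation involving the groupby.apply method in that case.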

- Ashvani Jaiswal · Answer 8

df.iterrows() yields tuples (a, b), where a is the index and b is the row.
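Applied to the question, that looks roughly like the sketch below. The column names `c1` and `c2` come from the question; the values are made up.

```python
import pandas as pd

df = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})

pairs = []
for index, row in df.iterrows():
    # `index` is the row label; `row` is a Series indexed by column name.
    pairs.append((index, row["c1"], row["c2"]))
    print(index, row["c1"], row["c2"])

row_type = type(row).__name__
print(row_type)
```

So the row object the question asks about is a pandas Series, and `row['c1']` looks up the cell by column name.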

- mjr2000 · Answer 9

This example uses iloc to isolate each value in the DataFrame.

import pandas as pd

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

mjr = pd.DataFrame({'a': a, 'b': b})

size = mjr.shape  # (number of rows, number of columns)

for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])

- gru · Answer 10

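Disclaimer: Although there are many answers here that recommend not using an iterative (loop) approach (and I mostly agree), I would still see it as a reasonable approach in the following situation:

Extend a DataFrame with data from an API

Let's say you have a large DataFrame that contains incomplete user data. Now you have to extend this data with additional columns, for example the user's age and gender.

Both values have to be fetched from a backend API. I'm assuming the API doesn't provide a "batch" endpoint (which would accept multiple user IDs at once). Otherwise, you should call the API only once.

The cost (waiting time) of the network requests far exceeds the cost of iterating over the DataFrame. We're talking about network round-trip times of hundreds of milliseconds, compared to the negligibly small gains from using alternatives to iteration.

One expensive network request for each row

So in this case, I would absolutely prefer an iterative approach. Although the network request is expensive, it is guaranteed to be triggered only once for each row in the DataFrame. Here is an example using DataFrame.iterrows: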

for index, row in users_df.iterrows():
  user_id = row['user_id']

  # Trigger expensive network request once for each row
  response_dict = backend_api.get(f'/api/user-data/{user_id}')

  # Extend dataframe with multiple data from response
  users_df.at[index, 'age'] = response_dict.get('age')
  users_df.at[index, 'gender'] = response_dict.get('gender')
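One possible variation keeps the one-request-per-row behaviour but writes to the DataFrame only once per column, by collecting the responses first. The sketch below assumes a hypothetical `fetch_user_data` function standing in for `backend_api.get`; it returns canned data so that the example runs offline.

```python
import pandas as pd

def fetch_user_data(user_id):
    """Hypothetical stand-in for the network call; returns canned data."""
    fake_backend = {1: {"age": 31, "gender": "f"}, 2: {"age": 45}}
    return fake_backend.get(user_id, {})

users_df = pd.DataFrame({"user_id": [1, 2, 3]})

# One (simulated) request per row, in row order.
responses = [fetch_user_data(user_id) for user_id in users_df["user_id"]]

# Assign each new column once instead of cell by cell;
# keys missing from a response become missing values.
users_df["age"] = [r.get("age") for r in responses]
users_df["gender"] = [r.get("gender") for r in responses]
print(users_df)
```

Whether this is preferable depends on the case: the `iterrows` version above updates the DataFrame as it goes, which can matter if a request fails partway through.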
