Best way to iterate through the elements of a pandas Series

Question


13

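All of the following seem to work for iterating through the elements of a pandas Series, and I'm sure there are more ways to do it. What are the differences between them, and which is the best?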

import pandas


arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

# 1
for el in arr:
    print(el)

# 2
for _, el in arr.iteritems():
    print(el)

# 3
for el in arr.array:
    print(el)

# 4
for el in arr.values:
    print(el)

# 5
for i in range(len(arr)):
    print(arr.iloc[i])

- d.b

8

Why do you need to iterate? - Scott Boston

1

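Many of the arguments for why you shouldn't use iterrows (https://dev59.com/h2Qn5IYBdhLWcg3w5qlg#55557758) most likely apply to Series as well. That said, what do you mean by "best"? Performance? Conciseness? Idiomaticity? - fsimonjetz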

@fsimonjetz, let's say idiomatic - d.b

https://dev59.com/h2Qn5IYBdhLWcg3w5qlg#47149876 - Scott Boston

1

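If the purpose of iterating is just to print, it doesn't matter which method you use. If you are doing numerical computation, then I agree with @tdy's answer that you should convert to a numpy array and iterate over that. FWIW, I also answered another question using numpy loops that applies equally to your question (if it is numerical): https://dev59.com/T2sz5IYBdhLWcg3wiISe#65414460 - JohnE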

1

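Why do you need to iterate? It's almost always unnecessary: there are .apply(), Series addition and multiplication, and so on. You haven't shown a real use case; print() isn't one. Show us some use cases. - smci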

6 Answers

1

Use items:

for i, v in arr.items():
    print(f'index: {i} and value: {v}')

Output:

index: 0 and value: 1
index: 1 and value: 1
index: 2 and value: 1
index: 3 and value: 2
index: 4 and value: 2
index: 5 and value: 2
index: 6 and value: 3
index: 7 and value: 3

- Scott Boston

1

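The test results are as follows. The plain loop is the slowest. iterrows() is optimized for pandas DataFrames and is a significant improvement over a plain loop. apply() also loops over the rows, but it is more efficient than iterrows because it uses a series of internal optimizations, such as Python iterators. Vectorized operations on NumPy arrays are the fastest, followed by vectorized operations on pandas Series. Vectorized operations act on the whole sequence at once, which saves even more time. NumPy uses precompiled C code under the hood and avoids much of the overhead of pandas Series operations, so operations on NumPy arrays are faster than the same operations on pandas Series.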

loop: 1.80301690102 
iterrows: 0.724927186966 
apply: 0.645957946777
pandas series: 0.333024024963 
numpy array: 0.260366916656

Fastest to slowest: NumPy array > pandas Series > apply > iterrows > loop
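
As a rough illustration of the comparison described above (not the code that produced these timings), the sketch below times an explicit loop, `Series.apply`, a vectorized Series operation, and a vectorized NumPy operation on the same data. The data size, the add-one workload, and the `add_one_*` helper names are assumptions made for this example, and absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data; the size is an arbitrary choice for this sketch.
s = pd.Series(np.arange(100_000, dtype="int64"))


def add_one_loop(series):
    """Add 1 to every element with an explicit Python-level loop."""
    return [el + 1 for el in series]


def add_one_apply(series):
    """Add 1 via Series.apply, which still calls the function per element."""
    return series.apply(lambda el: el + 1)


def add_one_series(series):
    """Add 1 with a vectorized operation on the pandas Series."""
    return series + 1


def add_one_numpy(series):
    """Add 1 with a vectorized operation on the underlying NumPy array."""
    return series.to_numpy() + 1


for fn in (add_one_loop, add_one_apply, add_one_series, add_one_numpy):
    seconds = timeit.timeit(lambda: fn(s), number=5)
    print(f"{fn.__name__}: {seconds:.4f} s")
```

All four variants compute the same values; only the place where the loop runs (Python versus compiled code) differs.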

- lazy

1

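You are talking about iterating over a DataFrame (Series doesn't even have an .iterrows() method). DataFrame iteration has already been covered in many other SO posts. - tdy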

1

Ways to iterate in pandas/Python

arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

#Using Python range() method
for i in range(len(arr)):
    print(arr[i])

range() does not include the end value of the sequence.
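
One caveat worth illustrating: with an integer index, `arr[i]` looks up the label `i`, not the i-th position, so the `range(len(arr))` loop above only follows positional order when the Series has the default index. A minimal sketch, using a made-up Series whose labels are reversed:

```python
import pandas as pd

# A Series whose integer labels do not match the positions (hypothetical data).
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = [s[i] for i in range(len(s))]          # looks up labels 0, 1, 2
by_position = [s.iloc[i] for i in range(len(s))]  # looks up positions 0, 1, 2

print(by_label)
print(by_position)
```

If positional access is what you want, `s.iloc[i]` states that explicitly.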

#List Comprehension
print([arr[i] for i in range(len(arr))])

A list comprehension works the same way whether the input is a list, a string, or a tuple.

#Using Python enumerate() method
for el,j in enumerate(arr):
    print(j)
#Using Python NumPy module
import numpy as np
print(np.arange(len(arr)))
for i,j in np.ndenumerate(arr):
    print(j)

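enumerate is widely used to add a counter to a list or other iterable and return it as an enumerate object. It removes the overhead of keeping a count of the elements yourself during iteration. No counter is needed here. You can use np.ndenumerate() to mimic enumerate's behavior for numpy arrays. For very large n-dimensional lists, numpy is recommended.

You can also use traditional for and while loops.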

x=0
while x<len(arr):
    print(arr[x])
    x +=1
    
#Using lambda function
list(map(lambda x:x, arr))

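lambda functions can reduce the number of lines of code and can be used with filter, reduce, or map.

If you want to iterate over the rows of a DataFrame rather than a Series, you can use iterrows or itertuples (iteritems, now items, iterates over columns instead). From a memory and computation standpoint, the best approach is to treat columns as vectors and perform vector computations with numpy arrays. Loops are very expensive when it comes to big data. Converting the columns to numpy arrays and operating on them is easier and faster.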
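
To make the column-as-vector idea concrete, here is a minimal sketch contrasting a row loop with a vectorized computation. The DataFrame and its column names are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame, used only for illustration.
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "qty": [3, 2, 5]})

# Row by row: every row is materialized as a Series, which is slow on large data.
totals_loop = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Vectorized: multiply whole columns at once on the underlying NumPy arrays.
totals_vec = df["price"].to_numpy() * df["qty"].to_numpy()

print(totals_loop)
print(totals_vec)
```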

- Sonia Samipillai

1

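In vectorized programming (pandas, R, Octave, etc.), iterating over vectors is discouraged. Instead, use the mapping functions provided by the library to apply a function across a Series or dataset.

In your case, where you apply the print function to each element, the code is simply: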

import pandas
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

arr.apply(print)
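
Note that `apply` collects whatever the function returns, so `arr.apply(print)` also builds a result Series as a side effect. A small sketch of that behavior, assuming the same `arr` as above:

```python
import pandas as pd

arr = pd.Series([1, 1, 1, 2, 2, 2, 3, 3])

# print returns None for every element, so the result holds only missing values.
result = arr.apply(print)

print(len(result), result.isna().all())
```

In an interactive session that result Series is echoed after the printed values unless it is assigned or discarded.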

- S2L

1

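I think it is more important to understand what you actually need when looking for a solution. The choice of approach only matters when the data is very large; otherwise the cost is negligible, and for small datasets any of the methods below will do. This is well explained in PEP 469, PEP 3106, and Views And Iterators Instead Of Lists.

In Python 3, dict has only one such method, items(). It returns a view instead of building a list, so it is fast, and the view reflects later changes to the dictionary. Note that dict.iteritems() was removed in Python 3. Likewise, Series.iteritems() was deprecated in pandas 1.5 and removed in pandas 2.0, so the iteritems() examples below only run on older pandas versions; use Series.items() instead.

See the Python3 Wiki Built-In_Changes for more details.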
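
As a small sketch of the dict behavior referred to above: `items()` returns a live view, but changing the size of a dict while looping over it still raises an error. The dict contents and variable names here are made up for the example.

```python
import pandas as pd

d = {"a": 1}
view = d.items()   # a dynamic view, not a list copy
d["b"] = 2         # modify the dict after creating the view
print(list(view))  # the view reflects the new key

try:
    for key, value in d.items():
        d[key + "_copy"] = value  # changes the dict size mid-iteration
except RuntimeError as exc:
    print("RuntimeError:", exc)

# Series.items() similarly yields (index, value) pairs lazily.
arr = pd.Series([1, 1, 2])
pairs = list(arr.items())
print(pairs)
```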

arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
$ for index, value in arr.items():
   print(f"Index : {index}, Value : {value}")

Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for index, value in arr.iteritems():
   print(f"Index : {index}, Value : {value}")
   
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for _, value in arr.iteritems():
   print(f"Index : {index}, Value : {value}")

Index : 7, Value : 1
Index : 7, Value : 1
Index : 7, Value : 1
Index : 7, Value : 2
Index : 7, Value : 2
Index : 7, Value : 2
Index : 7, Value : 3
Index : 7, Value : 3

$ for i, v in enumerate(arr):
   print(f"Index : {i}, Value : {v}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for value in arr:
   print(value)

1
1
1
2
2
2
3
3



$ for value in arr.tolist():
   print(value)

1
1
1
2
2
2
3
3

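There is a good post on how to iterate over rows in pandas (How to iterate over rows in a DataFrame in Pandas). Although it is about DataFrames, it explains everything about items(), iteritems(), etc.

Another good discussion is SO items & iteritems.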

- Karn Kumar


- tdy · Accepted Answer

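TL;DR

Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.

However, if Series iteration is absolutely necessary, performance will depend on the dtype and index: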

Index	Fastest if numpy dtype	Fastest if pandas dtype	Idiomatic
Unneeded	`in s.to_numpy()`	`in s.array`	`in s`
Default	`in enumerate(s.to_numpy())`	`in enumerate(s.array)`	`in s.items()`
Custom	`in zip(s.index, s.to_numpy())`	`in s.items()`	`in s.items()`
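
One way to read the table is as a dispatch rule. The sketch below encodes it in a hypothetical helper, `iter_series` (not part of pandas), under the simplifying assumption that "pandas dtype" means any extension dtype as reported by `pd.api.types.is_extension_array_dtype`. That check also classifies the `string` dtype as an extension dtype, which this answer groups with the numpy dtypes, so treat it as an approximation.

```python
import pandas as pd


def iter_series(s, with_index=False):
    """Hypothetical helper: pick an iteration strategy following the table above."""
    is_ext = pd.api.types.is_extension_array_dtype(s.dtype)
    idx = s.index
    # "Default" index: a RangeIndex counting 0, 1, 2, ...
    default = isinstance(idx, pd.RangeIndex) and idx.start == 0 and idx.step == 1

    if not with_index:
        return iter(s.array) if is_ext else iter(s.to_numpy())
    if is_ext:
        return enumerate(s.array) if default else s.items()
    values = s.to_numpy()
    return enumerate(values) if default else zip(idx, values)


ints = pd.Series([10, 20, 30])
labeled = pd.Series([1, 2], index=["a", "b"])
cats = pd.Series(["x", "y"], dtype="category")

print(list(iter_series(ints)))
print(list(iter_series(labeled, with_index=True)))
print(list(iter_series(cats, with_index=True)))
```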

For numpy-based Series, use s.to_numpy()

If the Series is a python or numpy dtype, it's usually fastest to iterate the underlying numpy ndarray:
```
for el in s.to_numpy(): # if dtype is datetime, int, float, str, string
```
[Benchmark plots: datetime; int, float, float + nan, str, string]
To access the index, it's actually fastest to enumerate() or zip() the numpy ndarray:
```
for i, el in enumerate(s.to_numpy()): # if default range index
```
```
for i, el in zip(s.index, s.to_numpy()): # if custom index
```
Both are faster than the idiomatic s.items() / s.iteritems():

[Benchmark plot: datetime + index]
To micro-optimize, switch to s.tolist() for shorter int/float/str Series:
```
for el in s.to_numpy(): # if >100K elements
```
```
for el in s.tolist(): # to micro-optimize if <100K elements
```
Warning: Do not use list(s) as it doesn't use compiled code which makes it slower.
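
A small sketch of what the two conversions hand to the loop, which may help explain the difference: iterating the ndarray yields NumPy scalars, while `tolist()` converts everything to native Python objects up front. The three-element Series is made up for the example, and this is an illustration of the element types, not a benchmark.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

first_np = next(iter(s.to_numpy()))  # a NumPy integer scalar
first_py = s.tolist()[0]             # a plain Python int

print(type(first_np).__name__, type(first_py).__name__)
```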

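For pandas-based Series, use `s.array` or `s.items()`

Pandas extension dtypes contain extra (meta)data, e.g.:

pandas dtype	Contents
`Categorical`	2 arrays
`DatetimeTZ`	array + timezone metadata
`Interval`	2 arrays
`Period`	array + frequency metadata
...	...

Converting these extension arrays to numpy "may be expensive" since it could involve copying/coercing the data, so: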

If the Series is a pandas extension dtype, it's generally fastest to iterate the underlying pandas array:
```
for el in s.array: # if dtype is pandas-only extension
```
For example, with ~100 unique Categorical values:

[Benchmark plots: Categorical; DatetimeTZ, Period, Interval]
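
To see what the conversion involves, the following sketch compares `s.array` with `s.to_numpy()` for two extension dtypes. The sample values are made up; the point is only that `to_numpy()` has to build a new object ndarray here, while `s.array` returns the existing pandas array.

```python
import pandas as pd

cat = pd.Series(["a", "b", "a"], dtype="category")
tz = pd.Series(pd.date_range("2021-01-01", periods=3, tz="CET"))

print(type(cat.array).__name__)  # the pandas Categorical, no conversion
print(cat.to_numpy().dtype)      # coerced to an object ndarray
print(type(tz.array).__name__)   # the pandas DatetimeArray, timezone kept
print(tz.to_numpy().dtype)       # object ndarray of Timestamp objects
```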
To access the index, the idiomatic s.items() is very fast for pandas dtypes:
```
for i, el in s.items(): # if need index for pandas-only dtype
```
[Benchmark plots: DatetimeTZ + index, Interval + index, Period + index]
To micro-optimize, switch to the slightly faster enumerate() for default-indexed Categorical arrays:
```
for i, el in enumerate(s.array): # to micro-optimize Categorical dtype if need default range index
```
[Benchmark plot: Categorical + index]

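Caveats

Avoid using s.values:
- Use s.to_numpy() to get the underlying numpy ndarray
- Use s.array to get the underlying pandas array

Avoid modifying the iterated Series:

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!

Avoid iterating manually whenever possible by:
1. Vectorizing, (boolean) indexing, etc.
2. Applying functions (note: these are not vectorizations, despite the common misconception)
3. Offloading to cython/numba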
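
A minimal sketch of the "writing has no effect" caveat, using a made-up three-element Series: assigning to the loop variable only rebinds a local name, so the Series is left unchanged, whereas the vectorized form produces the intended result.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

for el in s:
    el = el * 2      # rebinds the local name only; s is untouched

unchanged = s.tolist()

doubled = s * 2      # vectorized alternative that builds a new Series

print(unchanged)
print(doubled.tolist())
```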

Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: python==3.9.2, pandas==1.3.1, numpy==1.20.2
Testing data: Series generation code in snippet


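import pandas as pd
import numpy as np

n = 100_000  # assumption: n is not defined in the original snippet; set the Series length here

int_series = pd.Series(np.random.randint(1000000000, size=n))
float_series = pd.Series(np.random.randn(n))  # randn takes the length positionally, not as size=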
floatnan_series = pd.Series(np.random.choice([np.nan, np.inf]*n + np.random.randn(n).tolist(), size=n))
str_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype(str)
string_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype('string')
datetime_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01'), size=n))
datetimetz_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01', tz='CET'), size=n))
categorical_series = pd.Series(np.random.randint(100, size=n)).astype('category')
interval_series = pd.Series(pd.arrays.IntervalArray.from_arrays(-np.random.random(size=n), np.random.random(size=n)))
period_series = pd.Series(pd.period_range(end='2021-01-01', periods=n, freq='s'))
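
The timing harness itself is not shown in the snippet. As a rough stand-in, the sketch below uses `timeit` to compare a few iteration strategies on one Categorical Series. The size `n`, the `consume` helper, and the `candidates` dict are assumptions for this example, not the original benchmark code, and results will differ across machines and versions.

```python
import timeit

import numpy as np
import pandas as pd

n = 10_000  # assumed size for this sketch
s = pd.Series(np.random.randint(100, size=n)).astype("category")


def consume(iterable):
    """Exhaust an iterable and return how many items it produced."""
    return sum(1 for _ in iterable)


# Each candidate mirrors one of the loop headers discussed in the answer.
candidates = {
    "for el in s": lambda: consume(s),
    "for el in s.array": lambda: consume(s.array),
    "for el in s.to_numpy()": lambda: consume(s.to_numpy()),
    "for i, el in s.items()": lambda: consume(s.items()),
}

for label, fn in candidates.items():
    print(f"{label}: {timeit.timeit(fn, number=10):.4f} s")
```

Swapping in any of the Series generated above for `s` gives a per-dtype comparison.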
