Getting a list of indices where a pandas boolean Series is True

Question

python pandas boolean series boolean-indexing

92

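I have a pandas Series of boolean values. I would like to get a list of the indices where the value is True.

For example, the input pd.Series([True, False, True, True, False, False, False, True])

should yield the output [0,2,3,7].

I can do it with a list comprehension, but is there something cleaner or faster?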

- James McKeown

3

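A better test case is s = pd.Series([True, False, True, True, False, False, False, True], index=list('ABCDEFGH')), with expected output Index(['A', 'C', 'D', 'H'], ...), because some solutions (in particular all the np functions) drop the index and fall back to an autonumbered one. - smci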

If we have a named index, dropping it is generally very undesirable. - smci
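
To make the point in the two comments above concrete, here is a minimal sketch using the suggested labeled test case. The variable names `positions`, `labels`, and `labels_from_positions` are only illustrative.

```python
import numpy as np
import pandas as pd

# Test case from the comment: a boolean Series with a labeled index.
s = pd.Series([True, False, True, True, False, False, False, True],
              index=list('ABCDEFGH'))

# NumPy functions only see the values, so they return integer positions.
positions = np.flatnonzero(s.to_numpy())

# Indexing the Index object with the boolean mask keeps the labels.
labels = s.index[s.to_numpy()]

# Positions can still be mapped back to labels through the index.
labels_from_positions = s.index[positions]

print(positions.tolist())              # [0, 2, 3, 7]
print(labels.tolist())                 # ['A', 'C', 'D', 'H']
print(labels_from_positions.tolist())  # ['A', 'C', 'D', 'H']
```

With the default RangeIndex the two results coincide, which is why the difference is easy to miss on the example from the question.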

4 Answers

29

As an addition to rafaelc's answer, here are the corresponding timings (from fastest to slowest) for the following setup.

import numpy as np
import pandas as pd
s = pd.Series([x > 0.5 for x in np.random.random(size=1000)])
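
The timings that follow appear to come from IPython's timeit magic. As a rough sketch for reproducing such a comparison in a plain script, the standard library's timeit module can be used instead. The choice of three candidates, the repeat counts, and the names `candidates` and `results` are illustrative assumptions, and absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as above: 1000 random booleans.
s = pd.Series([x > 0.5 for x in np.random.random(size=1000)])

# Illustrative subset of the approaches compared in this answer.
candidates = {
    "np.flatnonzero(values)": lambda: np.flatnonzero(s.to_numpy()),
    "s.index[s]": lambda: s.index[s],
    "s[s].index": lambda: s[s].index,
}

results = {}
for name, fn in candidates.items():
    # Best of 5 repeats of 100 calls each, converted to microseconds per call.
    best = min(timeit.repeat(fn, number=100, repeat=5)) / 100
    results[name] = best * 1e6
    print(f"{name}: {results[name]:.1f} µs per call")
```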

Using `np.where`

>>> timeit np.where(s)[0]
12.7 µs ± 77.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using `np.flatnonzero`

>>> timeit np.flatnonzero(s)
18 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

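Using `pd.Series.index`

The time difference compared with boolean indexing really surprised me, since boolean indexing is the more commonly used approach.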

>>> timeit s.index[s]
82.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Using boolean indexing

>>> timeit s[s].index
1.75 ms ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you need an `np.array` object, get it through the `.values` attribute.

>>> timeit s[s].index.values
1.76 ms ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you need a slightly easier-to-read version:

>>> timeit s[s==True].index
1.89 ms ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using `pd.Series.where` <-- not in the original answer

>>> timeit s.where(s).dropna().index
2.22 ms ± 3.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> timeit s.where(s == True).dropna().index
2.37 ms ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using `pd.Series.mask` <-- not in the original answer

>>> timeit s.mask(s).dropna().index
2.29 ms ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> timeit s.mask(s == True).dropna().index
2.44 ms ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
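
One thing worth checking when using these two methods for selection rather than timing: `Series.where` keeps entries where the condition is True, while `Series.mask` replaces them. A small sketch on the Series from the question, with illustrative variable names, suggests the condition has to be inverted for `mask` to return the True positions.

```python
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# where keeps values where the condition holds, so the True entries survive dropna.
via_where = s.where(s).dropna().index

# mask hides values where the condition holds, so the condition is inverted here.
via_mask = s.mask(~s).dropna().index

# Without the inversion, mask leaves only the False entries.
mask_not_inverted = s.mask(s).dropna().index

print(via_where.tolist())          # [0, 2, 3, 7]
print(via_mask.tolist())           # [0, 2, 3, 7]
print(mask_not_inverted.tolist())  # [1, 4, 5, 6]
```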

Using a list comprehension

>>> timeit [i for i in s.index if s[i]]
13.7 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using Python's built-in filter

>>> timeit [*filter(s.get, s.index)]
14.2 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using np.nonzero <-- did not work out of the box for me

>>> timeit np.nonzero(s)
ValueError: Length of passed values is 1, index implies 1000.

Using np.argwhere <-- did not work for me either

>>> timeit np.argwhere(s).ravel()
ValueError: Length of passed values is 1, index implies 1000.
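
The two errors above presumably depend on the installed pandas and NumPy versions (an assumption, since the answer does not state them). A possible workaround is to hand NumPy a plain array via `Series.to_numpy()` so that no Series wrapping is involved. A minimal sketch on the Series from the question:

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# Convert to a plain boolean ndarray before calling the NumPy functions.
values = s.to_numpy()

# np.nonzero returns a tuple with one array per dimension; take the first.
via_nonzero = np.nonzero(values)[0]

# np.argwhere returns an (n, 1) array for 1-D input; ravel flattens it.
via_argwhere = np.argwhere(values).ravel()

print(via_nonzero.tolist())   # [0, 2, 3, 7]
print(via_argwhere.tolist())  # [0, 2, 3, 7]
```

Both results are integer positions, not index labels.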

- Christian Steinmeyer

3

Also works: s.where(lambda x: x).dropna().index, and it has the advantage of being easy to chain: if your Series is computed on the fly, you don't need to assign it to a variable.

Note that if s is computed from r, that is s = cond(r), then you can also use r.where(lambda x: cond(x)).dropna().index.
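
A small sketch of the chained form, assuming a hypothetical numeric Series `r` and a hypothetical condition `cond` (neither comes from the answer itself):

```python
import pandas as pd

# Hypothetical source data with a labeled index.
r = pd.Series([1, 5, 3, 8], index=list('abcd'))


def cond(x):
    # Hypothetical condition: keep values greater than 2.
    return x > 2


# where() turns entries failing the condition into NaN; dropna() removes them.
out = r.where(lambda x: cond(x)).dropna().index

# The same selection using a callable as the indexer.
alt = r[lambda x: cond(x)].index

print(out.tolist())  # ['b', 'c', 'd']
print(alt.tolist())  # ['b', 'c', 'd']
```

One caveat with the where/dropna form: dropna also discards entries of r that were already NaN, even where the condition would otherwise be irrelevant.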

- tsvikas

"It has the advantage of being easy to chain" - you can pass a function as the indexer, so this works: s[lambda x: x].index. - wjandrea

1

You can use pipe or loc to chain the operation, which is helpful when s is an intermediate result and you don't want to give it a name.

s = pd.Series([True, False, True, True, False, False, False, True], index=list('ABCDEFGH'))

out = s.pipe(lambda s_: s_[s_].index)
# or
out = s.pipe(lambda s_: s_[s_]).index
# or
out = s.loc[lambda s_: s_].index

print(out)

Index(['A', 'C', 'D', 'H'], dtype='object')

- Ynjxsjmh

Regular indexing works too: s[lambda s_: s_].index - wjandrea

- rafaelc · Accepted Answer

Using boolean indexing

>>> s = pd.Series([True, False, True, True, False, False, False, True])
>>> s[s].index
Int64Index([0, 2, 3, 7], dtype='int64')

If you need an `np.array` object, use the `.values` attribute.

>>> s[s].index.values
array([0, 2, 3, 7])

Using `np.nonzero`

>>> np.nonzero(s)
(array([0, 2, 3, 7]),)

Using `np.flatnonzero`

>>> np.flatnonzero(s)
array([0, 2, 3, 7])

Using `np.where`

>>> np.where(s)[0]
array([0, 2, 3, 7])

Using `np.argwhere`

>>> np.argwhere(s).ravel()
array([0, 2, 3, 7])

Using `pd.Series.index`

>>> s.index[s]
array([0, 2, 3, 7])

Using Python's built-in filter

>>> [*filter(s.get, s.index)]
[0, 2, 3, 7]

Using a list comprehension

>>> [i for i in s.index if s[i]]
[0, 2, 3, 7]
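
Since the question asks for a plain list, the Index-returning variants above can be converted with `.tolist()`. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# Index the Index with the boolean mask, then convert to a Python list.
as_list = s.index[s].tolist()

# The boolean-indexing form converts the same way.
as_list_alt = s[s].index.tolist()

print(as_list)      # [0, 2, 3, 7]
print(as_list_alt)  # [0, 2, 3, 7]
```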
