Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Question

847

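I want to filter my dataframe with an or condition to keep rows whose values in a particular column are outside the range [-0.25, 0.25]. I tried:

df = df[(df['col'] < -0.25) or (df['col'] > 0.25)]

But I get the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().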

- obabs

125

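Use | instead of or. - MaxU - stand with Ukraine

7

Here's a workaround: abs(result['var'])>0.25 - ColinMac

7

Related: Logical operators for boolean indexing in Pandas - cs95

3

I ran into the same error message when using the standard max() function. Replacing it with numpy.maximum() for the element-wise maximum of two values solved my problem. - AstroFloyd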
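
The last comment can be illustrated with a small sketch. The two Series `a` and `b` are made-up data, not taken from the thread. The built-in max() compares its arguments with > and then needs a single True or False, which is exactly the ambiguous truth-value check, while numpy.maximum() works element by element:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
a = pd.Series([1, 5, 3])
b = pd.Series([4, 2, 6])

try:
    max(a, b)  # the built-in evaluates bool(b > a) internally
except ValueError as exc:
    builtin_error = str(exc)

# Element-wise maximum of the two Series.
elementwise_max = np.maximum(a, b)
print(elementwise_max.tolist())
```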

14 Answers

127

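Pandas uses the bitwise operators & and |. Also, each condition should be wrapped inside ( ).

This works:

data_query = data[(data['year'] >= 2005) & (data['year'] <= 2010)]

But the same query without the parentheses around each condition does not:

data_query = data[(data['year'] >= 2005 & data['year'] <= 2010)]

- Nipun

Why should each condition be wrapped in ( )? Please respond by editing (changing) your answer, not here in comments (but without "Edit:", "Update:", or similar; the answer should appear as if it was written today). - Peter Mortensen

This answer may provide the "why". - Peter Mortensen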

63

For boolean logic, use & and |.

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))

>>> df

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

To see what is happening, note that each comparison gives you a column of booleans, e.g.:

df.C > 0.25

0     True
1    False
2    False
3     True
4     True
Name: C, dtype: bool

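When you have multiple criteria, you get multiple columns back. This is why the join logic is ambiguous. Using and or or would have to treat each column as a whole, so you first need to reduce that column to a single boolean value. For example, you can check whether any value or all values in each of the columns are True.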

# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()

True

# All values in either column is True?
(df.C > 0.25).all() or (df.C < -0.25).all()

False

One convoluted way to achieve the same thing is to zip all of these columns together and perform the appropriate logic.

>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

For more details, refer to Boolean Indexing in the documentation.

- Alexander

15

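Beginners in Pandas often run into this problem when they combine multiple conditions. Generally speaking, there are two possible conditions causing this error:

Condition 1: Python Operator Precedence

There is a paragraph in Boolean indexing | Indexing and selecting data - pandas documentation that explains this:

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

By default, Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).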

# Wrong
df['col'] < -0.25 | df['col'] > 0.25

# Right
(df['col'] < -0.25) | (df['col'] > 0.25)
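
To see the precedence problem in action, here is a minimal sketch with a made-up integer DataFrame (the column names A and B follow the quoted documentation example; the values are assumptions). Without parentheses, `2 & df['B']` is evaluated first, and the remaining chained comparison is joined by an implicit `and`, which asks for the truth value of a Series. With float columns the unparenthesized form may fail with a different error instead, because & is not defined between a float and a float column.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"A": [1, 3, 5], "B": [1, 2, 4]})

try:
    # Parsed as: df["A"] > (2 & df["B"]) < 3
    # i.e. (df["A"] > (2 & df["B"])) and ((2 & df["B"]) < 3)
    df["A"] > 2 & df["B"] < 3
except ValueError as exc:
    error = str(exc)

# With parentheses, each comparison is evaluated before &.
mask = (df["A"] > 2) & (df["B"] < 3)
print(mask.tolist())
```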

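There are some possible ways to get rid of the parentheses, which I cover later.

Condition 2: Wrong Operator/Statement

As explained in the quotation above, you need to use | for or, & for and, and ~ for not.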

# Wrong
(df['col'] < -0.25) or (df['col'] > 0.25)

# Right
(df['col'] < -0.25) | (df['col'] > 0.25)

Another possible situation is that you are using a boolean Series in an if statement.

# Wrong
if pd.Series([True, False]):
    pass

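The Python if statement accepts a Boolean-like expression rather than a Pandas Series. You should use pandas.Series.any or one of the other methods listed in the error message to convert the Series to a single value according to your need.

For example: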

# Right
if df['col'].eq(0).all():
    # If you want all column values equal to zero
    print('do something')

# Right
if df['col'].eq(0).any():
    # If you want at least one column value equal to zero
    print('do something')

Let's talk about ways to avoid the parentheses in the first situation.

Use Pandas mathematical functions

Pandas has defined a lot of mathematical functions, including comparison, as follows:
- pandas.Series.lt() for less than;
- pandas.Series.gt() for greater than;
- pandas.Series.le() for less than or equal;
- pandas.Series.ge() for greater than or equal;
- pandas.Series.ne() for not equal;
- pandas.Series.eq() for equal;
As a result, you can use
```
df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

# is equal to

df = df[df['col'].lt(-0.25) | df['col'].gt(0.25)]
```
Use pandas.Series.between()

If you want to select rows in between two values, you can use pandas.Series.between:
- df['col'].between(left, right) is equal to
  (left <= df['col']) & (df['col'] <= right);
- df['col'].between(left, right, inclusive='left') is equal to
  (left <= df['col']) & (df['col'] < right);
- df['col'].between(left, right, inclusive='right') is equal to
  (left < df['col']) & (df['col'] <= right);
- df['col'].between(left, right, inclusive='neither') is equal to
  (left < df['col']) & (df['col'] < right);
```
df = df[(df['col'] > -0.25) & (df['col'] < 0.25)]

# is equal to

df = df[df['col'].between(-0.25, 0.25, inclusive='neither')]
```
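
Since the original question asks for rows outside the range [-0.25, 0.25], between() can also be negated with ~. This is a minimal sketch with made-up values; it relies on the default behaviour of between(), which includes both endpoints:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({'col': [-0.5, -0.25, 0.0, 0.25, 0.3]})

# True where the value lies outside the closed interval [-0.25, 0.25].
outside = ~df['col'].between(-0.25, 0.25)

# The same mask written with explicit comparisons.
same = (df['col'] < -0.25) | (df['col'] > 0.25)

filtered = df[outside]
print(filtered['col'].tolist())
```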
Use pandas.DataFrame.query()

The documentation referenced earlier has a chapter, The query() Method, that explains this well.

pandas.DataFrame.query() can help you select a DataFrame with a condition string. Within the query string, you can use both bitwise operators (& and |) and their boolean cousins (and and or). Moreover, you can omit the parentheses, but I don't recommend it for readability reasons.
```
df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

# is equal to

df = df.query('col < -0.25 or col > 0.25')
```
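
query() can also refer to Python variables from the surrounding scope by prefixing them with @. A minimal sketch; the names `lower` and `upper` and the data are illustrative assumptions:

```python
import pandas as pd

# Hypothetical data and bounds for illustration only.
df = pd.DataFrame({'col': [-0.5, -0.1, 0.0, 0.2, 0.3]})
lower, upper = -0.25, 0.25

# @name looks up a Python variable instead of a column.
result = df.query('col < @lower or col > @upper')
print(result['col'].tolist())
```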
Use pandas.DataFrame.eval()

pandas.DataFrame.eval() evaluates a string describing operations on DataFrame columns. Thus, we can use this method to build our multiple conditions. The syntax is the same as for pandas.DataFrame.query().
```
df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

# is equal to

df = df[df.eval('col < -0.25 or col > 0.25')]
```
pandas.DataFrame.query() and pandas.DataFrame.eval() can do more than I describe here. I recommend reading their documentation and experimenting with them.

- Ynjxsjmh

15

Alternatively, you could use the operator module. More detailed information is in the Python documentation.

import operator
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863
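
When the number of conditions is not fixed, the same operator functions can be folded over a list with functools.reduce. This is only a sketch of that idea; the list `conditions` and its contents are hypothetical:

```python
import operator
from functools import reduce

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))

# Hypothetical list of boolean masks; any number of entries works.
conditions = [df.C > 0.25, df.C < -0.25]

# Equivalent to conditions[0] | conditions[1] | ...
mask = reduce(operator.or_, conditions)
result = df.loc[mask]
print(result.index.tolist())
```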

- Cảnh Toàn Nguyễn

Operator helped me solve an issue with Jinja. Jinja doesn't accept the & operator, and a Pandas query can't access Jinja variables. But it works with the .loc operator! Thanks! - Dmitri K.

5

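This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the query method:

df = df.query("(col > 0.25) or (col < -0.25)")

See also Indexing and selecting data.

(Some tests with the dataframe I'm currently working with suggest that this method is a bit slower than using the bitwise operators on series of booleans: 2 ms vs. 870 µs)

A word of warning: at least one situation where this is not straightforward is when column names happen to be Python expressions. I had columns named WT_38hph_IP_2, WT_38hph_input_2 and log2(WT_38hph_IP_2/WT_38hph_input_2), and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"

I obtained the following exception cascade: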

KeyError: 'log2'
UndefinedVariableError: name 'log2' is not defined
ValueError: "log2" is not a supported function

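I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.

A possible workaround is proposed here (link).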
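
Independently of the linked workaround, two options that should sidestep the parser are plain bracket indexing, which never parses the column name, and backtick quoting inside query(), which pandas documents for column names that are not valid Python identifiers (pandas 1.0 or later is assumed here). The data below is made up; only the column names come from the answer:

```python
import pandas as pd

ratio = 'log2(WT_38hph_IP_2/WT_38hph_input_2)'

# Hypothetical values for illustration only.
df = pd.DataFrame({
    'WT_38hph_IP_2': [10, 40, 80],
    'WT_38hph_input_2': [10, 10, 100],
    ratio: [0.0, 2.0, -0.32],
})

# Option 1: bracket indexing treats the name as an opaque string.
by_mask = df[(df[ratio] > 1) & (df['WT_38hph_IP_2'] > 20)]

# Option 2: backticks tell query() that the text is one column name.
by_query = df.query(f'(`{ratio}` > 1) and (WT_38hph_IP_2 > 20)')
print(by_mask.index.tolist(), by_query.index.tolist())
```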

- bli

4

If you have more than one value:

df['col'].all()

If it's only a single value:

df['col'].item()

- Humza Sami

3

I was getting an error with this command:

if df != '':
    pass

When I changed it to this, it worked:

if df is not '':
    pass

- Mehdi Rostami

This is interesting, but it may be purely incidental. What is the explanation? - Peter Mortensen
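
One likely explanation, sketched below with made-up data: `df != ''` compares element by element and returns a DataFrame of booleans, so `if` has to ask for its truth value and raises. `is not` only tests object identity, which is always true for a DataFrame and a string, so the error disappears but the branch no longer depends on the data. If the intent was to check for an empty frame, `df.empty` is probably what is wanted. The variable `sentinel` exists only for this illustration:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"col": ["a", ""]})
sentinel = ""

# Element-wise comparison: the result is a DataFrame of booleans.
elementwise = df != sentinel
try:
    bool(elementwise)  # this is what `if df != '':` does implicitly
except ValueError as exc:
    message = str(exc)

# Identity comparison: a single bool that ignores the contents of df.
identity = df is not sentinel

# An explicit, unambiguous emptiness check.
has_rows = not df.empty
print(identity, has_rows)
```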

1

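I faced the same issue while working with a pandas dataframe.

I used numpy.logical_and:

Here I am trying to select the rows where person_id matches 41d7853 and degree_type is not Certification.

Like below: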

display(df_degrees.loc[np.logical_and(df_degrees['person_id'] == '41d7853' , df_degrees['degree_type'] !='Certification')])

If I try to write code like the below:

display(df_degrees.loc[df_degrees['person_id'] == '41d7853' and df_degrees['degree_type'] !='Certification'])

We will get the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I used numpy.logical_and, and it worked for me.

- Gautam

1

I'll try to give a benchmark of the three most common ways (also mentioned above):

from timeit import repeat

setup = """
import numpy as np;
import random;
x = np.linspace(0,100);
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) * (x <= ub)]', 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'

for _ in range(3):
    for stmt in stmts:
        t = min(repeat(stmt, setup, number=100_000))
        print('%.4f' % t, stmt)
    print()

Result:

0.4808 x[(x > lb) * (x <= ub)]
0.4726 x[(x > lb) & (x <= ub)]
0.4904 x[np.logical_and(x > lb, x <= ub)]

0.4725 x[(x > lb) * (x <= ub)]
0.4806 x[(x > lb) & (x <= ub)]
0.5002 x[np.logical_and(x > lb, x <= ub)]

0.4781 x[(x > lb) * (x <= ub)]
0.4336 x[(x > lb) & (x <= ub)]
0.4974 x[np.logical_and(x > lb, x <= ub)]

However, * is not supported on a pandas Series, and a NumPy array is much faster than a pandas DataFrame (the DataFrame is around 1000 times slower, see the numbers):

from timeit import repeat

setup = """
import numpy as np;
import random;
import pandas as pd;
x = pd.DataFrame(np.linspace(0,100));
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'

for _ in range(3):
    for stmt in stmts:
        t = min(repeat(stmt, setup, number=100))
        print('%.4f' % t, stmt)
    print()

Result:

0.1964 x[(x > lb) & (x <= ub)]
0.1992 x[np.logical_and(x > lb, x <= ub)]

0.2018 x[(x > lb) & (x <= ub)]
0.1838 x[np.logical_and(x > lb, x <= ub)]

0.1871 x[(x > lb) & (x <= ub)]
0.1883 x[np.logical_and(x > lb, x <= ub)]

Note: adding one line of code, x = x.to_numpy(), costs about 20 µs.

For those who prefer %timeit:

import numpy as np
import pandas as pd
import random
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
lb, ub
x = pd.DataFrame(np.linspace(0,100))

def asterik(x):
    x = x.to_numpy()
    return x[(x > lb) * (x <= ub)]

def and_symbol(x):
    x = x.to_numpy()
    return x[(x > lb) & (x <= ub)]

def numpy_logical(x):
    x = x.to_numpy()
    return x[np.logical_and(x > lb, x <= ub)]

for i in range(3):
    %timeit asterik(x)
    %timeit and_symbol(x)
    %timeit numpy_logical(x)
    print('\n')

Result:

23 µs ± 3.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.6 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
31.3 µs ± 8.9 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


21.4 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.9 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.7 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


25.1 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
36.8 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.2 µs ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

- Muhammad Yasirroni

- MSeifert · Accepted Answer

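The or and and Python statements require truth values. For pandas, these are considered ambiguous, so you should use the "bitwise" | (or) or & (and) operations:

df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

These are overloaded for these kinds of data structures to yield the element-wise or or and.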
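
What overloading means here can be sketched with a tiny stand-in class. `Flags` is a hypothetical toy, not how pandas is implemented: Python lets a class define | and & through __or__ and __and__, whereas or and and always go through bool(), which a container of many values cannot answer unambiguously.

```python
class Flags:
    """Toy container of booleans (hypothetical, for illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def __or__(self, other):
        # Called for `a | b`: combine element by element.
        return Flags(x or y for x, y in zip(self.values, other.values))

    def __and__(self, other):
        # Called for `a & b`: combine element by element.
        return Flags(x and y for x, y in zip(self.values, other.values))

    def __bool__(self):
        # Called for `a or b`, `a and b`, `if a:` and `while a:`.
        raise ValueError("The truth value of Flags is ambiguous.")


a = Flags([True, False, False])
b = Flags([False, False, True])

combined = a | b  # element-wise, no bool() involved
try:
    a or b  # Python needs bool(a) before it can decide
except ValueError as exc:
    error = str(exc)

print(combined.values)
```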

Just to add some more explanation to this statement:

The exception is thrown when you want to get the bool of a pandas.Series:

>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You hit a place where the operator implicitly converted the operands to bool (you used or, but it also happens for and, if and while):

>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

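Besides these four statements, there are several Python functions that hide some bool calls (like `any`, `all`, `filter`, ...). These are normally not problematic with `pandas.Series`, but I wanted to mention them for completeness.

In your case, the exception isn't really helpful, because it doesn't mention the right alternatives. For `and` and `or`, if you want element-wise comparisons, you can use:

numpy.logical_or: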

>>> import numpy as np
>>> np.logical_or(x, y)

or simply the | operator:

>>> x | y

numpy.logical_and:
```
>>> np.logical_and(x, y)
```
or simply the & operator:
```
>>> x & y
```

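If you're using the operators, be sure to set your parentheses correctly because of operator precedence.

There are several logical NumPy functions which should work on pandas.Series.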
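
For instance, numpy.logical_not and numpy.logical_xor can be applied the same way. A small sketch with made-up boolean Series `x` and `y`:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
x = pd.Series([True, True, False])
y = pd.Series([True, False, False])

negated = np.logical_not(x)       # same result as ~x
exclusive = np.logical_xor(x, y)  # same result as x ^ y
print(negated.tolist(), exclusive.tolist())
```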

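The alternatives mentioned in the exception are more suited if you encountered it when doing if or while. I'll briefly explain each of these:

If you want to check if your Series is empty: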

>>> x = pd.Series([])
>>> x.empty
True
>>> x = pd.Series([1])
>>> x.empty
False

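Python normally interprets the length of containers (like list, tuple, ...) as their truth value if they have no explicit Boolean interpretation. So if you want the Python-like check, you could use if x.size or if not x.empty instead of if x.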
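
As a short sketch of that Python-like check (the helper name `describe` is made up for this example):

```python
import pandas as pd


def describe(series):
    # `if series:` would raise; `series.empty` is unambiguous.
    if not series.empty:
        return f"{series.size} value(s)"
    return "empty"


empty_result = describe(pd.Series([], dtype="float64"))
filled_result = describe(pd.Series([1, 2, 3]))
print(empty_result, filled_result)
```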

If your Series contains one and only one Boolean value:

>>> x = pd.Series([100])
>>> (x > 50).bool()
True
>>> (x < 50).bool()
False

If you want to check the first and only item of your Series (like .bool(), but it works even for non-Boolean contents):

>>> x = pd.Series([100])
>>> x.item()
100
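
.item() only works for a Series with exactly one element. The sketch below, with made-up data, shows the failure for a longer Series and also uses .item() on a one-element boolean Series, which may be handy because recent pandas releases deprecate .bool() (to my knowledge, starting with 2.1):

```python
import pandas as pd

# Hypothetical data for illustration only.
single = pd.Series([100])
flag = (single > 50).item()  # a plain Python bool

try:
    pd.Series([1, 2]).item()  # more than one element
except ValueError as exc:
    error = str(exc)

print(flag)
```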

If you want to check if all or any item is not-zero, not-empty or not-False:

>>> x = pd.Series([0, 1, 2])
>>> x.all()   # Because one element is zero
False
>>> x.any()   # because one (or more) elements are non-zero
True