df = df[(df['col'] < -0.25) or (df['col'] > 0.25)]
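But I got an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The or and and Python statements require truth values. For pandas, these are considered ambiguous, so you should use the "bitwise" | (or) or & (and) operations: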
df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]
These operators are overloaded for these kinds of data structures to yield the element-wise or or and.
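A minimal runnable sketch of the fix, assuming a small made-up DataFrame (only the column name col comes from the question; the values are illustrative):

```python
import pandas as pd

# Illustrative data; only the column name 'col' comes from the question.
df = pd.DataFrame({"col": [-0.5, -0.1, 0.0, 0.2, 0.3]})

# Each comparison yields a Boolean Series; | combines them row by row.
mask = (df["col"] < -0.25) | (df["col"] > 0.25)
filtered = df[mask]

print(mask.tolist())
print(filtered["col"].tolist())
```

The mask has one Boolean per row, which is exactly what the indexing operator needs; or would instead ask for a single truth value for the whole Series.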
To add some more explanation to this statement: the exception is raised when you want to get the bool of a pandas.Series:
>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You hit a place where the operator implicitly converted the operands to bool (you used or, but it also happens for and, if and while):
>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
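Besides these four statements, there are several Python functions that hide some bool calls (like `any`, `all`, `filter`, ...). These are normally not problematic with `pandas.Series`, but for completeness I wanted to mention them.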
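As a small illustration of why these built-ins usually do not raise, here is a sketch on a made-up flat Series: they iterate over the elements and take the truth value of each scalar, never of the Series itself.

```python
import pandas as pd

x = pd.Series([0, 1, 2])

# The built-ins iterate over the Series and call bool() on each scalar,
# never on the Series itself, so no ValueError is raised here.
builtin_any = any(x)
builtin_all = all(x)
nonzero = list(filter(None, x))

print(builtin_any, builtin_all, nonzero)
```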
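For an element-wise or of two Series (here x and y), you can use numpy.logical_or:
>>> import numpy as np
>>> np.logical_or(x, y)
or simply the | operator:
>>> x | y
For an element-wise and, use numpy.logical_and:
>>> np.logical_and(x, y)
or simply the & operator:
>>> x & y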
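If you're using the operators, make sure to set your parentheses correctly because of operator precedence.
There are several logical NumPy functions which should work on pandas.Series.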
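A short sketch of a few of those NumPy functions applied to Boolean Series; the Series x and y below are made-up examples, not data from the question:

```python
import numpy as np
import pandas as pd

x = pd.Series([True, True, False, False])
y = pd.Series([True, False, True, False])

# Each function works element by element and keeps the Series index.
either = np.logical_or(x, y)
both = np.logical_and(x, y)
exactly_one = np.logical_xor(x, y)
negated = np.logical_not(x)

print(either.tolist(), both.tolist(), exactly_one.tolist(), negated.tolist())
```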
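The alternatives mentioned in the exception are more suited if you encountered it when doing if or while. I'll briefly explain each of them:
If you want to check whether your Series is empty: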
>>> x = pd.Series([])
>>> x.empty
True
>>> x = pd.Series([1])
>>> x.empty
False
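Python normally interprets the length of containers (like list, tuple, ...) as the truth value if they have no explicit Boolean interpretation. So if you want the Python-like check, you could use if x.size or if not x.empty instead of if x.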
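A minimal sketch of that Python-like emptiness check; the helper name describe is hypothetical and only serves the illustration:

```python
import pandas as pd

def describe(series):
    # 'not series.empty' plays the role that 'if some_list:' plays for a list.
    if not series.empty:
        return "has %d item(s)" % series.size
    return "empty"

empty_result = describe(pd.Series([], dtype="float64"))
full_result = describe(pd.Series([1, 2, 3]))

print(empty_result)
print(full_result)
```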
If your Series contains one and only one Boolean value (note that Series.bool() is deprecated since pandas 2.1):
>>> x = pd.Series([100])
>>> (x > 50).bool()
True
>>> (x < 50).bool()
False
If you want to check the first and only item of your Series (like .bool(), but it works even for non-Boolean contents):
>>> x = pd.Series([100])
>>> x.item()
100
If you want to check whether all or any items are non-zero, non-empty or not False:
>>> x = pd.Series([0, 1, 2])
>>> x.all() # Because one element is zero
False
>>> x.any() # because one (or more) elements are non-zero
True
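pandas uses the bitwise operators & and |. Also, each condition should be wrapped in ( ).
This works:
data_query = data[(data['year'] >= 2005) & (data['year'] <= 2010)]
But the same query without the parentheses does not: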
data_query = data[(data['year'] >= 2005 & data['year'] <= 2010)]
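To see both behaviours side by side, here is a sketch on a made-up year column (the values are assumptions). Without the inner parentheses, & binds more tightly than the comparisons, so the expression becomes a chained comparison that needs the truth value of a Series.

```python
import pandas as pd

# Illustrative data; only the column name 'year' comes from the answer.
data = pd.DataFrame({"year": [2003, 2006, 2009, 2012]})

# With parentheses: two Boolean Series combined element-wise.
ok = data[(data["year"] >= 2005) & (data["year"] <= 2010)]

# Without them, Python reads the condition as
# data['year'] >= (2005 & data['year']) <= 2010.
try:
    data[data["year"] >= 2005 & data["year"] <= 2010]
    raised = None
except (ValueError, TypeError) as exc:  # exact type may depend on dtype/version
    raised = type(exc).__name__

print(ok["year"].tolist(), raised)
```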
For Boolean logic, use & and |.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
>>> df
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
2 0.950088 -0.151357 -0.103219
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
To see what is happening, you get a column of Booleans for each comparison, e.g.:
df.C > 0.25
0 True
1 False
2 False
3 True
4 True
Name: C, dtype: bool
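When you have multiple criteria, you will get multiple columns returned. This is why the join logic is ambiguous. Using and or or treats each column as a whole, so you first need to reduce that column to a single Boolean value. For example, to see whether any value or all values in each of the columns is True: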
# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()
True
# All values in either column is True?
(df.C > 0.25).all() or (df.C < -0.25).all()
False
>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
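This is quite a common question for beginners when making multiple conditions in Pandas. Generally speaking, there are two possible conditions causing this error:
Condition 1: Python operator precedence
There is a paragraph in Boolean indexing | Indexing and selecting data - pandas documentation that explains this:
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.
By default, Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).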
# Wrong
df['col'] < -0.25 | df['col'] > 0.25
# Right
(df['col'] < -0.25) | (df['col'] > 0.25)
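There are some possible ways to get rid of the parentheses, and I will cover this later.
Condition 2: Improper operator/statement
As explained in the previous quotation, you need to use | for or, & for and, and ~ for not.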
# Wrong
(df['col'] < -0.25) or (df['col'] > 0.25)
# Right
(df['col'] < -0.25) | (df['col'] > 0.25)
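Another possible situation is that you are using a Boolean Series in an if statement.
# Wrong
if pd.Series([True, False]):
    pass
The Python if statement accepts a Boolean-like expression rather than a Pandas Series. You should use pandas.Series.any or the methods listed in the error message to convert the Series to a single value according to your need.
For example:
# Right
if df['col'].eq(0).all():
    # If you want all column values equal to zero
    print('do something')
# Right
if df['col'].eq(0).any():
    # If you want at least one column value equal to zero
    print('do something')
Let's talk about ways to avoid the parentheses in the first situation.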
Use Pandas mathematical functions
Pandas has defined a lot of mathematical functions, including comparison, as follows:
pandas.Series.lt() for less than;
pandas.Series.gt() for greater than;
pandas.Series.le() for less than or equal;
pandas.Series.ge() for greater than or equal;
pandas.Series.ne() for not equal;
pandas.Series.eq() for equal;
As a result, you can use
df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]
# is equal to
df = df[df['col'].lt(-0.25) | df['col'].gt(0.25)]
If you want to select rows between two values, you can use pandas.Series.between:
df['col'].between(left, right) is equal to (left <= df['col']) & (df['col'] <= right);
df['col'].between(left, right, inclusive='left') is equal to (left <= df['col']) & (df['col'] < right);
df['col'].between(left, right, inclusive='right') is equal to (left < df['col']) & (df['col'] <= right);
df['col'].between(left, right, inclusive='neither') is equal to (left < df['col']) & (df['col'] < right);
df = df[(df['col'] > -0.25) & (df['col'] < 0.25)]
# is equal to
df = df[df['col'].between(-0.25, 0.25, inclusive='neither')]
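The equivalences listed above can be checked on a small made-up Series. This sketch assumes a pandas version in which inclusive accepts the strings 'both', 'left', 'right' and 'neither' (1.3 or later); the values are chosen so that both bounds appear in the data.

```python
import pandas as pd

col = pd.Series([-0.5, -0.25, 0.0, 0.25, 0.5])
left, right = -0.25, 0.25

# Method-based spellings: no parentheses needed around each condition.
both = col.between(left, right)  # default is inclusive='both'
neither = col.between(left, right, inclusive="neither")
left_only = col.between(left, right, inclusive="left")
right_only = col.between(left, right, inclusive="right")
outside = col.lt(left) | col.gt(right)

print(both.tolist())
print(neither.tolist())
print(outside.tolist())
```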
The documentation referenced before has a chapter, The query() Method, that explains this well.
pandas.DataFrame.query() can help you select rows of a DataFrame with a condition string. Within the query string, you can use both bitwise operators (& and |) and their boolean cousins (and and or). Moreover, you can omit the parentheses, but I don't recommend it for readability reasons.
df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]
# is equal to
df = df.query('col < -0.25 or col > 0.25')
pandas.DataFrame.eval() evaluates a string describing operations on DataFrame columns. Thus, we can use this method to build our multiple conditions. The syntax is the same as for pandas.DataFrame.query().
df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]
# is equal to
df = df[df.eval('col < -0.25 or col > 0.25')]
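As a sanity check, the following sketch compares the three spellings (Boolean mask, query() and eval()) on random data. The seed and the frame size are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"col": rng.normal(size=20)})

by_mask = df[(df["col"] < -0.25) | (df["col"] > 0.25)]
by_query = df.query("col < -0.25 or col > 0.25")
by_eval = df[df.eval("col < -0.25 or col > 0.25")]

# All three should select exactly the same rows.
same = by_mask.equals(by_query) and by_mask.equals(by_eval)
print(same)
```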
pandas.DataFrame.query() and pandas.DataFrame.eval() can do more than I describe here. I recommend reading their documentation and having fun with them.
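One of those extra features is referring to local Python variables inside the query string with an @ prefix. A minimal sketch, where lower and upper are hypothetical variable names and the data is made up:

```python
import pandas as pd

df = pd.DataFrame({"col": [-0.5, -0.1, 0.0, 0.2, 0.3]})

# Hypothetical local variables holding the thresholds.
lower, upper = -0.25, 0.25

# '@name' looks the name up in the calling Python scope.
outside = df.query("col < @lower or col > @upper")
print(outside["col"].tolist())
```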
Alternatively, you could use the operator module. More detailed information is in the Python documentation.
import operator
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.4438
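This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be more suitable in similar cases: using the query method: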
df = df.query("(col > 0.25) or (col < -0.25)")
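See also Indexing and selecting data.
(Some tests with the dataframe I'm currently working with suggest that this method is a bit slower than using the bitwise operators on Boolean series: 2 ms vs. 870 µs)
A word of warning: at least one situation where this is not straightforward is when column names happen to be Python expressions. I had columns named WT_38hph_IP_2, WT_38hph_input_2 and log2(WT_38hph_IP_2/WT_38hph_input_2), and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"
I obtained the following exception cascade: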
KeyError: 'log2'
UndefinedVariableError: name 'log2' is not defined
ValueError: "log2" is not a supported function
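I guess this happened because the query parser was trying to compute something from the first two columns instead of recognizing the expression as the name of the third column.
A possible workaround is proposed here (link).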
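One way to sidestep the query parser entirely (not necessarily the workaround behind the link above) is to fall back to plain Boolean masks, where the column name is just a string key. A sketch with made-up values for the three columns:

```python
import pandas as pd

# The column name is an ordinary string here, so nothing tries to parse it.
ratio_col = "log2(WT_38hph_IP_2/WT_38hph_input_2)"

# Illustrative values only; the column names come from the answer.
df = pd.DataFrame({
    "WT_38hph_IP_2": [10, 40, 80],
    "WT_38hph_input_2": [10, 10, 10],
    ratio_col: [0.0, 2.0, 3.0],
})

selected = df[(df[ratio_col] > 1) & (df["WT_38hph_IP_2"] > 20)]
print(selected.index.tolist())
```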
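If you have more than one value:
df['col'].all()
If it's only a single value:
df['col'].item()
I was getting an error with this command:
if df != '':
    pass
But it worked when I changed it to this:
if df is not '':
    pass
Note that is not compares object identity, so this condition is always true for a DataFrame: it avoids the exception but no longer tests the contents.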
display(df_degrees.loc[np.logical_and(df_degrees['person_id'] == '41d7853' , df_degrees['degree_type'] !='Certification')])
display(df_degrees.loc[df_degrees['person_id'] == '41d7853' and df_degrees['degree_type'] !='Certification'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I'll try to give a benchmark of the three most common ways (also mentioned above):
from timeit import repeat
setup = """
import numpy as np;
import random;
x = np.linspace(0,100);
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) * (x <= ub)]', 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'
for _ in range(3):
    for stmt in stmts:
        t = min(repeat(stmt, setup, number=100_000))
        print('%.4f' % t, stmt)
    print()
Result:
0.4808 x[(x > lb) * (x <= ub)]
0.4726 x[(x > lb) & (x <= ub)]
0.4904 x[np.logical_and(x > lb, x <= ub)]
0.4725 x[(x > lb) * (x <= ub)]
0.4806 x[(x > lb) & (x <= ub)]
0.5002 x[np.logical_and(x > lb, x <= ub)]
0.4781 x[(x > lb) * (x <= ub)]
0.4336 x[(x > lb) & (x <= ub)]
0.4974 x[np.logical_and(x > lb, x <= ub)]
However, * is not supported on a pandas Series, and a pandas DataFrame is much slower than a NumPy array (around 1000 times slower, see the numbers):
from timeit import repeat
setup = """
import numpy as np;
import random;
import pandas as pd;
x = pd.DataFrame(np.linspace(0,100));
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'
for _ in range(3):
    for stmt in stmts:
        t = min(repeat(stmt, setup, number=100))
        print('%.4f' % t, stmt)
    print()
Result:
0.1964 x[(x > lb) & (x <= ub)]
0.1992 x[np.logical_and(x > lb, x <= ub)]
0.2018 x[(x > lb) & (x <= ub)]
0.1838 x[np.logical_and(x > lb, x <= ub)]
0.1871 x[(x > lb) & (x <= ub)]
0.1883 x[np.logical_and(x > lb, x <= ub)]
Note that adding the line x = x.to_numpy() takes about 20 µs. For those who use %timeit:
import numpy as np
import pandas as pd
import random
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
lb, ub
x = pd.DataFrame(np.linspace(0,100))
def asterisk(x):
    # Convert to a NumPy array first, then combine the masks with *
    x = x.to_numpy()
    return x[(x > lb) * (x <= ub)]
def and_symbol(x):
    x = x.to_numpy()
    return x[(x > lb) & (x <= ub)]
def numpy_logical(x):
    x = x.to_numpy()
    return x[np.logical_and(x > lb, x <= ub)]
for i in range(3):
    %timeit asterisk(x)
    %timeit and_symbol(x)
    %timeit numpy_logical(x)
    print('\n')
Result:
23 µs ± 3.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.6 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
31.3 µs ± 8.9 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.4 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.9 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.7 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
25.1 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
36.8 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.2 µs ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
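Use | instead of or. - MaxU - stand with Ukraine
abs(result['var'])>0.25 - ColinMac
I got the same error message when using the max() function. Replacing it with numpy.maximum(), which takes the element-wise maximum of two values, solved my problem. - AstroFloyd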
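The last comment can be reproduced with a short sketch on a made-up Series: the built-in max() compares its arguments and needs a single truth value for the result, whereas numpy.maximum() works element by element.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.0, 0.5, 2.0])

# Built-in max() evaluates a comparison between 0 and the Series
# and then asks for its truth value, which is ambiguous.
try:
    max(s, 0)
    error = None
except ValueError as exc:
    error = str(exc)

# Element-wise maximum of each value and 0.
clipped = np.maximum(s, 0)

print(error)
print(clipped.tolist())
```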