Element-wise XOR in pandas

17
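I know that in a Pandas Series logical AND is written with & and logical OR with |, but I am looking for an element-wise logical XOR. I could express it in terms of AND and OR, but if an XOR is available I would rather use that.
Thanks!
2 Answers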

23

Python XOR: a ^ b

Numpy logical XOR: np.logical_xor(a,b)
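For anyone who wants to try both spellings on an actual Series, here is a minimal sketch. The Series `a` and `b` and their values are made up for illustration; the last line is the AND/OR formulation the question mentions, included only to check that all three agree.

```python
import numpy as np
import pandas as pd

# Illustrative boolean Series covering all four input combinations
a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

xor_op = a ^ b                     # operator form
xor_np = np.logical_xor(a, b)      # NumPy function form
xor_manual = (a | b) & ~(a & b)    # XOR built from AND, OR and NOT

print(xor_op.tolist())             # [False, True, True, False]
```

All three give the same result for boolean data.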

Performance test - the results are equal:

1. Random boolean series of size 10000

In [7]: a = np.random.choice([True, False], size=10000)
In [8]: b = np.random.choice([True, False], size=10000)

In [9]: %timeit a ^ b
The slowest run took 7.61 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 11 us per loop

In [10]: %timeit np.logical_xor(a,b)
The slowest run took 6.25 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 11 us per loop

2. Random boolean series of size 1000

In [11]: a = np.random.choice([True, False], size=1000)
In [12]: b = np.random.choice([True, False], size=1000)

In [13]: %timeit a ^ b
The slowest run took 21.52 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.58 us per loop

In [14]: %timeit np.logical_xor(a,b)
The slowest run took 19.45 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.58 us per loop

3. Random boolean series of size 100

In [15]: a = np.random.choice([True, False], size=100)
In [16]: b = np.random.choice([True, False], size=100)

In [17]: %timeit a ^ b
The slowest run took 33.43 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 614 ns per loop

In [18]: %timeit np.logical_xor(a,b)
The slowest run took 45.49 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 616 ns per loop

4. Random boolean series of size 10

In [19]: a = np.random.choice([True, False], size=10)
In [20]: b = np.random.choice([True, False], size=10)

In [21]: %timeit a ^ b
The slowest run took 86.10 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 509 ns per loop

In [22]: %timeit np.logical_xor(a,b)
The slowest run took 40.94 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 511 ns per loop
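
If you want to repeat the comparison outside IPython, something along these lines should work. It is only a rough sketch using the standard-library timeit module in place of %timeit; the seed, the loop counts, and the output formatting are my own choices, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

for size in (10, 100, 1000, 10000):
    a = rng.choice([True, False], size=size)
    b = rng.choice([True, False], size=size)

    # Both spellings must agree before the timings mean anything
    assert np.array_equal(a ^ b, np.logical_xor(a, b))

    n = 2000  # calls per timing run
    t_op = min(timeit.repeat(lambda: a ^ b, number=n, repeat=3)) / n
    t_fn = min(timeit.repeat(lambda: np.logical_xor(a, b), number=n, repeat=3)) / n
    print(f"size={size:>5}: a ^ b {t_op * 1e9:8.0f} ns | "
          f"np.logical_xor {t_fn * 1e9:8.0f} ns")
```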

1
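Python's XOR operator ^ is overloaded by the NumPy library to perform numpy.logical_xor internally. So readers should note that the performance results are equal because the two are the same operation. - Alok Nayak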
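
A caveat on the comment above: as far as I know, on NumPy arrays ^ dispatches to np.bitwise_xor, which gives the same answer as np.logical_xor for boolean arrays but not for integers. A small sketch with made-up arrays:

```python
import numpy as np

# Boolean arrays: the two spellings agree
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])
same_for_bools = np.array_equal(a ^ b, np.logical_xor(a, b))  # True

# Integer arrays: ^ is bitwise, logical_xor compares truthiness
x = np.array([1, 2, 3])
y = np.array([3, 2, 0])
bitwise = x ^ y                   # array([2, 0, 3])
logical = np.logical_xor(x, y)    # array([False, False,  True])
```

So the equivalence is safe to rely on only when both operands really are bool dtype.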

0
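I ran into a case where a^b and np.logical_xor(a,b) were not equivalent. It confused me for a while, but the fix turned out to be simple. Hopefully this saves someone else a headache.
I recently upgraded from Pandas 0.25.3 to 2.0.3 (and numpy from 1.19.0 to 1.24.4), which is what triggered the problem.
Suppose a is a DataFrame with duplicate values in its Index, and b is a Series of dtype bool where b.index == a.columns.
My intent is to broadcast b across a and XOR each row of a with b element-wise, with any duplicate values in a.index carried through to the output.
This code ran fine on my old setup...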
np.logical_xor(a,b.to_frame().T)

...but it fails on my new setup:

TypeError: '<' not supported between instances of 'Timestamp' and 'int'

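I believe this happens because something in the broadcasting tries to join the transposed b (whose index is a meaningless [0]) onto a, which has a timestamp index, and I believe the result then has to be sorted to make it monotonic.
The solution, as the asker of this question prompted me to consider, is: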
a^b
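
To make the setup concrete, here is a small sketch of what I mean. The column names, dates, and values are invented for illustration, and I have only reasoned about this against pandas 2.x behaviour, so treat it as an assumption rather than a guarantee for every version.

```python
import pandas as pd

# DataFrame with a timestamp index that contains a duplicate value
a = pd.DataFrame(
    {"x": [True, False, True], "y": [False, False, True]},
    index=pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
)

# Boolean Series whose index matches a.columns
b = pd.Series([True, False], index=["x", "y"])

# b is aligned on the columns of a and applied to every row
result = a ^ b
print(result)
```

The row index of result, duplicates included, is the same as that of a, because the alignment happens on the columns only.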

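The annoying/wonderful thing is that this also appears to work on my old pandas/numpy "production" setup. Incidentally, this was the first time I have ever used "git blame". The answer: "initial commit", 3 years ago, so either a^b did not work in some even earlier version of Pandas, or I simply did not know it existed.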
