Is there a way to use a memoryview with regular expressions in Python 2?

Question

In Python 3, the re module can be used with a memoryview:

~$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = b"abc"
>>> import re
>>> re.search(b"b", memoryview(x))
<_sre.SRE_Match object at 0x7f14b5fb8988>
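
As a small illustration of the behaviour shown in the session above, the following sketch searches a slice of a memoryview in Python 3. The sample data, the pattern, and the names `view`, `offset_in_data` and `value` are made up for this example. One thing worth noting is that match positions are relative to the slice, so the slice offset has to be added back to get a position in the original bytes.

```python
import re

data = b"key=1;key=22;key=333"
view = memoryview(data)             # wraps the bytes without copying them
pattern = re.compile(br"key=(\d+)")

# Slicing a memoryview yields another view, not a new bytes object.
match = pattern.search(view[6:])

# Offsets reported by the match are relative to the slice, not to `data`.
offset_in_data = 6 + match.start()

# bytes() normalises the group to a bytes object whatever type re returns.
value = bytes(match.group(1))

print(offset_in_data, value)
```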

In Python 2, however, this does not seem to be the case:

~$ python
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "abc"
>>> import re
>>> re.search(b"b", memoryview(x))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

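I can cast the string to a buffer, but the buffer documentation does not explain in detail how a buffer behaves compared to a memoryview.

An empirical comparison shows that using a buffer object in Python 2 does not give the performance benefit that a memoryview gives in Python 3: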

playground$ cat speed-test.py
import timeit
import sys

print(timeit.timeit("regex.search(mv[10:])", setup='''
import re
import sys  # imported explicitly; the setup string should not rely on timeit's own globals
regex = re.compile(b"ABC")
PYTHON_3 = sys.version_info >= (3, )
if PYTHON_3:
    mv = memoryview(b"Can you count to three or sing 'ABC?'" * 1024)
else:
    mv = buffer(b"Can you count to three or sing 'ABC?'" * 1024)
'''))
playground$ python2.7 speed-test.py
2.33041596413
playground$ python2.7 speed-test.py
2.3322429657
playground$ python3.2 speed-test.py
0.381270170211792
playground$ python3.2 speed-test.py
0.3775448799133301
playground$

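If I change the regex.search argument from mv[10:] to mv, Python 2 performs about the same as Python 3, but the code I am writing does a lot of repeated string slicing.

Is there a way to get around this in Python 2 while still keeping the zero-copy performance benefits of a memoryview?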

- Eric Pruitt

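memoryview supports the buffer protocol in Python 2. I think the most fundamental difference is how re obtains the buffer pointer in Python 2 versus Python 3. There is a dedicated PEP for this change; have a look at PEP 3118. - Seyeong Jeong

But why would you want to use a memoryview with re.search? I don't think you will gain any performance benefit from it.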

~ » python3 -m timeit 'import re; x = b"abc"; re.search(b"b", memoryview(x))'
100000 loops, best of 3: 2.25 usec per loop

~ » python3 -m timeit 'import re; x = b"abc"; re.search(b"b", x)'
1000000 loops, best of 3: 1.79 usec per loop

- Seyeong Jeong

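@SeyeongJeong, that is not a very good test. You import the re module inside every loop, and you also create a new memoryview object on every call. In my use case, I call re.search repeatedly at different offsets into the string. With a_string[offset:], Python creates a new string every time, but with a_memoryview[offset:], Python reuses the existing buffer even though the caller asked for a slice. - Eric Pruitt

@SeyeongJeong, I have updated my post with a speed test that is closer to my scenario than yours. - Eric Pruitt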
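
The copy-versus-view distinction described in the comments above can be checked directly in Python 3. The sketch below is only an illustration: it uses a bytearray (an assumption made so that the underlying storage can be modified in place) and hypothetical names `copy_slice` and `view_slice`. A change to the source is visible through the memoryview slice but not through the slice that was copied.

```python
data = bytearray(b"Can you count to three or sing 'ABC?'")

copy_slice = bytes(data)[10:]        # slicing bytes allocates a new object
view_slice = memoryview(data)[10:]   # slicing a memoryview shares the storage

# Overwrite one byte of the source in place (no resize, so this is allowed
# while the memoryview is alive).
data[10] = ord("U")

print(bytes(view_slice[:3]))  # reflects the change made to `data`
print(copy_slice[:3])         # still holds the bytes copied earlier
```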

1 Answer

- poke · Accepted Answer

The way I understand the buffer object in Python 2, you are supposed to use it without slicing:

>>> s = b"Can you count to three or sing 'ABC?'"
>>> str(buffer(s, 10))
"unt to three or sing 'ABC?'"

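So instead of slicing the resulting buffer, you use the buffer function directly to select the substring you are interested in, which gives you fast access: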

import timeit
import sys
import re

r = re.compile(b'ABC')
s = b"Can you count to three or sing 'ABC?'" * 1024

PYTHON_3 = sys.version_info >= (3, )
if len(sys.argv) > 1: # standard slicing
    print(timeit.timeit("r.search(s[10:])", setup='from __main__ import r, s'))
elif PYTHON_3: # memoryview in Python 3
    print(timeit.timeit("r.search(s[10:])", setup='from __main__ import r, s; s = memoryview(s)'))
else: # buffer in Python 2
    print(timeit.timeit("r.search(buffer(s, 10))", setup='from __main__ import r, s'))

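I get very similar results in Python 2 and 3, which suggests that using buffer with the re module like this has much the same effect as the newer memoryview (which then seems to be a lazily evaluated buffer):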

$ python2 .\speed-test.py
0.681979371561
$ python3 .\speed-test.py
0.5693422508853488

Compared with standard string slicing:

$ python2 .\speed-test.py standard-slicing
7.92006735956
$ python3 .\speed-test.py standard-slicing
7.817641705304309

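If you want to support slice access (so that you can use the same syntax everywhere), you could easily create a type that dynamically creates a new buffer whenever you slice it: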

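class slicingbuffer(object):
    """Python 2 only: slicing returns a new buffer over the same source."""
    def __init__(self, source):
        self.source = source
    def __getitem__(self, index):
        if not isinstance(index, slice):
            # single index: a one-byte buffer at that offset
            return buffer(self.source, index, 1)
        start = index.start or 0  # s[:n] has start None
        if index.stop is None:
            return buffer(self.source, start)
        # negative indices and steps are not handled here
        return buffer(self.source, start, max(index.stop - start, 0))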

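If you only use it with the re module, it could probably serve as a drop-in replacement for memoryview. My tests show, however, that this already adds a lot of overhead. So you might want to do it the other way around and wrap Python 3's memoryview objects in a wrapper that gives them the same interface as buffer: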

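def memoryviewbuffer(source, offset, size=None):
    # Mirror buffer(object, offset, size): size is a length; omitting it means "to the end".
    end = None if size is None else offset + size
    return source[offset:end]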

PYTHON_3 = sys.version_info >= (3, )
if PYTHON_3:
    b = memoryviewbuffer
    s = memoryview(s)
else:
    b = buffer

print(timeit.timeit("r.search(b(s, 10))", setup='from __main__ import r, s, b'))
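
To show how such a wrapper might be used for the repeated-offset scenario from the question, here is a minimal sketch. The helper names `view` and `find_all` are hypothetical and not part of any library, and only the Python 3 branch has been exercised here; the Python 2 branch assumes the built-in buffer(object, offset, size) call used earlier in this answer.

```python
import re
import sys

if sys.version_info >= (3,):
    def view(source, offset, size=None):
        # Zero-copy slice of a memoryview; size is a length, as in buffer().
        end = None if size is None else offset + size
        return memoryview(source)[offset:end]
else:
    view = buffer  # Python 2 built-in taking (object, offset, size)

def find_all(pattern, data):
    """Yield the absolute offset of each non-overlapping match in data."""
    offset = 0
    while offset <= len(data):
        match = pattern.search(view(data, offset))
        if match is None:
            return
        yield offset + match.start()
        # Advance past the match; step at least one byte so an empty
        # match cannot cause an endless loop.
        offset += max(match.end(), 1)

print(list(find_all(re.compile(b"ABC"), b"xxABCxxABCABC")))
```

Each search runs on a view that starts at the current offset, so no part of the data is copied between iterations; the offsets are converted back to positions in the original data by adding the view's starting offset.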