为什么编译后的Python正则表达式速度较慢？

Question

为什么编译后的Python正则表达式速度较慢？

11

在另一个SO问题中，比较了正则表达式和Python的in运算符的性能。但是，被接受的答案使用re.match仅匹配字符串的开头，因此与in完全不同。此外，我想看看不重新编译RE每次都可以获得的性能提升。

令人惊讶的是，我发现预编译版本似乎更加慢。有任何想法吗？

我知道这里有很多其他问题都在思考类似的问题。它们中的大部分之所以表现出现有的方式，只是因为它们没有正确地重复使用已编译的正则表达式。如果这也是我的问题，请解释一下。

from timeit import timeit
import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

def find():
    assert text.find(pattern) > -1

def re_search():
    assert re.search(pattern, text)

def re_compiled():
    assert re.search(compiled_pattern, text)

def in_find():
    assert pattern in text

print('str.find     ', timeit(find))
print('re.search    ', timeit(re_search))
print('re (compiled)', timeit(re_compiled))
print('in           ', timeit(in_find))

输出:

str.find      0.36285957560356435
re.search     1.047689160564772
re (compiled) 1.575113873320307
in            0.1907925627077569

- polwel

8

你应该使用compiled_pattern.search(text)而不是re.search(compiled_pattern, text)。 - asongtoruin

1

compiled_pattern.search(text)会快3倍。不过，很奇怪的是，re.search(compiled_pattern, text)似乎不能做到同样的效果。 - tobias_k

1

@tobias_k 看一下这个声明，调用 re.search 只是简单地执行 return _compile(pattern, flags).search(string)。没有检查模式是否已经编译。 - asongtoruin

3

可以的。看 re.py:294。(Py3.6的行号)。 - polwel

1

值得一提的是，我链接的那个问题有 Python 核心开发者 Raymond Hettinger 的回答，但它添加得比较晚，所以不在页面顶部。 - PM 2Ring

显示剩余5条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Eric Duminil · Accepted Answer

简短回答

如果您直接调用compiled_pattern.search(text)，它不会调用_compile，因此它比re.search(pattern, text)快得多，并且比re.search(compiled_pattern, text)更快。

这种性能差异是由于缓存中的KeyError和编译模式的慢哈希计算引起的。

`re`函数和`SRE_Pattern`方法

每当调用具有pattern作为第一个参数的re函数（例如re.search(pattern, string)或re.findall(pattern, string)）时，Python 首先尝试使用_compile编译pattern，然后调用编译模式对应的方法。例如：代码：

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

请注意，pattern 可以是字符串或已编译的模式（SRE_Pattern 实例）。

_compile

这是 _compile 的简化版本。我只是删掉了调试和标志检查：

_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512

def _compile(pattern, flags):
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    loc = None
    _cache[type(pattern), pattern, flags] = p, loc
    return p

`_compile`和字符串模式

当使用字符串模式调用_compile时，编译后的模式将保存在_cache字典中。下次调用相同函数时（例如在许多timeit运行期间），_compile简单地检查_cache是否已经看到过此字符串，并返回相应的编译模式。

使用Spyder中的ipdb调试器，在执行过程中轻松深入了解re.py。

import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

re.search(pattern, text)
re.search(pattern, text)

在第二个re.search(pattern, text)处设置断点后，可以看到这对：

{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}

存储在_cache中。编译后的模式将直接返回。

使用已编译的模式进行`_compile`

慢速哈希

如果已经编译了一个模式，调用_compile会发生什么？

首先，_compile会检查模式是否在_cache中。为此，它需要计算其哈希值。对于编译后的模式，这种计算比字符串要慢得多：

In [1]: import re

In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
   ...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
   ...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
   ...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
   ...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
   ...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
   ...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
   ...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"

In [3]: compiled_pattern = re.compile(pattern)

In [4]: % timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: % timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

hash(compiled_pattern)比hash(pattern)慢60倍。

`KeyError`

当一个未知的pattern出现时，_cache[type(pattern), pattern, flags]会出现KeyError错误。

KeyError被处理并忽略。然后，_compile才检查该模式是否已经编译。如果已经编译，则返回该模式，而不写入缓存。

这意味着下一次使用相同的已编译模式调用_compile时，它将再次计算无用的、缓慢的哈希值，但仍然会出现KeyError错误。

错误处理是昂贵的，我想这可能是为什么re.search(compiled_pattern, text)比re.search(pattern, text)慢的主要原因。

这种奇怪的行为可能是为了加速字符串模式的调用，但如果_compile使用已经编译过的模式，可能写一个警告是个好主意。

为什么编译后的Python正则表达式速度较慢？

简短回答

re函数和SRE_Pattern方法

_compile

_compile和字符串模式

使用已编译的模式进行_compile

慢速哈希

KeyError

`re`函数和`SRE_Pattern`方法

`_compile`和字符串模式

使用已编译的模式进行`_compile`

`KeyError`