Improving the performance of Python regular expressions

Question

3

I'm trying to improve the following regular expressions:

urlpath=columns[4].strip()
urlpath=re.sub(r"(\?.*|\/[0-9a-f]{24})","",urlpath)
urlpath=re.sub(r"\/[0-9\/]*","/",urlpath)
urlpath=re.sub(r"\;.*","",urlpath)
urlpath=re.sub(r"\/",".",urlpath)
urlpath=re.sub(r"\.api","api",urlpath)
if urlpath in dlatency:

This converts a URL like this:

/api/v4/path/apiCallTwo?host=wApp&trackId=1347158

to

api.v4.path.apiCallTwo

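I'd like to improve the regexes for better performance, since every 5 minutes this script runs across roughly 50,000 files and takes about 40 seconds overall.

Thanks.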

- coderwhiz

2

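Are you sure the regexes are the bottleneck in the script, and not the hard disk? - Fred Foo

Disk IO is fairly low. The script reads the log files line by line in reverse until it reaches a line that is more than 5 minutes old. - coderwhiz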

2

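Is this based on profiling the code or on a hunch? - hexparrot

iostat -kxd 2 shows very little disk IO while the script is running. - coderwhiz

This particular case is about URLs, so as others have answered, you can solve it with other tools. I once hit this slow-regex problem and waited more than two minutes for a substitution to finish. I installed the regex package, and it ran fast and worked well! You can download it here: https://pypi.python.org/pypi/regex - SomethingSomething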

6 Answers

2

A one-liner using urlparse:

urlpath = urlparse.urlsplit(url).path.strip('/').replace('/', '.')

- badzil
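For anyone on Python 3: the urlparse module was renamed there, so the one-liner above would look roughly like the sketch below. The function name url_to_key is just made up for illustration, and the input is the sample URL from the question.

```python
from urllib.parse import urlsplit  # Python 3 home of urlsplit


def url_to_key(url):
    # urlsplit separates the path from the query string and fragment
    path = urlsplit(url).path
    # Drop leading/trailing slashes, then turn the remaining ones into dots
    return path.strip("/").replace("/", ".")


if __name__ == "__main__":
    print(url_to_key("/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"))
    # prints: api.v4.path.apiCallTwo
```

Note that this only strips the query string; it does not remove numeric segments or 24-character hex IDs the way the regexes in the question do.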

2

Here is my one-liner solution (edited).

urlpath.partition("?")[0].strip("/").replace("/", ".")

As others have mentioned, the speed improvement here is negligible. Beyond precompiling the expressions with re.compile(), I would start looking elsewhere.

import re


# The five original patterns, compiled once up front
re1 = re.compile(r"(\?.*|\/[0-9a-f]{24})")
re2 = re.compile(r"\/[0-9\/]*")
re3 = re.compile(r"\;.*")
re4 = re.compile(r"\/")
re5 = re.compile(r"\.api")
def orig_regex(urlpath):
    urlpath=re1.sub("",urlpath)
    urlpath=re2.sub("/",urlpath)
    urlpath=re3.sub("",urlpath)
    urlpath=re4.sub(".",urlpath)
    urlpath=re5.sub("api",urlpath)
    return urlpath


# Single regex: join every run of non-slash characters with dots
myregex = re.compile(r"([^/]+)")
def my_regex(urlpath):
    return ".".join( x.group() for x in myregex.finditer(urlpath.partition('?')[0]))

# No regex at all, only string methods
def non_regex(urlpath):
    return urlpath.partition("?")[0].strip("/").replace("/", ".")

def test_func(func, iterations, *args, **kwargs):
    for i in range(iterations):
        func(*args, **kwargs)

if __name__ == "__main__":
    import cProfile as profile

    urlpath = u'/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
    profile.run("test_func(orig_regex, 10000, urlpath)")
    profile.run("test_func(my_regex, 10000, urlpath)")
    profile.run("test_func(non_regex, 10000, urlpath)")

Results

Iterating orig_regex 10000 times
     60003 function calls in 0.108 CPU seconds

....

Iterating my_regex 10000 times
     130003 function calls in 0.087 CPU seconds

....

Iterating non_regex 10000 times
     40003 function calls in 0.019 CPU seconds

Results for your 5 regexes without re.compile:

running <function orig_regex at 0x100532050> 10000 times
     210817 function calls (210794 primitive calls) in 0.208 CPU seconds

- jlujan
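One caveat on the numbers above: cProfile adds overhead to every function call, so a version that makes many small calls may look slower under the profiler than it really is. A rough cross-check with timeit could look like the sketch below. The three function names are made up for this illustration, and they only strip the query string rather than reproducing all five original substitutions. No timings are shown because they depend on the machine.

```python
import re
import timeit

URL = "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"
QUERY = re.compile(r"\?.*")  # compiled once, reused on every call


def with_module_sub(url):
    # re.sub has to look the pattern up in the module's cache on each call
    return re.sub(r"\?.*", "", url).strip("/").replace("/", ".")


def with_compiled(url):
    # Same substitution through the precompiled pattern object
    return QUERY.sub("", url).strip("/").replace("/", ".")


def with_str_methods(url):
    # No regex: cut at the first "?" and rewrite the slashes
    return url.partition("?")[0].strip("/").replace("/", ".")


def time_all(number=10000):
    # Returns {function name: seconds taken for `number` calls}
    funcs = (with_module_sub, with_compiled, with_str_methods)
    return {f.__name__: timeit.timeit(lambda f=f: f(URL), number=number)
            for f in funcs}


if __name__ == "__main__":
    for name, seconds in time_all().items():
        print("%-18s %.4f s" % (name, seconds))
```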

1

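Going line by line:

You aren't capturing or grouping, so you don't need the ( and ), and / isn't a special character in Python regexes, so it doesn't need escaping:

urlpath = re.sub(r"\?.*|/[0-9a-f]{24}", "", urlpath)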

Replacing a / followed by zero repetitions of something with a / is pointless, so use + instead of *:

urlpath = re.sub("/[0-9/]+", "/", urlpath)

Removing a fixed character and everything after it is faster with a string method:

urlpath = urlpath.partition(";")[0]

Replacing one fixed string with another is faster with a string method:

urlpath = urlpath.replace("/", ".")

urlpath = urlpath.replace(".api", "api")

- MRAB
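Put together, the line-by-line suggestions above could be assembled roughly like this. It is only a sketch: the function and constant names are invented for the example, the precompiled patterns are borrowed from the re.compile() advice in another answer, and the second sample URL is hypothetical. The loop at the end checks that the simplified version returns the same string as the original five substitutions for each single-line sample.

```python
import re

# The two patterns that still need a regex, compiled once
STRIP_QUERY_OR_ID = re.compile(r"\?.*|/[0-9a-f]{24}")
NUMERIC_SEGMENTS = re.compile(r"/[0-9/]+")


def original_steps(urlpath):
    # The five substitutions exactly as posted in the question
    urlpath = re.sub(r"(\?.*|\/[0-9a-f]{24})", "", urlpath)
    urlpath = re.sub(r"\/[0-9\/]*", "/", urlpath)
    urlpath = re.sub(r"\;.*", "", urlpath)
    urlpath = re.sub(r"\/", ".", urlpath)
    return re.sub(r"\.api", "api", urlpath)


def simplified_steps(urlpath):
    urlpath = STRIP_QUERY_OR_ID.sub("", urlpath)  # no group, no escaped /
    urlpath = NUMERIC_SEGMENTS.sub("/", urlpath)  # + instead of *
    urlpath = urlpath.partition(";")[0]           # drop ";" and what follows
    urlpath = urlpath.replace("/", ".")           # fixed-string replace
    return urlpath.replace(".api", "api")         # fixed-string replace


samples = [
    "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158",
    # Hypothetical URL with a 24-char hex ID, a numeric segment and a ";"
    "/api/v4/users/507f1f77bcf86cd799439011/orders/42;jsessionid=abc?x=1",
]

for url in samples:
    assert simplified_steps(url) == original_steps(url)
```

Keep in mind that both versions apply the last step to every ".api" in the string, not only to the one at the start.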

0

You can also compile your regexes to get better performance,

for example:

compiled_re_for_words = re.compile(r"\w+")
compiled_re_for_words.match("test")

- Jakob Bowyer

0

Are you sure you need regexes at all?
For example,

urlpath = columns[4].strip()
urlpath = urlpath.split("?")[0]
urlpath = urlpath.replace("/", ".")

- user1417475

- Óscar López · Accepted Answer

Try this:

s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
re.sub(r'\?.+', '', s).replace('/', '.')[1:]
> 'api.v4.path.apiCallTwo'

For better performance, compile the regex once and reuse it, like this:

regexp = re.compile(r'\?.+')
s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'

# `s` changes, but you can reuse `regexp` as many times as needed
regexp.sub('', s).replace('/', '.')[1:]

A simpler approach, without regexes:

s[1:s.index('?')].replace('/', '.')
> 'api.v4.path.apiCallTwo'
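One thing to watch with the slicing version: str.index raises ValueError when the URL has no '?' at all. If some log lines come without a query string, a variant built on str.partition (as used in another answer) might be safer. A minimal sketch, with strip_query as a made-up name:

```python
def strip_query(s):
    # partition returns the whole string as its first item when "?" is absent
    return s.partition("?")[0][1:].replace("/", ".")


with_query = "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"
without_query = "/api/v4/path/apiCallTwo"

print(strip_query(with_query))     # api.v4.path.apiCallTwo
print(strip_query(without_query))  # api.v4.path.apiCallTwo

try:
    without_query[1:without_query.index("?")]
except ValueError:
    print("index() raises ValueError when there is no '?'")
```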