看到一些很棒的回答后,
@arantius的评论(关于定时
$x
vs
x^
vs
(?!x)x
)对目前接受的答案产生了影响,让我想测试一下目前为止给出的解决方案。
使用@arantius的275k行标准,在Python(v3.5.2,IPython 6.2.1)中运行了以下测试。
总之: 'x^'
和'x\by'
是最快的,速度至少快16倍,与@arantius的发现相反,(?!x)x
是最慢的(慢约37倍)。因此,速度问题肯定与实现有关。如果速度对您很重要,请在提交之前在您的系统上自行测试。
更新:显然,"x^"和"a^"的时间差异很大。请参见
this question以获取更多信息,并查看先前的编辑,其中包括使用"a"而不是"x"的较慢时间。
In [1]: import re
In [2]: with open('/tmp/longfile.txt') as f:
...: longfile = f.read()
...:
In [3]: len(re.findall('\n',longfile))
Out[3]: 275000
In [4]: len(longfile)
Out[4]: 24733175
In [5]: for regex in ('x^','.^','$x','$.','$x^','$.^','$^','(?!x)x','(?!)','(?=x)y','(?=x)(?!x)',r'x\by',r'x\bx',r'^\b$'
...: ,r'\B\b',r'\ZNEVERMATCH\A',r'\Z\A'):
...: print('-'*72)
...: print(regex)
...: %timeit re.search(regex,longfile)
...:
x^
6.98 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
.^
155 ms ± 960 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
$x
111 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
$.
111 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
$x^
112 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
$.^
113 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
$^
111 ms ± 839 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
(?!x)x
257 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(?!)
203 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(?=x)y
204 ms ± 4.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(?=x)(?!x)
210 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
x\by
7.41 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
x\bx
7.42 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
^\b$
108 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
\B\b
387 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
\ZNEVERMATCH\A
112 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
\Z\A
112 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
第一次运行时,我忘记了对最后三个表达式进行原始字符串处理,因此'\b'被解释为退格符'\x08'。然而,令我惊讶的是,'a\x08c'比之前最快的结果还要快!公平地说,它仍然能匹配那段文本,但我认为这仍然值得注意,因为我不确定为什么它更快。
In [6]: for regex in ('x\by','x\bx','^\b$','\B\b'):
...: print('-'*72)
...: print(regex, repr(regex))
...:
...: print(re.search(regex,longfile))
...:
------------------------------------------------------------------------
y 'x\x08y'
5.32 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
None
------------------------------------------------------------------------
x 'x\x08x'
5.34 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
None
------------------------------------------------------------------------
$ '^\x08$'
122 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
None
------------------------------------------------------------------------
\ '\\B\x08'
300 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
None
我的测试文件是使用
"...可读内容和无重复行"公式创建的(在Ubuntu 16.04上)。
$ ruby -e 'a=STDIN.readlines;275000.times do;b=[];rand(20).times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > /tmp/longfile.txt
$ head -n5 /tmp/longfile.txt
unavailable speedometer's garbling Zambia subcontracted fullbacks Belmont mantra's
pizzicatos carotids bitch Hernandez renovate leopard Knuth coarsen
Ramada flu occupies drippings peaces siroccos Bartók upside twiggier configurable perpetuates tapering pint paralyzed
vibraphone stoppered weirdest dispute clergy's getup perusal fork
nighties resurgence chafe