Python re.sub non-greedy substitution fails when the string contains a newline

Question


python regex


I am running into a problem with regular expressions in Python (2.7.9).

I am trying to strip HTML <span> tags using the following regex:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, re.S)

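(The regex reads: match <span, then any characters other than >, then a >, then non-greedily any characters, then </span>, with re.S (re.DOTALL) so that . also matches newlines.)

This seems to work unless the text contains a newline. It looks as if re.S (DOTALL) does not apply inside the non-greedy match.

Here is the test code. Remove the newline from text1 and re.sub works; put it back and re.sub fails. Move the newline outside the <span> tag and re.sub works again.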

#!/usr/bin/env python
import re
text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
print repr(text1)
text2 = re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
print repr(text2)

For comparison, I wrote a Perl script that does the same thing; there the regex works the way I expect.

#!/usr/bin/perl
$text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>';
print "$text1\n";
$text1 =~ s/<span[^>]*>(.*?)<\/span>/\1/s;
print "$text1\n";

Any ideas?

Tested on Python 2.6.6 and Python 2.7.9.

- Andy Watkins

Another identical question: https://dev59.com/0nVD5IYBdhLWcg3wO5ED. This problem is very common. The answer: read the [documentation](https://docs.python.org/2/library/re.html#re.sub). - Wiktor Stribiżew

1 Answer


- falsetru · Accepted Answer

The fourth argument of re.sub is count, not flags.

re.sub(pattern, repl, string, count=0, flags=0)

You need to pass flags explicitly as a keyword argument:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, flags=re.S)
                                                      ↑↑↑↑↑↑
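Two other ways to avoid the positional-argument trap, sketched here as alternatives rather than as part of the original answer: compile the pattern with the flag, or put the inline flag (?s) at the start of the pattern. The variable names (span_re, result_compiled, result_inline) are illustrative only.

```python
import re

text = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
expected = '<body id="aa">this is a test\n with newline</body>'

# Option 1: compile with the flag. In re.compile the second positional
# parameter really is flags, so the flag travels with the pattern object.
span_re = re.compile(r'<span[^>]*>(.*?)</span>', re.S)
result_compiled = span_re.sub(r'\1', text)

# Option 2: inline flag (?s) at the start of the pattern turns on DOTALL
# without touching the re.sub argument list at all.
result_inline = re.sub(r'(?s)<span[^>]*>(.*?)</span>', r'\1', text)

print(result_compiled == expected)
print(result_inline == expected)
```

Either form keeps the flag attached to the pattern itself, which is probably the safer habit when the same regex is used in several places.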

When passed positionally, re.S is interpreted as the replacement count (at most 16 replacements) rather than as the S (DOTALL) flag:

>>> import re
>>> re.S
16

>>> text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
'<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, flags=re.S)
'<body id="aa">this is a test\n with newline</body>'
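The count interpretation can also be seen directly. The following is a small sketch, assuming a made-up input of 20 single-line spans so that DOTALL is irrelevant and only the count matters. It passes count=int(re.S) by keyword to mimic what the positional call does.

```python
import re

# 20 single-line spans: no newlines, so DOTALL makes no difference here.
text = '<span>x</span>' * 20
pattern = r'<span[^>]*>(.*?)</span>'

# Mimics re.sub(pattern, repl, text, re.S): the flag's integer value
# ends up in the count slot and caps the number of replacements.
limited = re.sub(pattern, r'\1', text, count=int(re.S))

# Intended call: flags passed by keyword, count left at 0 (unlimited).
unlimited = re.sub(pattern, r'\1', text, flags=re.S)

print(int(re.S))                  # 16
print(limited.count('<span>'))    # 4 spans left untouched
print(unlimited.count('<span>'))  # 0
```

So the mistaken call has two effects: DOTALL is never enabled, and on inputs with more than 16 matches the remaining ones are silently left in place.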