How to split long regular expression rules across multiple lines in Python

Question

65

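Is this actually doable? I have some very long regex pattern rules that are hard to understand because they don't fit on the screen at once. For example: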

test = re.compile(
    '(?P<full_path>.+):\d+:\s+warning:\s+Member\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) of (class|group|namespace)\s+(?P<class_name>.+)\s+is not documented'
        % (self.__MEMBER_TYPES),
    re.IGNORECASE)

Neither backslashes nor triple quotes work.

- Makis

11

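re.VERBOSE lets you include comments in the regular expression, which makes it more readable. - jfs

@J.F. Sebastian: I have to +1 that for re.DEBUG, it will make my future life so much easier! - Makis

@J.F.Sebastian: I upvoted your answer behind the link, since in the end I used it anyway, although it needed more editing (I had to make sure every space was marked correctly). - Makis

By the way, @N3dst4's answer offers a better alternative to (?x) if you want to keep syntax highlighting. Also, you can use [ ] or "\ " to escape a space. - jfs

@eyquem: re.DEBUG means 'dump the pattern after compiling'. - jfs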
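
For readers who have not seen re.DEBUG mentioned in the comments above, here is a minimal sketch of what it does. The pattern is just an illustrative one, and capturing stdout is only done so the dump can be inspected; normally you would simply read it in the console.

```python
import contextlib
import io
import re

# re.compile prints a dump of the parsed pattern when re.DEBUG is set.
# Redirect stdout so the dump ends up in a string we can look at.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    re.compile(r"\d+\.\d*", re.DEBUG)

dump = buffer.getvalue()
print(dump)
```

The exact layout of the dump differs between Python versions, so treat it as a debugging aid rather than a stable format.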

6 Answers

29

From the docs, String literal concatenation:

Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example:
re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )
Note that this feature is defined at the syntactical level, but implemented at compile time. The ‘+’ operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).
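
As a rough illustration of the quoted passage, the sketch below joins fragments with different quoting styles and then formats the result. MEMBER_TYPES and the sample text are hypothetical stand-ins, since the question does not show what self.__MEMBER_TYPES contains.

```python
import re

# Hypothetical stand-in for self.__MEMBER_TYPES from the question.
MEMBER_TYPES = "function|variable"

# Adjacent literals are joined into a single string at compile time,
# so % formats the whole pattern, not only the last fragment.
pattern = (
    r"(?P<member_name>\w+)\s+"      # raw string, double quotes
    r'\((?P<member_type>%s)\)'      # raw string, single quotes
    ' of class'                     # plain string
) % MEMBER_TYPES

regex = re.compile(pattern, re.IGNORECASE)
match = regex.search("Member draw (FUNCTION) of class Shape")
print(match.group("member_name"), match.group("member_type"))
```

Each fragment can carry its own comment, which is the main readability gain over one long literal.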

- N3dst4

21

Use the re.X or re.VERBOSE flag. Besides saving quotes, this method is also portable to other regex implementations such as Perl.

From the docs:

re.X

re.VERBOSE

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
Corresponds to the inline flag (?x).
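
The following sketch checks the documented equivalence on a few sample inputs and shows the whitespace rule in action. The sample strings and the names loose and strict are only illustrative.

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

# Both objects should accept and reject the same inputs.
for text in ("3.14", "42.", "abc"):
    assert bool(a.fullmatch(text)) == bool(b.fullmatch(text))

# Under re.X an unescaped space in the pattern is dropped, so a literal
# space has to be written as "\ " or "[ ]".
loose = re.compile(r"is not documented", re.X)      # behaves like "isnotdocumented"
strict = re.compile(r"is\ not[ ]documented", re.X)  # matches real spaces

assert loose.search("is not documented") is None
assert loose.search("isnotdocumented") is not None
assert strict.search("is not documented") is not None
```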

P.S.: In an earlier edit, the asker indicated that they ended up using this solution.

- Thomas Guyot-Sionnest

8

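Personally, I don't use re.VERBOSE because I don't like having to escape spaces, and I don't want to write '\s' in place of a space when '\s' isn't required.
The more precisely the symbols in a regex pattern correspond to the character sequences that must be caught, the faster the regex object works. I almost never use '\s'.

To avoid re.VERBOSE, you can do what has already been suggested: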

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' # comment
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented'\
% (self.__MEMBER_TYPES),

re.IGNORECASE)

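Pushing the strings to the left leaves plenty of room for comments.

But this approach does not work well when the pattern is very long, because it is not possible to write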

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' % (self.__MEMBER_TYPES)  # !!!!!! INCORRECT SYNTAX !!!!!!!
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented',

re.IGNORECASE)

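And if the pattern is very long, there may be many lines between the final % (self.__MEMBER_TYPES) and the string '(?P<member_type>%s)' it applies to, which makes the pattern hard to read.

That is why I like to use a tuple to write a very long pattern: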

pat = ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % (self.__MEMBER_TYPES), # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

This approach also makes it possible to define the pattern as a function:

def pat(x):

    return ''.join((\
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % x , # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

test = re.compile(pat(self.__MEMBER_TYPES), re.IGNORECASE)
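
A self-contained sketch of the same idea, runnable outside a class, might look like this. MEMBER_TYPES, build_pattern and the sample line are made up for illustration, and raw strings are used here so the backslashes are passed to the regex engine unchanged.

```python
import re

# Hypothetical value; the question never shows what self.__MEMBER_TYPES holds.
MEMBER_TYPES = "function|variable|typedef"

def build_pattern(member_types):
    """Join the fragments of the pattern into one regex string."""
    return "".join((
        r"(?P<full_path>.+)",
        r":\d+:\s+warning:\s+Member\s+",
        r"(?P<member_name>.+)",
        r"\s+\(",
        r"(?P<member_type>%s)" % member_types,  # formatted right where it is used
        r"\) of ",
        r"(class|group|namespace)",
        r"\s+",
        r"(?P<class_name>.+)",
        r"\s+is not documented",
    ))

test = re.compile(build_pattern(MEMBER_TYPES), re.IGNORECASE)

# Made-up sample line in the shape the pattern expects.
line = "src/shape.h:12: warning: Member draw (function) of class Shape is not documented"
m = test.match(line)
print(m.group("full_path"), m.group("member_name"),
      m.group("member_type"), m.group("class_name"))
```

Because every fragment is a separate tuple element, the % formatting stays on the line it belongs to, which is the point made above.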

- eyquem

2

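You can use string concatenation as in naeg's answer, or you can use re.VERBOSE/re.X, but be careful: with this option, whitespace and comments in the pattern are ignored. Your regex contains some literal spaces, so those would be ignored too; you need to either escape them or use \s.

For example: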

test = re.compile(
    """
        (?P<full_path>.+):\d+: # some comment
        \s+warning:\s+Member\s+(?P<member_name>.+) #another comment
        \s+\((?P<member_type>%s)\)\ of\ (class|group|namespace)\s+
        (?P<class_name>.+)\s+is\ not\ documented
    """ % (self.__MEMBER_TYPES),
    re.IGNORECASE | re.X)

- stema

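I tried this first, but it didn't work. Maybe I made a mistake somewhere, but my first thought was that Python includes the whitespace. At least when I print it that way, the whitespace gets printed as well. - Makis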
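
One possible explanation for the observation in the comment above: the triple-quoted string really does contain the newlines and indentation, and they are only skipped when the pattern is compiled with re.X. The sketch below illustrates this; MEMBER_TYPES and the sample line are hypothetical.

```python
import re

MEMBER_TYPES = "function|variable"  # hypothetical stand-in

source = r"""
    (?P<full_path>.+):\d+:            # file and line number
    \s+warning:\s+Member\s+
    (?P<member_name>.+)
    \s+\((?P<member_type>%s)\)
    \ of\ (class|group|namespace)\s+  # literal spaces are escaped
    (?P<class_name>.+)
    \s+is\ not\ documented
""" % MEMBER_TYPES

# The string itself still contains the newlines and indentation ...
assert "\n" in source

# ... and they are ignored only when re.X is passed at compile time.
verbose = re.compile(source, re.IGNORECASE | re.X)
plain = re.compile(source, re.IGNORECASE)

line = "src/shape.h:12: warning: Member draw (function) of class Shape is not documented"
assert verbose.match(line) is not None
assert plain.match(line) is None
```

So printing the pattern string shows the whitespace either way; whether it matters depends on the flag.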

1

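The Python compiler automatically concatenates adjacent string literals. So one way to do this is to break your regular expression into multiple strings, one per line, and let the Python compiler put them back together. Whitespace between the strings does not matter, so you can use line breaks and even leading spaces to align the fragments meaningfully.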

- Ben

- naeg · Accepted Answer

You can split your regex pattern by quoting each segment separately. No backslashes are needed.

test = re.compile(
    ('(?P<full_path>.+):\d+:\s+warning:\s+Member'
     '\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
     'of (class|group|namespace)\s+(?P<class_name>.+)'
     '\s+is not documented'
    ) % (self.__MEMBER_TYPES),
    re.IGNORECASE)

You can also use the raw string prefix 'r'; you have to put it in front of each segment.

See the docs: String literal concatenation.
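
To see why the prefix is needed on every segment, consider this small sketch. The pattern and sample text are made up; the point is only that "\b" means something different in a raw and a non-raw fragment.

```python
import re

# Only the first fragment is raw; in the second, "\b" is a backspace character.
mixed = r"\bwarning" "\b"
# Both fragments are raw; "\b" stays a regex word boundary.
raw = r"\bwarning" r"\b"

assert mixed == "\\bwarning\x08"
assert raw == "\\bwarning\\b"

assert re.search(raw, "a warning here") is not None
assert re.search(mixed, "a warning here") is None
```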