Python: Regular expression not working properly

Question

3

I'm using the following regex. It should find the string 'U.S.A.', but it only returns 'A.'. Does anyone know what's wrong?

#INPUT
import re

text = 'That U.S.A. poster-print costs $12.40...'

print(re.findall(r'([A-Z]\.)+', text))

#OUTPUT
['A.']

Expected output:

['U.S.A.']

I'm working through chapter 3.7 of the NLTK book, which gives a set of regexes, but they don't work. I've tried them in both Python 2.7 and 3.4.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

nltk.regexp_tokenize() works the same as re.findall(), so I think my Python somehow doesn't interpret the regex as expected. The regex listed above produces this output:

[('', '', ''),
 ('A.', '', ''),
 ('', '-print', ''),
 ('', '', ''),
 ('', '', '.40'),
 ('', '', '')]

- LingxB

Since you haven't mentioned a general pattern, and if your only goal is to find "U.S.A.", then "(U.S.A.)" would be enough. - user2705585

See https://github.com/nltk/nltk/issues/1206, http://stackoverflow.com/questions/32300437/python-parsing-user-input-using-a-verbose-regex and https://dev59.com/6X3aa4cB1Zd3GeqPZzS-. - alvas

4 Answers

2

Remove the trailing +, or put it inside the group:

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)
['A.']              # wrong
>>> re.findall(r'([A-Z]\.)', text)
['U.', 'S.', 'A.']  # without '+'
>>> re.findall(r'((?:[A-Z]\.)+)', text)
['U.S.A.']          # with '+' inside the group

- Andrea Corbellini

1

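The problem is the "capturing group" (i.e. the parentheses), which has an unexpected effect on the result of findall(): when a capturing group is repeated within a single match, it retains only the text of its last repetition, and findall() returns the group contents rather than the whole match whenever the pattern contains a group. Specifically: the regex correctly matches the whole of U.S.A., but findall() discards that and returns only the last capture of the group.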
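
A minimal sketch of that behavior, using only the standard re module and the text from the question: the match object still holds the full match in group(0), while group(1) holds just the last repetition, which is what findall() reports.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# The pattern from the question: one capturing group, repeated with '+'.
match = re.search(r'([A-Z]\.)+', text)

whole = match.group(0)  # the entire match
last = match.group(1)   # only the last repetition of the group

# finditer() yields match objects, so the full match stays available
# even though the pattern contains a capturing group.
full_matches = [m.group(0) for m in re.finditer(r'([A-Z]\.)+', text)]

print(whole, last, full_matches)
```

So if the pattern itself cannot be changed, iterating with finditer() and reading group(0) is one way to recover the whole match.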

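As this answer says, the re module doesn't support repeated captures, but you could install the alternative regex module, which handles this correctly. (However, that won't help if you want to pass your regex to nltk.tokenize.regexp.)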

Anyway, to match U.S.A. correctly, use a non-capturing group: r'(?:[A-Z]\.)+'.

>>> re.findall(r'(?:[A-Z]\.)+', text)
['U.S.A.']

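You can apply the same fix to all the repeated patterns in the NLTK regex, and everything will work. As @alvas suggested, NLTK used to make this substitution behind the scenes, but the feature was dropped recently and replaced with a warning in the tokenizer's documentation. The book is clearly out of date; @alvas filed a bug report about it back in November, but it hasn't been acted on yet...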
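
As an illustration of applying that fix, here is the book's pattern with every capturing group rewritten as a non-capturing one and run through plain re.findall(). This is only a sketch against the standard library; it assumes, as the question states, that nltk.regexp_tokenize() behaves like re.findall() for this pattern.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Same alternatives as in the book, but with (?:...) instead of (...).
pattern = r'''(?x)          # set flag to allow verbose regexps
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():-_`]      # these are separate tokens; includes ], [
'''

# With no capturing groups left, findall() returns the whole matches.
tokens = re.findall(pattern, text)
print(tokens)
```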

- alexis

1

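The first part the regex matches is "U.S.A.", because ([A-Z]\.)+ repeats the group (the part within parentheses) three times. However, each group can return only one match, so Python picks the last match for that group.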

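If you instead change the regex so that the "+" is inside the group, the group matches only once and returns the full match. For example (([A-Z]\.)+) or ((?:[A-Z]\.)+). With findall(), the first form has two groups and so returns a tuple per match; the second returns just the string.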

If you want three separate results, just drop the "+" from the regex; it will then match one letter and one dot at a time.
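
The three variants can be compared side by side. This is a small sketch using the text from the question; the variable names are only for illustration.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Outer capturing group around a repeated inner capturing group:
# findall() returns one tuple per match, (outer group, inner group).
nested = re.findall(r'(([A-Z]\.)+)', text)

# Outer capturing group around a repeated non-capturing group:
# only one group, so findall() returns plain strings.
outer_only = re.findall(r'((?:[A-Z]\.)+)', text)

# No '+': each letter-plus-dot pair is its own match.
pieces = re.findall(r'([A-Z]\.)', text)

print(nested, outer_only, pieces)
```

With the nested form, the full match is the first element of each tuple.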

- Jonas Berlin

- alvas · Accepted Answer

Maybe it has something to do with how regexes used to be compiled with nltk.internals.compile_regexp_to_noncapturing(), which was dropped in v3.1, see here.

>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> 
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

But it doesn't work in NLTK v3.1:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

With a slight change to how you define the regex groups, you can get the same pattern to work in NLTK v3.1 using this regex:

pattern = r"""(?x)                   # set flag to allow verbose regexps
              (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
              |\$?\d+(?:\.\d+)?%?    # numbers, incl. currency and percentages
              |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
              |(?:[+/\-@&*])         # special characters with meanings
            """

In code:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x)                   # set flag to allow verbose regexps
... (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
... |\$?\d+(?:\.\d+)?%?    # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*])         # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Without NLTK, using Python's re module directly, we see that the old regex pattern isn't supported natively:

>>> import re
>>> pattern1 = r"""(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               |\w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               |[+/\-@&*]        # special characters with meanings
...               |\S\w*            # any other non-space character plus following word characters
... """            
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x)                   # set flag to allow verbose regexps
...                       (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
...                       |\$?\d+(?:\.\d+)?%?    # numbers, incl. currency and percentages
...                       |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
...                       |(?:[+/\-@&*])         # special characters with meanings
...                     """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Note: the change in how NLTK's RegexpTokenizer compiles regexes also makes the examples in NLTK's Regular Expression Tokenizer documentation obsolete.
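
For old-style patterns that are awkward to edit by hand, the substitution can be approximated with a small helper. The function name to_noncapturing below is hypothetical, and this is not NLTK's implementation. It is a naive sketch that assumes the pattern has no escaped backslash directly before a parenthesis and no literal "(" inside character classes or verbose comments.

```python
import re

def to_noncapturing(pattern):
    # Hypothetical helper: turn every plain '(' into '(?:'.
    # Leaves '\(' (an escaped parenthesis) and '(?...' constructs
    # such as '(?x)' or '(?:' untouched.
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

old_pattern = r'''(?x)       # set flag to allow verbose regexps
      ([A-Z]\.)+             # abbreviations, e.g. U.S.A.
    | \$?\d+(\.\d+)?%?       # numbers, incl. currency and percentages
    | \w+([-']\w+)*          # words w/ optional internal hyphens/apostrophe
    | [+/\-@&*]              # special characters with meanings
'''

new_pattern = to_noncapturing(old_pattern)

line = "My weight is about 68 kg, +/- 10 grams."
tokens = re.findall(new_pattern, line)
print(tokens)
```

Rewriting the groups by hand, as above, remains the safer route for anything beyond simple patterns like this one.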