Regex: matching an optional range of numbers

Question

Using Python's re module, I am trying to extract dollar amounts from statements like the following:

"$305,000 - $349,950" should return a tuple like (305000, 349950)
"Mid $2M's Buyers" --> (2000000)
"... Buyers Guide $1.29M+" --> (1290000)
"...$485,000 and $510,000" --> (485000, 510000)

The pattern below works for single amounts, but for a range (like the first and last examples above) it only returns the last number (i.e. 349950 and 510000).

_pattern = r"""(?x)
    ^
    .*
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    (?:\s(?:\-|\band\b|\bto\b)\s)?
    (?P<target2>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer2>[kKmM]?\s?[mM]?)
    )?
    .*?
    $
    """

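When I try target2 = match.group("target2").strip(), target2 always comes back as None.

I am by no means a regex expert, but I really can't see what I'm doing wrong here. The multiplier group works, and to me the target2 group looks like the same pattern, i.e. an optional match at the end.

I hope I've explained this clearly enough...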

- pandita

2 Answers

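+1 for using verbose mode for the regex.

The .* at the start of the pattern is greedy, so it first tries to match the whole line. It then backtracks until target1 can match. Everything else in the pattern is optional, so matching target1 against the last amount on the line already counts as a successful match. You could try making the first .* non-greedy by adding a ?, like this: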

_pattern = r"""(?x)
    ^
    .*?                   # <-- add the ?
    (?P<target1>
    ... snip ...
    """

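To make the greedy versus non-greedy behaviour concrete, here is a minimal sketch. The `amount` pattern is a simplified stand-in for the target1 group, not the full pattern from the question.

```python
import re

line = "$305,000 - $349,950"

# Simplified stand-in for the target1 group (an assumption for illustration).
amount = r"\$\d{1,3}(?:,\d{3})*"

# Greedy: .* swallows the whole line, then backtracks to the LAST amount.
greedy = re.match(r"^.*(" + amount + ")", line)

# Non-greedy: .*? consumes as little as possible, so the FIRST amount wins.
lazy = re.match(r"^.*?(" + amount + ")", line)

print(greedy.group(1))  # $349,950
print(lazy.group(1))    # $305,000
```

Note that switching to .*? only changes which amount is captured first. It does not by itself make the optional second group match.
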
Could you do it in steps?

_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    (?P<more>\s(?:\-|\band\b|\bto\b)\s)?
    """

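import re

_regex = re.compile(_pattern)
match = _regex.search(line)
# the pattern has three groups, so pick the two of interest by name
target1, more = match.group("target1", "more")
if more:
    # a compiled pattern's search() accepts a start position; re.search() does not
    target2 = _regex.search(line, match.end())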

Edit: one more idea: try using re.findall():

_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
"""

targets = re.findall(_pattern, line)
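As a quick illustration of what this returns: because the pattern contains two capturing groups, re.findall() yields one (target1, multiplyer1) tuple per match rather than a plain string. The sketch below uses a sample line from the question and an assumed clean-up step with re.finditer() and strip().

```python
import re

_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
"""

line = "$305,000 - $349,950"

# One tuple per match: (target1, multiplyer1).
pairs = re.findall(_pattern, line)
print(pairs)  # [('$305,000 ', ' '), ('$349,950', '')]

# Assumed clean-up: keep only target1 and drop the trailing whitespace
# that the multiplier group may have picked up.
targets = [m.group("target1").strip() for m in re.finditer(_pattern, line)]
print(targets)  # ['$305,000', '$349,950']
```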

- RootTwo

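Unfortunately the first option didn't work. It now reports the first number of the range, but the result is otherwise the same. I thought your second suggestion was a good idea, but it didn't work either. By the way, the syntax seems to be re.search(pattern, line). In any case, the "more" group always seems to be empty... - pandita

Fixed the call to re.search(). - RootTwo

Can you use re.findall() with just the pattern that matches target1? It should return a list of all matches. - RootTwo

No problem, I got it working :) Without the <more> group it returns, for example, ([('$305,000 ', ' '), ('$349,950', '')], 349950) for a range! I think there is some regex problem in the more group, but re.findall() works fine. If you add something about this to your answer, I'll mark it as accepted. Thanks for your help. - pandita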
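
A likely explanation for the empty "more" group, which neither poster confirms: the multiplier group ends in \s?[mM]?, so its \s? consumes the space after the first amount. The separator group then has no leading whitespace left to match, and because it is optional, the engine settles for an empty match instead of backtracking. The sketch below uses simplified, hypothetical patterns that differ only in the multiplier part.

```python
import re

line = "$305,000 - $349,950"

# Hypothetical simplified patterns (not the exact ones from the thread).
# Multiplier with \s? : it swallows the space before the dash.
with_space = r"(\$\d[\d,.]*)([kKmM]?\s?[mM]?)(\s(?:-|and|to)\s)?"
# Multiplier without \s? : the space is left for the separator group.
without_space = r"(\$\d[\d,.]*)([kKmM]?)(\s(?:-|and|to)\s)?"

print(re.search(with_space, line).groups())     # ('$305,000', ' ', None)
print(re.search(without_space, line).groups())  # ('$305,000', '', ' - ')
```

This also matches the ('$305,000 ', ' ') tuple reported in the comment above, where both target1 and the multiplier carry a trailing space.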

- Jan · Accepted Answer

You can solve this by combining some regex logic with a function that converts the abbreviated numbers into full numbers. Here is some sample Python code:

# -*- coding: utf-8 -*-
import locale
import re
from locale import atof

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')  # this locale must be installed on the system

string = """"$305,000 - $349,950"
"Mid $2M's Buyers"
"... Buyers Guide $1.29M+"
"...$485,000 and $510,000"
"""

def convert_number(number, unit):
    if unit == "K":
        exp = 10**3
    elif unit == "M":
        exp = 10**6
    return (atof(number) * exp)

matches = []
rx = r"""
    \$(?P<value>\d+[\d,.]*)         # match a dollar sign 
                                    # followed by numbers, dots and commas
                                    # make the first digit necessary (+)
    (?P<unit>M|K)?                  # match M or K and save it to a group
    (                               # opening parenthesis
        \s(?:-|and)\s               # match a whitespace, dash or "and"
        \$(?P<value1>\d+[\d,.]*)    # the same pattern as above
        (?P<unit1>M|K)?
    )?                              # closing parenthesis,
                                    # make the whole subpattern optional (?)
"""
for match in re.finditer(rx, string, re.VERBOSE):
    if match.group('unit') is not None:
        value1 = convert_number(match.group('value'), match.group('unit'))
    else:
        value1 = atof(match.group('value'))
    m = (value1)
    if match.group('value1') is not None:
        if match.group('unit1') is not None:
            value2 = convert_number(match.group('value1'), match.group('unit1'))
        else:
            value2 = atof(match.group('value1'))
        m = (value1, value2)
    matches.append(m)

print(matches)
# [(305000.0, 349950.0), 2000000.0, 1290000.0, (485000.0, 510000.0)]

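This code uses quite a bit of logic: it first imports the locale module to use the atof() function, defines a function called convert_number(), and searches for ranges with a regular expression that is explained in the code comments. You can obviously add other currency symbols such as €$£, but they did not appear in the original examples.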
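
As a rough sketch of that extension, the variant below accepts €, $ and £ and avoids the locale dependency by stripping commas by hand. The names parse_amounts, AMOUNT and MULTIPLIERS are hypothetical. It assumes "," is a thousands separator and "." a decimal point, and it collects every amount on the line without checking the separator between them.

```python
import re

# Assumed multipliers; only K and M appear in the examples above.
MULTIPLIERS = {"K": 10**3, "M": 10**6}

# Currency symbol, digits with optional commas and decimals, optional K/M suffix.
AMOUNT = re.compile(r"[€$£](?P<value>\d[\d,]*(?:\.\d+)?)(?P<unit>[KkMm])?")

def parse_amounts(text):
    """Return every amount found in text as a tuple of ints."""
    amounts = []
    for m in AMOUNT.finditer(text):
        value = float(m.group("value").replace(",", ""))
        unit = (m.group("unit") or "").upper()
        # round() guards against float artifacts such as 1.29 * 10**6
        amounts.append(round(value * MULTIPLIERS.get(unit, 1)))
    return tuple(amounts)

print(parse_amounts("$305,000 - $349,950"))       # (305000, 349950)
print(parse_amounts("... Buyers Guide $1.29M+"))  # (1290000,)
```

Unlike the code above, a single amount comes back as a one-element tuple, which keeps the return type uniform.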