Extract only the digits that are inside a word from a string

Question

4

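I need a regular expression that returns only the digits that are inside words, but I can only find expressions that return every digit in the string.

I used this example:

text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'

The following code returns all the numbers, but I am only interested in ['5', '3', '4']:

import re
print(re.findall(r'\d+', text))

Any suggestions?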

- Kiri

2

So you only need the digits that are next to letters? re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)? - Wiktor Stribiżew

2 Answers

-1

An approach using str.translate, which needs neither regular expressions nor the re module:

from string import ascii_letters

delete_dict = {sp_character: '' for sp_character in ascii_letters}
table = str.maketrans(delete_dict)

text = 'I 77! need 1:5 this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'

print([res for s in text.rstrip('.').split()
       if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(table)) and res.isnumeric()])

Output:

['5', '3', '4']
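
The same idea can be spelled out step by step, which may make the filtering easier to follow. This is only a sketch: the function name digits_in_words and the per-token rstrip('.,') are assumptions made for illustration and differ slightly from the one-liner above, which strips the final period from the whole text first.

```python
from string import ascii_letters

# Translation table that deletes every ASCII letter (three-argument form of maketrans).
_DELETE_LETTERS = str.maketrans('', '', ascii_letters)


def digits_in_words(text):
    """Return the digit runs of tokens that mix ASCII letters and digits (hypothetical helper)."""
    found = []
    for token in text.split():
        token = token.rstrip('.,')      # drop trailing punctuation (assumption: only '.' and ',')
        if token.isnumeric():           # a standalone number such as '1' or '555' is skipped
            continue
        stripped = token.translate(_DELETE_LETTERS)
        if stripped.isnumeric():        # only digits remain, so the token was letters plus digits
            found.append(stripped)
    return found


sample = ('I 77! need 1:5 this number inside my wor5d, but also this word3 '
          'and this 4word, but not this 1 and not this 555.')
print(digits_in_words(sample))  # ['5', '3', '4']
```

Tokens such as 77! or 1:5 are rejected because something other than digits is left after the letters are removed.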

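Performance

I was curious, so I ran some benchmarks to compare this against the other approaches. It looks like str.translate is even faster than the regex implementations.

Here is my benchmark code, using timeit: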

import re
from string import ascii_letters
from timeit import timeit


_NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')

delete_dict = {sp_character: '' for sp_character in ascii_letters}
_TABLE = str.maketrans(delete_dict)

text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'


def main():
    n = 100_000

    # Outer string is raw so that \d is not treated as an (invalid) string escape.
    print('regex:         ', timeit(r"re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)",
                 globals=globals(), number=n))

    print('regex (opt):   ', (timeit("_NUM_RE.findall(text)",
                 globals=globals(), number=n)))

    print('iter_char:     ', timeit("""
k=set()
for x in range(1,len(text)-1):
    if text[x-1].isdigit() and text[x].isalpha():
        k.add(text[x-1])
    if text[x].isdigit() and text[x+1].isalpha():
        k.add(text[x])
    if text[x-1].isalpha() and text[x].isdigit() and text[x+1].isalpha():
        k.add(text[x])
    if text[x-1].isalpha() and text[x].isdigit():
        k.add(text[x])
    """, globals=globals(), number=n))

    print('str.translate: ', timeit("""
[
    res for s in text.rstrip('.').split()
    if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(_TABLE)) and res.isnumeric()
]
    """, globals=globals(), number=n))


if __name__ == '__main__':
    main()

Results in seconds for 100,000 runs each (Mac OS X - M1):

regex:          0.5315765410050517
regex (opt):    0.5069837079936406
iter_char:      2.5037198749923846
str.translate:  0.37348733299586456

- rv.kvetch

This is not random. It does not work for strings such as 1:5 or 55!. - Ryszard Czech

@RyszardCzech I misunderstood the question. Please see my updated code above. Note that it is still slightly faster than the regex approach. - rv.kvetch

- Wiktor Stribiżew · Accepted Answer

1

You can use

re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)

This regex extracts every chunk of one or more digits that is immediately preceded or immediately followed by an ASCII letter.
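
As a quick check, here is the pattern applied to the sample text from the question. The constant name ASCII_ADJACENT_DIGITS is only an illustrative choice.

```python
import re

# Digits preceded by an ASCII letter, or digits followed by an ASCII letter.
ASCII_ADJACENT_DIGITS = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')

text = ('I need this number inside my wor5d, but also this word3 '
        'and this 4word, but not this 1 and not this 555.')

matches = ASCII_ADJACENT_DIGITS.findall(text)
print(matches)  # ['5', '3', '4']
```

The standalone 1 and 555 are skipped because neither the lookbehind nor the lookahead finds a letter next to them.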

A fully Unicode-aware version for Python re looks like this:

(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])

[^\W\d_] matches any Unicode letter.
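
To illustrate the difference between the two patterns, here is a small sketch with digits next to non-ASCII letters. The sample string is made up for this example, and the letters are written as escapes so that they are guaranteed to be precomposed code points.

```python
import re

ascii_pattern = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
unicode_pattern = re.compile(r'(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])')

# 'noż5 and 9é but not 12', with ż (U+017C) and é (U+00E9) as single code points.
sample = 'no\u017c5 and 9\u00e9 but not 12'

ascii_hits = ascii_pattern.findall(sample)
unicode_hits = unicode_pattern.findall(sample)

print(ascii_hits)    # []
print(unicode_hits)  # ['5', '9']
```

The ASCII character class does not see ż or é as letters, so it finds nothing here, while [^\W\d_] accepts both.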

See the regex demo.

- Wiktor Stribiżew

1

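[^\W\d_] does not quite match any Unicode letter. In fact, it is not based on the Unicode definition of \w or \W. A Unicode-conformant \w includes the characters in \p{gc=Mark}, whereas the re module puts them in \W. Compare this with the regex module, which has a more Unicode-conformant implementation of \w and \W. The Python documentation rarely points out where it differs from Unicode. - Andj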

@Andj Please see "Match any unicode letter?" - Wiktor Stribiżew

1

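@wiktor_stribiżew, consider that the examples you provided do not use anything matched by \p{gc=Mark}. Compile the pattern pattern = re.compile(r'[^\w]', re.U), then try re.sub(pattern, "", 'Stribiżew'), and then try re.sub(pattern, "", unicodedata.normalize("NFD", 'Stribiżew')). The first gives you Stribiżew; the second gives you Stribizew, with the combining character removed. - Andj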

1

A Unicode-compliant \w implementation would match U+0307, but the re module does not. - Andj
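
The experiment described in these comments can be reproduced with the sketch below. The variable names are illustrative, and the last two lines are an added assumption about how the same behaviour affects the lookbehind from the accepted answer when the input is in NFD form.

```python
import re
import unicodedata

non_word = re.compile(r'[^\w]')            # re.U is the default for str patterns in Python 3

name = 'Stribi\u017cew'                    # precomposed ż (U+017C)
decomposed_name = unicodedata.normalize('NFD', name)   # 'z' followed by U+0307

kept_composed = non_word.sub('', name)
kept_decomposed = non_word.sub('', decomposed_name)

print(kept_composed == name)   # True: the precomposed letter counts as \w
print(kept_decomposed)         # Stribizew: U+0307 was treated as \W and removed

# Same effect on the letter lookbehind: a digit right after a combining mark is missed.
letter_then_digit = re.compile(r'(?<=[^\W\d_])\d+')
hits_composed = letter_then_digit.findall('\u017c5')     # ['5']
hits_decomposed = letter_then_digit.findall('z\u03075')  # []
```

If the input may contain decomposed text, normalizing it to NFC with unicodedata.normalize before matching is one way to sidestep this, at least for letters that have a precomposed form.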