How to extract numbers from a string in Python?

Question

699

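I would like to extract all the numbers contained in a string. Which is better suited for the purpose, regular expressions or the isdigit() method?

Example: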

line = "hello 12 hi 89"

Result:

[12, 89]

- pablouche

4

Unfortunately, the sample input data was so simplistic that it invited naive solutions. In common cases, an input string would contain more interesting characters adjacent to the digits. A slightly more challenging input: '''gimme digits from "12", 34, '56', -789.''' - MarkHu

20 Answers

708

If you only want to extract positive integers, try the following:

>>> txt = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in txt.split() if s.isdigit()]
[23, 11, 2]

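I would argue that this is better than the regex approach because you don't need another module, and it is more readable because you don't need to parse (and learn) the regex mini-language.

This will not recognize floats, negative integers, or integers in hexadecimal format. If you can't accept these limitations, jmnas's answer below will do the trick.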
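To make those limitations concrete, here is a minimal sketch that wraps the one-liner above in a function and runs it on tokens of each kind. The function name `positive_ints` and the sample string are made up for illustration.

```python
def positive_ints(text):
    """Return whitespace-separated tokens made up only of digits, as ints."""
    return [int(tok) for tok in text.split() if tok.isdigit()]


# "-3", "4.5", "0x1F" and "12," all fail isdigit(), so they are skipped.
sample = "7 -3 4.5 0x1F 12, 008"
result = positive_ints(sample)
print(result)  # [7, 8]
```

Note that a token with trailing punctuation such as "12," is dropped as well, because the comma is not a digit.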

- fmark

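I'm trying to extract numbers from strings that represent coordinates (43°20'30"N), but some of them end in a decimal (43°20'30.025"N). I'm trying to work out a way to pull out every number that sits between non-digit characters while also recognizing that 30.025 is a single number. - MrKingsley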
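One possible way to handle the coordinate case from this comment is a regex with an optional fractional part. This is only a sketch: the pattern and the name `coordinate_parts` are assumptions, not something proposed in the thread, and it ignores the hemisphere letter.

```python
import re

# One or more digits, optionally followed by a dot and more digits.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def coordinate_parts(text):
    """Return every unsigned number in text as a float."""
    return [float(m) for m in NUMBER.findall(text)]


print(coordinate_parts('43°20\'30.025"N'))  # [43.0, 20.0, 30.025]
print(coordinate_parts('43°20\'30"N'))      # [43.0, 20.0, 30.0]
```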

This approach fails to detect 0 if it is the first number in the sequence. - undefined

110

You can extend the regular expression to account for scientific notation too.

import re

# Format is [(<string>, <expected output>), ...]
ss = [("apple-12.34 ba33na fanc-14.23e-2yapple+45e5+67.56E+3",
       ['-12.34', '33', '-14.23e-2', '+45e5', '+67.56E+3']),
      ('hello X42 I\'m a Y-32.35 string Z30',
       ['42', '-32.35', '30']),
      ('he33llo 42 I\'m a 32 string -30', 
       ['33', '42', '32', '-30']),
      ('h3110 23 cat 444.4 rabbit 11 2 dog', 
       ['3110', '23', '444.4', '11', '2']),
      ('hello 12 hi 89', 
       ['12', '89']),
      ('4', 
       ['4']),
      ('I like 74,600 commas not,500', 
       ['74,600', '500']),
      ('I like bad math 1+2=.001', 
       ['1', '+2', '.001'])]

for s, r in ss:
    # Raw string so that \d and \. reach the regex engine unchanged
    rr = re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", s)
    if rr == r:
        print('GOOD')
    else:
        print('WRONG', rr, 'should be', r)

This gives GOOD for every case!

Additionally, you can look at the AWS Glue built-in regex.

- user3002273

101

If you know there will be only one number in the string, e.g. 'hello 12 hi', you can try filter.

For example, for non-negative integers:

In [1]: int(''.join(filter(str.isdigit, '200 grams')))
Out[1]: 200
In [2]: int(''.join(filter(str.isdigit, 'Counters: 55')))
Out[2]: 55
In [3]: int(''.join(filter(str.isdigit, 'more than 23 times')))
Out[3]: 23

But be careful!!!:

In [4]: int(''.join(filter(str.isdigit, '200 grams 5')))
Out[4]: 2005
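One way to guard against that pitfall is to check that the string really contains a single run of digits before converting it. The sketch below swaps filter for re.findall to do so; the helper name `single_int` is hypothetical.

```python
import re


def single_int(text):
    """Return the only run of digits in text as an int, or raise ValueError."""
    runs = re.findall(r"\d+", text)
    if len(runs) != 1:
        raise ValueError(f"expected exactly one number, found {len(runs)}: {runs}")
    return int(runs[0])


print(single_int("200 grams"))  # 200

try:
    single_int("200 grams 5")
except ValueError as err:
    print("rejected:", err)
```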

- dfostic

87

I'm assuming you want floats and not just integers, so I'd do something like this:

l = []
for t in s.split():
    try:
        l.append(float(t))
    except ValueError:
        pass

Note that some of the other solutions posted here don't work with negative numbers:

>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string -30')
['42', '32', '30']

>>> '-3'.isdigit()
False
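The loop above can be packaged as a small function, sketched here under the assumption that numbers are separated from other text by whitespace. The name `extract_floats` is made up for illustration. Because float() does the parsing, negative numbers and scientific notation are accepted, and so are tokens such as "nan" and "inf".

```python
def extract_floats(text):
    """Return every whitespace-separated token that float() can parse."""
    numbers = []
    for token in text.split():
        try:
            numbers.append(float(token))
        except ValueError:
            continue  # not a number, skip it
    return numbers


print(extract_floats("he33llo 42 I'm a 32 string -30 1e3 4.5"))
# [42.0, 32.0, -30.0, 1000.0, 4.5]
```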

- jmnas

37

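To catch different patterns, it is helpful to query with different patterns.

Set up a pattern for each kind of number you are interested in:

Numbers with commas, e.g. 12,300 or 12,300.00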

r'[\d]+[.,\d]+'

Floats, e.g. 0.123 or .123

r'[\d]*[.][\d]+'

Integers, e.g. 123

r'[\d]+'

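Combine them into one pattern with multiple alternatives using the pipe ( `|` ).

(Note: put the complex patterns first. Otherwise the simple patterns will return chunks of the complex match instead of the full match.)

p = r'[\d]+[.,\d]+|[\d]*[.][\d]+|[\d]+'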
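The effect of the ordering can be seen by running the same alternatives in both orders. This is a small sketch with a made-up sample string; `complex_first` is the pattern from this answer and `simple_first` is a reordering introduced only for comparison.

```python
import re

s = "444.4 12,001 7"

complex_first = r"[\d]+[.,\d]+|[\d]*[.][\d]+|[\d]+"
simple_first = r"[\d]+|[\d]*[.][\d]+|[\d]+[.,\d]+"

good = re.findall(complex_first, s)
chopped = re.findall(simple_first, s)

print(good)     # ['444.4', '12,001', '7']
print(chopped)  # ['444', '.4', '12', '001', '7']
```

The regex engine tries alternatives left to right and takes the first one that matches, so the plain integer pattern wins before the longer alternatives get a chance.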

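Below, we confirm that a pattern is present with re.search(), then iterate over the matches with re.finditer(). Finally, we print each match using bracket notation to select the full match from the match object.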

import re

s = 'he33llo 42 I\'m a 32 string 30 444.4 12,001'

if re.search(p, s) is not None:
    for catch in re.finditer(p, s):
        print(catch[0]) # catch is a match object

Returns:

33
42
32
30
444.4
12,001

- jameshollisandrew

This will also accept a number ending with a dot, like "30.". You need something like this: "[\d]+[,\d]*[.]{0,1}[\d]+". - katamayros

What would you add for negative floats? - xPaillant
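One possible way to address both comments is an optional sign in front and a fractional part that is only consumed when digits follow the dot. This is a sketch rather than the answer author's pattern; `SIGNED_NUMBER` and the sample text are assumptions.

```python
import re

# Optional sign, then either digits with optional ",ddd" groups and an
# optional ".digits" fraction, or a bare ".digits" fraction.
SIGNED_NUMBER = re.compile(r"[-+]?(?:\d+(?:,\d{3})*(?:\.\d+)?|\.\d+)")

text = "temps -1.5 and 30. plus 12,001 or +.75"
found = SIGNED_NUMBER.findall(text)
print(found)  # ['-1.5', '30', '12,001', '+.75']
```

A hyphen directly in front of a digit is always read as a minus sign here, so text like "10-5" would yield '10' and '-5'.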

31

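I was looking for a way to strip the formatting mask from strings, specifically Brazilian phone numbers. This post didn't answer that directly, but it inspired me. This is my solution: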

>>> phone_number = '+55(11)8715-9877'
>>> ''.join([n for n in phone_number if n.isdigit()])
'551187159877'

- Sidon

25

# extract numbers from garbage string:
s = '12//n,_@#$%3.14kjlw0xdadfackvj1.6e-19&*ghn334'
newstr = ''.join((ch if ch in '0123456789.-e' else ' ') for ch in s)
listOfNumbers = [float(i) for i in newstr.split()]
print(listOfNumbers)
[12.0, 3.14, 0.0, 1.6e-19, 334.0]

- AndreiS

5

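Welcome to SO and thanks for posting an answer. It's good practice to add some comments alongside the code and explain why it solves the problem, rather than just posting a code snippet. - sebs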

1

If you have an input like s = "Hello, world!", this causes a problem because a lone e cannot be converted to a float. So you must include error handling, e.g. try: float(i) except ValueError: .... - colidyre
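A minimal sketch of the error handling suggested in this comment, applied to the character-filtering approach of the answer. The function name `numbers_from_garbage` and its `allowed` parameter are hypothetical.

```python
def numbers_from_garbage(s, allowed="0123456789.-e"):
    """Blank out disallowed characters, then keep the tokens float() accepts."""
    cleaned = "".join(ch if ch in allowed else " " for ch in s)
    numbers = []
    for token in cleaned.split():
        try:
            numbers.append(float(token))
        except ValueError:
            pass  # e.g. a lone "e", "-" or "." left over from ordinary words
    return numbers


print(numbers_from_garbage("Hello, world!"))  # []
print(numbers_from_garbage("12//n,_@#$%3.14kjlw0xdadfackvj1.6e-19&*ghn334"))
# [12.0, 3.14, 0.0, 1.6e-19, 334.0]
```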

23

For phone numbers, you can simply strip all non-digit characters by matching them with \D in a regex:

import re

phone_number = "(619) 459-3635"
phone_number = re.sub(r"\D", "", phone_number)
print(phone_number)

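The r in r"\D" stands for raw string. Using a raw string keeps Python from interpreting the backslash as the start of a string escape sequence, so the regex engine receives \D unchanged.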
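The difference is easiest to see with \b, which is a backspace character in an ordinary Python string but a word boundary in a regex. The sketch below uses a sample sentence borrowed from the question. For \D itself, Python leaves the unrecognized escape in place (recent versions emit a warning), so the raw prefix matters most for sequences like \b that Python does recognize.

```python
import re

text = "hello 12 hi 89"

# Raw string: the regex engine sees \b (word boundary) and \d (digit).
raw_hits = re.findall(r"\b\d+\b", text)

# Ordinary string: "\b" becomes a backspace character before the regex
# engine ever sees it, so the pattern looks for literal backspaces.
plain_hits = re.findall("\b\\d+\b", text)

print(raw_hits)    # ['12', '89']
print(plain_hits)  # []
print(len("\b"), len(r"\b"))  # 1 2
```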

- Antonin GAVREL

21

To match non-negative numbers with a regex, here is one way:

import re

lines = "hello 12 hi 89"
output = []
# repl_str = re.compile(r'\d+.?\d*')
repl_str = re.compile(r'^\d+$')  # the whole word must consist of digits
for word in lines.split():
    match = repl_str.search(word)
    if match:
        output.append(float(match.group()))
print(output)

With findall:

re.findall(r'\d+', "hello 12 hi 89")

['12', '89']

re.findall(r'\b\d+\b', "hello 12 hi 89 33F AC 777")

['12', '89', '777']

- user1464878


- Vincent Savard · Accepted Answer

I'd use a regular expression:

>>> import re
>>> re.findall(r'\d+', "hello 42 I'm a 32 string 30")
['42', '32', '30']

This would also match 42 from bla42bla. If you only want numbers delimited by word boundaries (space, period, comma), you can use \b:

>>> re.findall(r'\b\d+\b', "he33llo 42 I'm a 32 string 30")
['42', '32', '30']

To end up with a list of numbers instead of a list of strings:

>>> [int(s) for s in re.findall(r'\b\d+\b', "he33llo 42 I'm a 32 string 30")]
[42, 32, 30]

Note: this does not work for negative integers.
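One way to extend the word-boundary pattern to negative integers is to allow an optional minus sign in front of it. This is a sketch under that assumption, not part of the accepted answer; `SIGNED_INT` and `signed_ints` are made-up names.

```python
import re

# Optional minus sign, then a run of digits delimited by word boundaries.
SIGNED_INT = re.compile(r"-?\b\d+\b")


def signed_ints(text):
    """Return the word-delimited integers in text, keeping a leading minus."""
    return [int(m) for m in SIGNED_INT.findall(text)]


print(signed_ints("he33llo 42 I'm a 32 string -30"))  # [42, 32, -30]
print(signed_ints('''gimme digits from "12", 34, '56', -789.'''))  # [12, 34, 56, -789]
```

The pattern treats any hyphen directly before a number as a minus sign, so "10-5" yields 10 and -5. Whether that is acceptable depends on the input.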
