Separating the number and the unit in a string in Python

Question

9

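I have strings containing numbers with their units, e.g. 2GB, 17ft, etc. I would like to separate the number from the unit and create two different strings. Sometimes there is whitespace between them (e.g. 2 GB), and then it is easy to do with split(' ').

When they are joined together (e.g. 2GB), I have to test every character until I find a letter instead of a digit.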

s='17GB'
number=''
unit=''
for c in s:
    if c.isdigit():
        number+=c
    else:
        unit+=c

Is there a better way to do it?

Thanks

- duduklein

1

You may find that your approach is faster than the regex approach, especially for the short strings you are working with. - John La Rooy
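One way to check the claim in this comment is to time both approaches on a short string. The sketch below is only an illustration: the function names split_loop and split_regex are made up for this example, and the timings depend on the machine and Python version, so no numbers are shown.

```python
import re
import timeit

# Precompiled pattern, same shape as the regex answers in this thread
PATTERN = re.compile(r'(\d+)\s*(\w+)')

def split_loop(s):
    """Character-by-character split, as in the question."""
    number = ''
    unit = ''
    for c in s:
        if c.isdigit():
            number += c
        else:
            unit += c
    return number, unit

def split_regex(s):
    """Regex-based split; assumes the string matches the pattern."""
    return PATTERN.match(s).groups()

if __name__ == '__main__':
    # Time 100,000 calls of each function on a short input
    for func in (split_loop, split_regex):
        seconds = timeit.timeit(lambda: func('17GB'), number=100_000)
        print(func.__name__, round(seconds, 4))
```

Both functions return ('17', 'GB') for '17GB'; they differ on inputs with a space, where the loop keeps the space in the unit.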

14 Answers

8

You can use a regular expression to divide the string into groups:

>>> import re
>>> p = re.compile(r'(\d+)\s*(\w+)')
>>> p.match('2GB').groups()
('2', 'GB')
>>> p.match('17 ft').groups()
('17', 'ft')

- Jarret Hardie

2

To match a more general set of numbers, including "6.2" and "3.4e-27", a more complicated regex is needed. Too bad Python does not have a built-in scanf emulator. - Christopher Bruns

5

tokenize can help here (Python 2 session):

>>> import StringIO
>>> import tokenize
>>> s = StringIO.StringIO('27GB')
>>> for token in tokenize.generate_tokens(s.readline):
...   print token
... 
(2, '27', (1, 0), (1, 2), '27GB')
(1, 'GB', (1, 2), (1, 4), '27GB')
(0, '', (2, 0), (2, 0), '')

- Ignacio Vazquez-Abrams

3

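This kind of parser is already built into Pint:

Pint is a Python package to define, operate and manipulate physical quantities: the product of a numerical value and a unit of measurement. It allows arithmetic operations between them and conversions from and to different units.

You can install it with pip install pint.

Then you can parse a string and get the desired value (the "magnitude") and its unit: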

>>> from pint import UnitRegistry
>>> ureg = UnitRegistry()
>>> size = ureg('2GB')
>>> size.m
2
>>> size.u
<Unit('gigabyte')>
>>> size.to('GiB')
<Quantity(1.86264515, 'gibibyte')>
>>> length = ureg('17ft')
>>> length.m
17
>>> length.u
<Unit('foot')>
>>> length.to('cm')
<Quantity(518.16, 'centimeter')>

- Eric Duminil

2

This approach is easier to use than a regular expression. Note: it does not perform as well as the other solutions posted here.

def split_units(value):
    """
    >>> split_units("2GB")
    (2.0, 'GB')
    >>> split_units("17 ft")
    (17.0, 'ft')
    >>> split_units("   3.4e-27 frobnitzem ")
    (3.4e-27, 'frobnitzem')
    >>> split_units("9001")
    (9001.0, '')
    >>> split_units("spam sandwiches")
    (0, 'spam sandwiches')
    >>> split_units("")
    (0, '')
    """
    units = ""
    number = 0
    while value:
        try:
            number = float(value)
            break
        except ValueError:
            units = value[-1:] + units
            value = value[:-1]
    return number, units.strip()

- Logan Evans

2

s='17GB'
for i,c in enumerate(s):
    if not c.isdigit():
        break
else:
    i=len(s)  # no non-digit found: the whole string is the number
number=int(s[:i])
unit=s[i:]

- John La Rooy

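-1 s='17 GB' gives unit=' GB', i.e. the unit has a leading space. You need lstrip to remove it, and then you get the same answer as mine. - pwdyson

Now that I have re-read the question, the whitespace case is handled with split(), not with this code. I tried to retract the -1, but it won't let me. - pwdyson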

2

>>> s="17GB"
>>> ind=list(map(str.isalpha,s)).index(True)  # map() returns an iterator in Python 3
>>> num,suffix=s[:ind],s[ind:]
>>> print(num+":"+suffix)
17:GB

- ghostdog74
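Note that list.index(True) raises ValueError when the string contains no letter at all. A possible variant, sketched here with a hypothetical helper name (split_at_first_letter is not from the answer), uses next() with a default so that unit-less strings return an empty suffix:

```python
def split_at_first_letter(s):
    """Split s at its first alphabetic character."""
    # Index of the first letter, or len(s) when there is none
    ind = next((i for i, c in enumerate(s) if c.isalpha()), len(s))
    return s[:ind], s[ind:]

print(split_at_first_letter("17GB"))   # ('17', 'GB')
print(split_at_first_letter("9001"))   # ('9001', '')
```

Like the original snippet, this keeps any whitespace before the unit in the first part, so a strip() may still be needed.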

2

You should use a regular expression, grouping together what you want to find:

import re
s = "17GB"
match = re.match(r"^([1-9][0-9]*)\s*(GB|MB|KB|B)$", s)
if match:
  print("Number: %d, unit: %s" % (int(match.group(1)), match.group(2)))

Change the regex according to what you want to parse. If you are not familiar with regular expressions, read this excellent tutorial site.

- AndiDog
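As one way of adapting the regex to other inputs, the unit alternation can be built from a list instead of being typed by hand. This is only a sketch: the UNITS tuple and the parse_size name are assumptions made for the example, and the number part keeps the same restriction as the pattern above (a positive integer without leading zeros).

```python
import re

# Hypothetical list of accepted units; edit to match your data
UNITS = ('GB', 'MB', 'KB', 'B')

# re.escape guards against units that contain regex metacharacters
unit_group = '|'.join(re.escape(u) for u in UNITS)
PATTERN = re.compile(r'^([1-9][0-9]*)\s*(%s)$' % unit_group)

def parse_size(s):
    """Return (int, unit) or None when s is not in the expected format."""
    match = PATTERN.match(s)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

print(parse_size('17GB'))   # (17, 'GB')
print(parse_size('17ft'))   # None, 'ft' is not in UNITS
```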

1

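Try the regex pattern below. The first group (the scanf() tokens for matching a number in any form) is taken directly from the Python documentation for the re module.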

import re
SCANF_MEASUREMENT = re.compile(
    r'''(                      # group match like scanf() token %e, %E, %f, %g
    [-+]?                      # +/- or nothing for positive
    (\d+(\.\d*)?|\.\d+)        # match numbers: 1, 1., 1.1, .1
    ([eE][-+]?\d+)?            # scientific notation: e(+/-)2 (*10^2)
    )
    (\s*)                      # separator: white space or nothing
    (                          # unit of measure: like GB. also works for no units
    \S*)''',    re.VERBOSE)
'''
:var SCANF_MEASUREMENT:
    regular expression object that will match a measurement

    **measurement** is the value of a quantity of something. most complicated example::

        -666.6e-100 units
'''

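def parse_measurement(value_sep_units):
    measurement = SCANF_MEASUREMENT.match(value_sep_units)
    # match() returns None (it does not raise) when the string has no leading number
    if measurement is None:
        raise ValueError("doesn't start with a number: %r" % value_sep_units)
    value = float(measurement[1])  # group 1: the whole number
    units = measurement[6]         # group 6: the unit, possibly empty

    return value, units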

- steodatus

0

For this task, I would definitely use a regular expression:

import re
there = re.compile(r'\s*(\d+)\s*(\S+)')
thematch = there.match(s)
if thematch:
  number, unit = thematch.groups()
else:
  raise ValueError('String %r not in the expected format' % s)

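In the RE pattern, \s means "whitespace", \d means "digit", and \S means non-whitespace; * means "0 or more of the preceding", + means "1 or more of the preceding", and parentheses enclose "capturing groups", which are returned by calling groups() on the match object. (thematch is None if the given string does not correspond to the pattern: optional whitespace, then one or more digits, then optional whitespace, then one or more non-whitespace characters.)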

- Alex Martelli
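To make the pattern's behaviour concrete, here is a small sketch that wraps it in a function and runs it on a few inputs. The function name split_number_unit and the sample strings are made up for illustration. Because the unit group requires at least one character, a string of digits only is still matched by giving the last digit to the unit group, which may not be what you want.

```python
import re

pattern = re.compile(r'\s*(\d+)\s*(\S+)')

def split_number_unit(s):
    """Return (number, unit) as strings, or raise ValueError on mismatch."""
    m = pattern.match(s)
    if m is None:
        raise ValueError('String %r not in the expected format' % s)
    return m.groups()

print(split_number_unit('17GB'))    # ('17', 'GB')
print(split_number_unit('  2 GB'))  # ('2', 'GB')
print(split_number_unit('9001'))    # ('900', '1'): \d+ backtracks so \S+ can match
print(split_number_unit('3.5GB'))   # ('3', '.5GB'): decimals are not handled
```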

- pwdyson · Accepted Answer

You can break out of the loop when you find the first non-digit character:

for i,c in enumerate(s):
    if not c.isdigit():
        break
else:
    i = len(s)  # no non-digit found: the whole string is the number
number = s[:i]
unit = s[i:].lstrip()

If you have negatives and decimals:

numeric = '0123456789-.'
for i,c in enumerate(s):
    if c not in numeric:
        break
else:
    i = len(s)  # every character is numeric, so there is no unit
number = s[:i]
unit = s[i:].lstrip()
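The character set '0123456789-.' also accepts strings that are not valid numbers, such as '1-2' or '..'. A possible way to package the loop, sketched here under the assumption that a float result is wanted (the function name split_number_unit is made up for this example), is to let float() validate the numeric part. Scientific notation such as 3.4e-27 is not handled, because 'e' is not in the set.

```python
def split_number_unit(s, numeric='0123456789-.'):
    """Split s into (float, unit) at the first character not in `numeric`."""
    s = s.strip()
    for i, c in enumerate(s):
        if c not in numeric:
            break
    else:
        i = len(s)  # every character is numeric, so there is no unit
    number, unit = s[:i], s[i:].lstrip()
    # float() raises ValueError for malformed numbers such as '1-2' or ''
    return float(number), unit

print(split_number_unit('17GB'))     # (17.0, 'GB')
print(split_number_unit('-3.5 ft'))  # (-3.5, 'ft')
print(split_number_unit('9001'))     # (9001.0, '')
```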