Is there a way to convert number words to integers?

104

I need to convert one into 1, two into 2, and so on.

Is there a way to do this with a library, a class, or anything else?


3
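See also: https://dev59.com/LnVD5IYBdhLWcg3wJIIK - tzot
Maybe this will help: http://pastebin.com/WwFCjYtt - alvas
5
If anyone is still looking for an answer, I took inspiration from all the answers below and created a Python package: https://github.com/careless25/text2digits - stackErr
1
I used the examples below to develop and extend this approach for Spanish, for future reference: https://github.com/elbaulp/text2digits_es - Alejandro Alcalde
1
For anyone not looking for a Python solution, here is the parallel C# question: Convert words (string) to Int, and here is the Java one: Converting Words to Numbers in Java - Tomerikoo
19 Answers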

142
Most of this code sets up the numwords dict, which is only done on the first call.
def text2int(textnum, numwords={}):
    # The mutable default is deliberate here: it caches the table across calls.
    if not numwords:
        units = [
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen",
        ]

        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        # Each word maps to (scale, increment).
        numwords["and"] = (1, 0)
        for idx, word in enumerate(units):
            numwords[word] = (1, idx)
        for idx, word in enumerate(tens):
            numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales):
            numwords[word] = (10 ** (idx * 3 or 2), 0)

    current = result = 0
    for word in textnum.split():
        if word not in numwords:
            raise Exception("Illegal word: " + word)

        scale, increment = numwords[word]
        current = current * scale + increment
        if scale > 100:
            # A thousand-or-larger scale closes the current group.
            result += current
            current = 0

    return result + current

print(text2int("seven billion one hundred million thirty one thousand three hundred thirty seven"))
# 7100031337

1
FYI, this will not work for dates. Try text2int("nineteen ninety six"), which returns 115. - Nick Ruiz
28
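The correct way to write 1996 in English words is "one thousand nine hundred ninety six". If you want to support years, you will need different code. - recursive
1
It breaks for "hundred and six". Try print(text2int("hundred and six")), and also print(text2int("thousand")). - Harish Kayarohanam
An important note: it only works for lowercase input. Make sure you pass a lowercase sentence or lowercase the word variable. - MikeL
2
People's expectations vary. Personally, I would expect it not to be called on input that is not a valid number. - recursive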
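
On the point raised in the comments about years: a reading such as "nineteen ninety six" is two two-digit groups rather than one number. Below is a minimal sketch of that idea. It assumes the first word is the century and the remaining one or two words are the last two digits. The names year_to_int and two_digit are made up for illustration and are not part of the answer above.

```python
SMALL = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {word: value * 10 for value, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def two_digit(words):
    """Sum one or two number words, e.g. ['ninety', 'six'] -> 96."""
    total = 0
    for word in words:
        if word in SMALL:
            total += SMALL[word]
        elif word in TENS:
            total += TENS[word]
        else:
            raise ValueError("Illegal word: " + word)
    return total


def year_to_int(text):
    """Read 'nineteen ninety six' as 19 and 96, giving 1996."""
    words = text.lower().replace("-", " ").split()
    if not 2 <= len(words) <= 3:
        raise ValueError("expected a century word plus one or two more words")
    # Assumption: the century is always the first word on its own.
    return two_digit(words[:1]) * 100 + two_digit(words[1:])


print(year_to_int("nineteen ninety six"))  # 1996
print(year_to_int("twenty ten"))           # 2010
```

This does not validate word order and does not cover forms like "two thousand five", so treat it as a starting point only.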

43
I have just released a Python module to PyPI called word2number for exactly this purpose: converting number words to numbers. https://github.com/akshaynagpal/w2n
Install it using:
pip install word2number

Make sure your pip is updated to the latest version.
Usage:
from word2number import w2n

print(w2n.word_to_num("two million three thousand nine hundred and eighty four"))
2003984

1
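Tried your package. I would suggest handling strings like "1 million" or "1M". w2n.word_to_num("1 million") throws an error. - Ray
1
@Ray Thanks for trying it out. Could you please raise an issue at https://github.com/akshaynagpal/w2n/issues? You can also contribute if you want to. Otherwise, I will definitely look into this in the next release. Thanks again! - akshaynagpal
16
Robert, open source software is all about people improving it collaboratively. I wanted a library and saw that other people wanted one too, so I made it. It may not be ready for production-level systems or conform to textbook buzzwords, but it works for this purpose. Also, it would be great if you could submit a PR so that it can be improved further for all users. - akshaynagpal
Can it do calculations? For example: 19 % 57? Or any other operator, such as +, -, * and /. - S.Jackson
1
Not as of now, @S.Jackson. - akshaynagpal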

18

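My input comes from a speech-to-text conversion, so it needs somewhat different handling, and the solution is not always to sum the numbers. For example, "my zipcode is one two three four five" should not convert to "my zipcode is 15".

I took Andrew's answer and tweaked it to handle a few other cases that people pointed out as errors, and also added support for examples like the zipcode one above. Some basic test cases are shown below, but I am sure there is still room for improvement.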

def is_number(x):
    if isinstance(x, str):
        x = x.replace(',', '')
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

def text2int (textnum, numwords={}):
    units = [
        'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
        'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
        'sixteen', 'seventeen', 'eighteen', 'nineteen',
    ]
    tens = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety']
    scales = ['hundred', 'thousand', 'million', 'billion', 'trillion']
    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    if not numwords:
        numwords['and'] = (1, 0)
        for idx, word in enumerate(units): numwords[word] = (1, idx)
        for idx, word in enumerate(tens): numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    textnum = textnum.replace('-', ' ')

    current = result = 0
    curstring = ''
    onnumber = False
    lastunit = False
    lastscale = False

    def is_numword(x):
        # Check the argument itself rather than the enclosing loop variable.
        return is_number(x) or x in numwords

    def from_numword(x):
        if is_number(x):
            scale = 0
            increment = int(x.replace(',', ''))
            return scale, increment
        return numwords[x]

    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0
            onnumber = True
            lastunit = False
            lastscale = False
        else:
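            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    candidate = "%s%s" % (word[:-len(ending)], replacement)
                    # Only strip the suffix when the stem is a number, so that
                    # ordinary words such as "with" are left untouched.
                    if is_number(candidate) or candidate in numwords:
                        word = candidate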

            if (not is_numword(word)) or (word == 'and' and not lastscale):
                if onnumber:
                    # Flush the current number we are building
                    curstring += repr(result + current) + " "
                curstring += word + " "
                result = current = 0
                onnumber = False
                lastunit = False
                lastscale = False
            else:
                scale, increment = from_numword(word)
                onnumber = True

                if lastunit and (word not in scales):
                    # Assume this is part of a string of individual numbers to
                    # be flushed, such as a zipcode "one two three four five"
                    curstring += repr(result + current)
                    result = current = 0

                if scale > 1:
                    current = max(1, current)

                current = current * scale + increment
                if scale > 100:
                    result += current
                    current = 0

                lastscale = False
                lastunit = False
                if word in scales:
                    lastscale = True
                elif word in units:
                    lastunit = True

    if onnumber:
        curstring += repr(result + current)

    return curstring

Some tests...

one two three -> 123
three forty five -> 345
three and forty five -> 3 and 45
three hundred and forty five -> 345
three hundred -> 300
twenty five hundred -> 2500
three thousand and six -> 3006
three thousand six -> 3006
nineteenth -> 19
twentieth -> 20
first -> 1
my zip is one two three four five -> my zip is 12345
nineteen ninety six -> 1996
fifty-seventh -> 57
one million -> 1000000
first hundred -> 100
I will buy the first thousand -> I will buy the 1000  # probably should leave ordinal in the string
thousand -> 1000
hundred and six -> 106
1 million -> 1000000

2
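I took your answer and fixed some bugs. Added support for "twenty ten" -> 2010 and for all tens in general. You can find it here: https://github.com/careless25/text2digits - stackErr
Can it do calculations? For example: 19 % 57? Or any other operator, such as +, -, * and /. - S.Jackson
@S.Jackson It does not do calculations. If your text snippet is a valid equation in Python, I suppose you could use this to first convert the words to integers and then eval the result (assuming you are familiar with eval and comfortable using it). So "ten + five" becomes "10 + 5", and then eval("10 + 5") gives 15. This only handles the simplest cases, though: no floats, no parentheses to control order, no support for spoken plus/minus from speech-to-text, and so on. - totalhack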
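
For the eval step mentioned in the last comment, a more restrictive alternative to a bare eval is to walk the parsed expression and allow only arithmetic. The sketch below assumes the number words have already been converted to digits (for example "ten + five" to "10 + 5") and requires Python 3.8 or later. The function name safe_arithmetic is hypothetical, and this is not something the answer above provides.

```python
import ast
import operator

# Binary operators this sketch is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}


def safe_arithmetic(expr):
    """Evaluate +, -, *, / and % on plain numbers; reject everything else."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expr, mode="eval"))


print(safe_arithmetic("10 + 5"))   # 15
print(safe_arithmetic("19 % 57"))  # 19
```

Names, function calls and attribute access all raise ValueError here, which is the main difference from calling eval directly.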

16

If anyone is interested, I hacked up a version that keeps the rest of the string intact (though it may have bugs, as I have not tested it much).

def text2int (textnum, numwords={}):
    if not numwords:
        units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
        ]

        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        numwords["and"] = (1, 0)
        for idx, word in enumerate(units):  numwords[word] = (1, idx)
        for idx, word in enumerate(tens):       numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    textnum = textnum.replace('-', ' ')

    current = result = 0
    curstring = ""
    onnumber = False
    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0
            onnumber = True
        else:
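            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    candidate = "%s%s" % (word[:-len(ending)], replacement)
                    # Only strip the suffix when the stem is a number word, so
                    # ordinary words such as "with" are left untouched.
                    if candidate in numwords:
                        word = candidate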

            if word not in numwords:
                if onnumber:
                    curstring += repr(result + current) + " "
                curstring += word + " "
                result = current = 0
                onnumber = False
            else:
                scale, increment = numwords[word]

                current = current * scale + increment
                if scale > 100:
                    result += current
                    current = 0
                onnumber = True

    if onnumber:
        curstring += repr(result + current)

    return curstring

Example:

 >>> print(text2int("I want fifty five hot dogs for two hundred dollars."))
 I want 55 hot dogs for 200 dollars.

There could be issues if you have, say, "$200". But this was really rough.


7
I took this and other code snippets from here and elsewhere and made them into a Python library: https://github.com/careless25/text2digits - stackErr

12

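I needed to handle a few extra parsing cases, such as ordinal words ("first", "second"), hyphenated words ("one-hundred"), and hyphenated ordinal words ("fifty-seventh"), so I added a few lines: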

def text2int(textnum, numwords={}):
    if not numwords:
        units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
        ]

        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        numwords["and"] = (1, 0)
        for idx, word in enumerate(units):  numwords[word] = (1, idx)
        for idx, word in enumerate(tens):       numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    textnum = textnum.replace('-', ' ')

    current = result = 0
    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if word not in numwords:
                raise Exception("Illegal word: " + word)

            scale, increment = numwords[word]
        
        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current

2
Note: this returns zero for "hundredth", "thousandth", etc. Use "one hundredth" to get 100! - rohithpr
1
Mutable default arguments are an anti-pattern. - Neil

6

Here is the trivial-case approach:

>>> number = {'one':1,
...           'two':2,
...           'three':3,}
>>> 
>>> number['two']
2

Or are you looking for something that can handle "twelve thousand one hundred seventy two"?


1
This helped me, thanks. It is a useful answer when the text comes from something like a survey with only a limited number of text options. - yeliabsalohcin

6
def parse_int(string):
    ONES = {'zero': 0,
            'one': 1,
            'two': 2,
            'three': 3,
            'four': 4,
            'five': 5,
            'six': 6,
            'seven': 7,
            'eight': 8,
            'nine': 9,
            'ten': 10,
            'eleven': 11,
            'twelve': 12,
            'thirteen': 13,
            'fourteen': 14,
            'fifteen': 15,
            'sixteen': 16,
            'seventeen': 17,
            'eighteen': 18,
            'nineteen': 19,
            'twenty': 20,
            'thirty': 30,
            'forty': 40,
            'fifty': 50,
            'sixty': 60,
            'seventy': 70,
            'eighty': 80,
            'ninety': 90,
            }

    numbers = []
    for token in string.replace('-', ' ').split(' '):
        if token in ONES:
            numbers.append(ONES[token])
        elif token == 'hundred':
            numbers[-1] *= 100
        elif token == 'thousand':
            numbers = [x * 1000 for x in numbers]
        elif token == 'million':
            numbers = [x * 1000000 for x in numbers]
    return sum(numbers)

Tested with 700 random numbers in the range 1 to 1,000,000; it works well.


This does not work for numbers in the hundreds of millions. - Eric
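
The function above goes wrong once a second scale word appears: for "one million two hundred thousand", the "thousand" step multiplies the already-scaled million as well. A possible variation is sketched below. It keeps a running group below 1000 and folds it into the total at each scale word. The name parse_int_grouped, the "billion" entry and the handling of "and" are additions made for this illustration, not part of the answer.

```python
SMALL = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
SMALL.update({word: value * 10 for value, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)})
SCALES = {"thousand": 10 ** 3, "million": 10 ** 6, "billion": 10 ** 9}


def parse_int_grouped(string):
    total = 0  # value of all completed thousand/million/billion groups
    group = 0  # value of the group currently being read (below 1000)
    for token in string.lower().replace("-", " ").split():
        if token == "and":
            continue
        if token in SMALL:
            group += SMALL[token]
        elif token == "hundred":
            # A bare "hundred" is read as "one hundred".
            group = max(group, 1) * 100
        elif token in SCALES:
            # Close the current group; earlier groups are not touched.
            total += max(group, 1) * SCALES[token]
            group = 0
        else:
            raise ValueError("Illegal word: " + token)
    return total + group


print(parse_int_grouped("one million two hundred thousand"))  # 1200000
```

Because each scale word only applies to the group read since the previous scale word, larger numbers no longer compound.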

4

Use the Python package WordToDigits:

pip install wordtodigits

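It finds numbers written out as words in a sentence and converts them to the proper numeric format. It also handles the decimal part, if present. The number words can appear anywhere in the passage.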

3

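If you have a limited set of numbers to parse, you can hard-code them into a dictionary.

For slightly more complex cases, you will probably want to generate this dictionary automatically, based on the relatively simple grammar of number words. Something along these lines (generalized, of course...)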

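singleDigitsDict = dict(enumerate("one two three four five six seven eight nine".split(), start=1))
myDict = {30: "thirty"}
for i in range(1, 10):
    myDict[30 + i] = "thirty-" + singleDigitsDict[i]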
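
Generalizing the loop above to every tens word, and flipping the mapping so that it goes from word to number, might look like the following sketch. The name build_small_numbers and the hyphenated spelling of compound words are choices made for this example.

```python
UNITS = ("zero one two three four five six seven eight nine ten eleven "
         "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
         "nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()


def build_small_numbers():
    """Return a dict mapping 'zero' .. 'ninety-nine' to 0 .. 99."""
    table = {word: value for value, word in enumerate(UNITS)}
    for tens_value, tens_word in enumerate(TENS, start=2):
        table[tens_word] = tens_value * 10
        for unit in range(1, 10):
            table[tens_word + "-" + UNITS[unit]] = tens_value * 10 + unit
    return table


numbers = build_small_numbers()
print(numbers["thirty-seven"])  # 37
print(len(numbers))             # 100
```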

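If you need something more extensive, then it looks like you will need natural language processing tools. This article might be a good starting point.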


1

There is a Ruby gem by Marc Burns that does this. I recently forked it to add support for years. You can call Ruby code from Python.

  require 'numbers_in_words'
  require 'numbers_in_words/duck_punch'

  nums = ["fifteen sixteen", "eighty five sixteen",  "nineteen ninety six",
          "one hundred and seventy nine", "thirteen hundred", "nine thousand two hundred and ninety seven"]
  nums.each {|n| p n; p n.in_numbers}

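Results:
"fifteen sixteen" 1516
"eighty five sixteen" 8516
"nineteen ninety six" 1996
"one hundred and seventy nine" 179
"thirteen hundred" 1300
"nine thousand two hundred and ninety seven" 9297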


1
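Agreed, but until it is ported, calling Ruby code is better than nothing. - dimid
I think "fifteen sixteen" is wrong. - PascalVKooten
@yekta Right, I think recursive's answer is good within the scope of an SO answer. However, the gem provides a complete package with tests and other features. Anyway, I think both have their place. - dimid
There is a Python package named inflect that handles ordinals/cardinals and number-to-word conversion. - yekta
@yekta, then you should suggest it in a separate answer. - dimid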

Content provided by Stack Overflow.