Is there a way to convert number words to integers?

Question

Is there a way to convert number words to integers?

python string text integer numbers

104

I need to convert one into 1, two into 2, and so on.

Is there a way to do this with a library, a class, or anything else?

- Llyod

3

See also: https://dev59.com/LnVD5IYBdhLWcg3wJIIK - tzot

Maybe this will help: http://pastebin.com/WwFCjYtt - alvas

5

If anyone is still looking for an answer, I took inspiration from all the answers below and created a Python package: https://github.com/careless25/text2digits - stackErr

1

I used the examples below to develop and extend this process, and for future reference I ported it to Spanish: https://github.com/elbaulp/text2digits_es - Alejandro Alcalde

1

For anyone not looking for a Python solution, there is a parallel C# question, "Convert words (string) to Int", and a Java one, "Converting Words to Numbers in Java". - Tomerikoo

19 Answers

43

I just released a Python module to PyPI called word2number, which is meant to convert number words into numbers. https://github.com/akshaynagpal/w2n

Install it with:

pip install word2number

Make sure your pip is updated to the latest version.

Usage:

from word2number import w2n

print(w2n.word_to_num("two million three thousand nine hundred and eighty four"))
2003984

- akshaynagpal

1

I tried your package. I would suggest handling strings like "1 million" or "1M". w2n.word_to_num("1 million") throws an error. - Ray

1

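@Ray Thanks for trying it out. Could you raise an issue at https://github.com/akshaynagpal/w2n/issues? You are also welcome to contribute if you like. Otherwise, I will definitely look into this for the next release. Thanks again! - akshaynagpal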

16

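Robert, open source software is about people improving it collaboratively. I wanted a library and saw that other people wanted one too, so I made it. It may not yet be ready for production-grade systems or conform to textbook buzzwords, but it works for this purpose. It would also be great if you could submit a PR so that it improves further for all users. - akshaynagpal

Can it do calculations? For example: 19%57? Or other operators such as +, 6, * and /. - S.Jackson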

1

Not as of now, @S.Jackson. - akshaynagpal

18

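My input comes from speech-to-text conversion, so it needs somewhat different handling, and the solution is not always to sum the numbers. For example, "my zipcode is one two three four five" should not be converted to "my zipcode is 15".

I took Andrew's answer and tweaked it to handle a few of the error cases other people pointed out, and also added support for examples like the zipcode one above. Some basic test cases are shown below, but I am sure there is still room for improvement.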

def is_number(x):
    # Accept ints, floats, and numeric strings such as "1,000".
    if isinstance(x, str):
        x = x.replace(',', '')
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

def text2int (textnum, numwords={}):
    units = [
        'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
        'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
        'sixteen', 'seventeen', 'eighteen', 'nineteen',
    ]
    tens = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety']
    scales = ['hundred', 'thousand', 'million', 'billion', 'trillion']
    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    if not numwords:
        numwords['and'] = (1, 0)
        for idx, word in enumerate(units): numwords[word] = (1, idx)
        for idx, word in enumerate(tens): numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    textnum = textnum.replace('-', ' ')

    current = result = 0
    curstring = ''
    onnumber = False
    lastunit = False
    lastscale = False

    def is_numword(x):
        # Use the argument rather than the enclosing loop variable `word`.
        if is_number(x):
            return True
        if x in numwords:
            return True
        return False

    def from_numword(x):
        if is_number(x):
            scale = 0
            increment = int(x.replace(',', ''))
            return scale, increment
        return numwords[x]

    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0
            onnumber = True
            lastunit = False
            lastscale = False
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if (not is_numword(word)) or (word == 'and' and not lastscale):
                if onnumber:
                    # Flush the current number we are building
                    curstring += repr(result + current) + " "
                curstring += word + " "
                result = current = 0
                onnumber = False
                lastunit = False
                lastscale = False
            else:
                scale, increment = from_numword(word)
                onnumber = True

                if lastunit and (word not in scales):
                    # Assume this is part of a string of individual numbers to
                    # be flushed, such as a zipcode "one two three four five"
                    curstring += repr(result + current)
                    result = current = 0

                if scale > 1:
                    current = max(1, current)

                current = current * scale + increment
                if scale > 100:
                    result += current
                    current = 0

                lastscale = False
                lastunit = False
                if word in scales:
                    lastscale = True
                elif word in units:
                    lastunit = True

    if onnumber:
        curstring += repr(result + current)

    return curstring

Some tests...

one two three -> 123
three forty five -> 345
three and forty five -> 3 and 45
three hundred and forty five -> 345
three hundred -> 300
twenty five hundred -> 2500
three thousand and six -> 3006
three thousand six -> 3006
nineteenth -> 19
twentieth -> 20
first -> 1
my zip is one two three four five -> my zip is 12345
nineteen ninety six -> 1996
fifty-seventh -> 57
one million -> 1000000
first hundred -> 100
I will buy the first thousand -> I will buy the 1000  # probably should leave ordinal in the string
thousand -> 1000
hundred and six -> 106
1 million -> 1000000

- totalhack

2

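I took your answer and fixed a few bugs. Added support for "twenty ten" -> 2010 and for all the tens. You can find it here: https://github.com/careless25/text2digits - stackErr

Can it do calculations? For example: 19%57? Or other operators such as +, 6, * and /. - S.Jackson

@S.Jackson It does not do calculations. If your text snippet is a valid equation in Python, I suppose you could use this to first convert the words to integers and then eval the result (assuming you are familiar and comfortable with that). So "ten + five" becomes "10 + 5", and then eval("10 + 5") gives 15. That would only handle the simplest cases, though: no floats, no parentheses to control order, no support for spoken plus/minus from speech-to-text, and so on. - totalhack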
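
The convert-then-evaluate idea in this comment can be sketched without calling eval on arbitrary text. This is only a minimal sketch under stated assumptions: WORDS is a deliberately tiny, hypothetical word table, and words_to_expr and safe_eval are hypothetical helpers that accept only +, -, *, / and % on integer literals. None of this is part of any answer or library mentioned here.

```python
import ast
import operator

# Hypothetical, deliberately tiny word table for illustration only.
WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Only these binary operators are allowed.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
       ast.Div: operator.truediv, ast.Mod: operator.mod}


def words_to_expr(text):
    """Replace known number words with digits, leaving other tokens alone."""
    return " ".join(str(WORDS[tok]) if tok in WORDS else tok
                    for tok in text.lower().split())


def safe_eval(expr):
    """Evaluate an arithmetic expression made of integers and OPS only."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        # Names, calls, attribute access, etc. are rejected.
        raise ValueError("unsupported expression: %r" % expr)
    return walk(ast.parse(expr, mode="eval"))


print(words_to_expr("ten + five"))             # 10 + 5
print(safe_eval(words_to_expr("ten + five")))  # 15
```

Walking the parsed tree instead of calling eval means anything other than integer arithmetic raises ValueError, which matters when the text comes from users or speech-to-text.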

16

If anyone is interested, I hacked up a version that keeps the rest of the string intact (though it may have bugs; I have not tested it much).

def text2int (textnum, numwords={}):
    if not numwords:
        units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
        ]

        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        numwords["and"] = (1, 0)
        for idx, word in enumerate(units):  numwords[word] = (1, idx)
        for idx, word in enumerate(tens):       numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    textnum = textnum.replace('-', ' ')

    current = result = 0
    curstring = ""
    onnumber = False
    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0
            onnumber = True
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if word not in numwords:
                if onnumber:
                    curstring += repr(result + current) + " "
                curstring += word + " "
                result = current = 0
                onnumber = False
            else:
                scale, increment = numwords[word]

                current = current * scale + increment
                if scale > 100:
                    result += current
                    current = 0
                onnumber = True

    if onnumber:
        curstring += repr(result + current)

    return curstring

Example:

 >>> text2int("I want fifty five hot dogs for two hundred dollars.")
 I want 55 hot dogs for 200 dollars.

There could be issues if you have, say, "$200". But this is only a rough version.

- Andrew

7

I took this and other code snippets from here and elsewhere and made them into a Python library: https://github.com/careless25/text2digits. - stackErr

12

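I needed to handle a couple of extra parsing cases, such as ordinal words ("first", "second"), hyphenated words ("one-hundred"), and hyphenated ordinal words ("fifty-seventh"), so I added a couple of lines: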

def text2int(textnum, numwords={}):
    if not numwords:
        units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
        ]

        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        numwords["and"] = (1, 0)
        for idx, word in enumerate(units):  numwords[word] = (1, idx)
        for idx, word in enumerate(tens):       numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    textnum = textnum.replace('-', ' ')

    current = result = 0
    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if word not in numwords:
                raise Exception("Illegal word: " + word)

            scale, increment = numwords[word]

        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current

- Jarret Hardie

2

Note: this returns zero for "hundredth", "thousandth", etc. Use "one hundredth" to get 100! - rohithpr

1

Mutable default arguments are an anti-pattern. - Neil
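
One way to address this comment while still building the table only once is to move it out of the function signature. The following is a minimal sketch, assuming the same units/tens/scales lists used in the answers; the names _build_numwords and NUMWORDS are hypothetical, and ordinal handling is left out for brevity.

```python
def _build_numwords():
    """Build the word -> (scale, increment) table once, at import time."""
    units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
    ]
    tens = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]
    scales = ["hundred", "thousand", "million", "billion", "trillion"]

    table = {"and": (1, 0)}
    for idx, word in enumerate(units):
        table[word] = (1, idx)
    for idx, word in enumerate(tens):
        if word:  # skip the two empty placeholders
            table[word] = (1, idx * 10)
    for idx, word in enumerate(scales):
        table[word] = (10 ** (idx * 3 or 2), 0)
    return table


# Module-level constant instead of a mutable default argument.
NUMWORDS = _build_numwords()


def text2int(textnum):
    """Convert a string of English number words to an integer."""
    current = result = 0
    for word in textnum.replace("-", " ").split():
        if word not in NUMWORDS:
            raise ValueError("Illegal word: " + word)
        scale, increment = NUMWORDS[word]
        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0
    return result + current


print(text2int("three hundred and forty five"))  # 345
```

The parsing loop is unchanged from the answers; only the place where the table lives differs, so callers can no longer pass in (or accidentally share) a half-filled dictionary.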

6

Here is the trivial-case approach:

>>> number = {'one':1,
...           'two':2,
...           'three':3,}
>>> 
>>> number['two']
2

Or are you looking for something that can handle "twelve thousand, one hundred seventy-two"?

- Jeff Bauer

1

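This helped me, thanks. It is a useful answer when the text comes from something like a questionnaire with only a limited number of text options. - yeliabsalohcin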

6

def parse_int(string):
    ONES = {'zero': 0,
            'one': 1,
            'two': 2,
            'three': 3,
            'four': 4,
            'five': 5,
            'six': 6,
            'seven': 7,
            'eight': 8,
            'nine': 9,
            'ten': 10,
            'eleven': 11,
            'twelve': 12,
            'thirteen': 13,
            'fourteen': 14,
            'fifteen': 15,
            'sixteen': 16,
            'seventeen': 17,
            'eighteen': 18,
            'nineteen': 19,
            'twenty': 20,
            'thirty': 30,
            'forty': 40,
            'fifty': 50,
            'sixty': 60,
            'seventy': 70,
            'eighty': 80,
            'ninety': 90,
              }

    numbers = []
    for token in string.replace('-', ' ').split(' '):
        if token in ONES:
            numbers.append(ONES[token])
        elif token == 'hundred':
            numbers[-1] *= 100
        elif token == 'thousand':
            numbers = [x * 1000 for x in numbers]
        elif token == 'million':
            numbers = [x * 1000000 for x in numbers]
    return sum(numbers)

Tested with 700 random numbers in the range 1 to one million; it works well.

- hassan27sn

This does not work for numbers in the hundreds of millions. - Eric

4

Make use of the Python package: WordToDigits

pip install wordtodigits

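It can find numbers written in word form in a sentence and convert them to the proper numeric format. It also takes care of the decimal part, if present. The word representation of numbers can appear anywhere in the passage.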

- Abhishek Rawat

3

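If you have a limited set of numbers you would like to parse, you can hard-code them into a dictionary.

For slightly more complex cases, you will probably want to generate this dictionary automatically, based on the relatively simple numbers grammar. Something along the lines of this (generalized, of course...)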

for i in range(10):
   myDict[30 + i] = "thirty-" + singleDigitsDict[i]
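
The generation idea above can be sketched a bit more generally. This is a minimal sketch, assuming hyphenated spellings such as "thirty-four" and mapping words to integers (the direction the question asks for); the names UNITS, TENS, and build_word_table are hypothetical.

```python
UNITS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
        60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety"}


def build_word_table():
    """Return a dict mapping English words for 0..99 to their integer values."""
    table = {word: value for value, word in enumerate(UNITS)}
    for base, tens_word in TENS.items():
        table[tens_word] = base
        for digit in range(1, 10):
            table[tens_word + "-" + UNITS[digit]] = base + digit
    return table


WORD_TABLE = build_word_table()
print(WORD_TABLE["thirty-four"])  # 34
print(len(WORD_TABLE))            # 100
```

A generated table like this stays a plain lookup, so it only covers the range it was built for; anything with "hundred" or larger scales still needs one of the parsing loops from the other answers.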

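If you need something more extensive, then it looks like you will need natural language processing tools. This article might be a good starting point.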

- Kena

1

There is a Ruby gem by Marc Burns that does this. I recently forked it to add support for years. You can call Ruby code from Python.

  require 'numbers_in_words'
  require 'numbers_in_words/duck_punch'

  nums = ["fifteen sixteen", "eighty five sixteen",  "nineteen ninety six",
          "one hundred and seventy nine", "thirteen hundred", "nine thousand two hundred and ninety seven"]
  nums.each {|n| p n; p n.in_numbers}

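Results:
"fifteen sixteen" 1516 "eighty five sixteen" 8516 "nineteen ninety six" 1996 "one hundred and seventy nine" 179 "thirteen hundred" 1300 "nine thousand two hundred and ninety seven" 9297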

- dimid

1

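Agreed, but until it is ported, calling Ruby code is better than nothing. - dimid

I think "fifteen sixteen" is wrong. - PascalVKooten

@yekta Right, I think recursive's answer is good within the scope of an SO answer. However, the gem provides a complete package, with tests and other features. Anyhow, I think both have their place. - dimid

There is a Python package called inflect that handles ordinals/cardinals and converting numbers to words. - yekta

@yekta, then you should suggest it in a separate answer. - dimid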

- recursive · Accepted Answer

The majority of the code is there to set up the numwords dict, which is only done on the first call.

def text2int(textnum, numwords={}):
    if not numwords:
      units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
      ]

      tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

      scales = ["hundred", "thousand", "million", "billion", "trillion"]

      numwords["and"] = (1, 0)
      for idx, word in enumerate(units):    numwords[word] = (1, idx)
      for idx, word in enumerate(tens):     numwords[word] = (1, idx * 10)
      for idx, word in enumerate(scales):   numwords[word] = (10 ** (idx * 3 or 2), 0)

    current = result = 0
    for word in textnum.split():
        if word not in numwords:
          raise Exception("Illegal word: " + word)

        scale, increment = numwords[word]
        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current

print(text2int("seven billion one hundred million thirty one thousand three hundred thirty seven"))
#7100031337
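
To see how the accumulation works, the loop can be traced on a short input. This is a minimal sketch, assuming a reduced hypothetical table (TABLE) that contains only the words needed for the example; trace is an illustrative helper, not part of the answer.

```python
# Reduced, hypothetical table: word -> (scale, increment), as in the answer.
TABLE = {"two": (1, 2), "three": (1, 3), "five": (1, 5),
         "hundred": (100, 0), "thousand": (1000, 0)}


def trace(textnum):
    """Run the answer's loop and record (word, current, result) after each word."""
    current = result = 0
    steps = []
    for word in textnum.split():
        scale, increment = TABLE[word]
        current = current * scale + increment
        if scale > 100:  # "thousand" and above close the current group
            result += current
            current = 0
        steps.append((word, current, result))
    return result + current, steps


total, steps = trace("two thousand three hundred five")
for step in steps:
    print(step)
# ('two', 2, 0)
# ('thousand', 0, 2000)
# ('three', 3, 2000)
# ('hundred', 300, 2000)
# ('five', 305, 2000)
print(total)  # 2305
```

The trace also shows why a bare "hundred" yields 0 with this loop: current starts at 0, so multiplying it by the scale leaves 0.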

115