Convert written numbers to numerals in R

24

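Does anybody know of a function that converts a text representation of a number into an actual number, e.g. "twenty thousand three hundred and five" into 20305? I have written-out numbers in the rows of a data frame and want to convert them to numeric values.

In the qdap package you can replace numerals with words (e.g. 1001 becomes one thousand one), but not the other way around: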

library(qdap)
replace_number("I like 346457 ice cream cones.")
[1] "I like three hundred forty six thousand four hundred fifty seven ice cream cones."

@Henk I rephrased your question to make it clearer that you need to convert words to numbers, not the other way around. - Paul Hiemstra
2
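I think the best approach is to shoot the person who submitted a file with numbers written as words. OK, seriously: I doubt there is any way to do this other than writing a very detailed parsing algorithm with a huge database of all the number words ("one", "two", ..., "hundred", "thousand", ..., "googol") and some sort of precedence tree classifier. For instance, your example contains two "hundred"s, but they mean different things depending on the sequence of words that follows them. - Carl Witthoft
3 Answers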

21

Here's a start that should get you to the hundreds of thousands.

word2num <- function(word){
    wsplit <- strsplit(tolower(word)," ")[[1]]
    one_digits <- list(zero=0, one=1, two=2, three=3, four=4, five=5,
                       six=6, seven=7, eight=8, nine=9)
    teens <- list(eleven=11, twelve=12, thirteen=13, fourteen=14, fifteen=15,
                  sixteen=16, seventeen=17, eighteen=18, nineteen=19)
    ten_digits <- list(ten=10, twenty=20, thirty=30, forty=40, fifty=50,
                       sixty=60, seventy=70, eighty=80, ninety=90)
    doubles <- c(teens,ten_digits)
    out <- 0
    i <- 1
    while(i <= length(wsplit)){
        j <- 1
        if(i==1 && wsplit[i]=="hundred")
            temp <- 100
        else if(i==1 && wsplit[i]=="thousand")
            temp <- 1000
        else if(wsplit[i] %in% names(one_digits))
            temp <- as.numeric(one_digits[wsplit[i]])
        else if(wsplit[i] %in% names(teens))
            temp <- as.numeric(teens[wsplit[i]])
        else if(wsplit[i] %in% names(ten_digits))
            temp <- (as.numeric(ten_digits[wsplit[i]]))
        if(i < length(wsplit) && wsplit[i+1]=="hundred"){
            if(i>1 && wsplit[i-1] %in% c("hundred","thousand"))
                out <- out + 100*temp
            else
                out <- 100*(out + temp)
            j <- 2
        }
        else if(i < length(wsplit) && wsplit[i+1]=="thousand"){
            if(i>1 && wsplit[i-1] %in% c("hundred","thousand"))
                out <- out + 1000*temp
            else
                out <- 1000*(out + temp)
            j <- 2
        }
        else if(i < length(wsplit) && wsplit[i+1] %in% names(doubles)){
            temp <- temp*100
            out <- out + temp
        }
        else{
            out <- out + temp
        }
        i <- i + j
    }
    return(list(word,out))
}

Results:

> word2num("fifty seven")
[[1]]
[1] "fifty seven"

[[2]]
[1] 57

> word2num("four fifty seven")
[[1]]
[1] "four fifty seven"

[[2]]
[1] 457

> word2num("six thousand four fifty seven")
[[1]]
[1] "six thousand four fifty seven"

[[2]]
[1] 6457

> word2num("forty six thousand four fifty seven")
[[1]]
[1] "forty six thousand four fifty seven"

[[2]]
[1] 46457

> word2num("forty six thousand four hundred fifty seven")
[[1]]
[1] "forty six thousand four hundred fifty seven"

[[2]]
[1] 46457

> word2num("three forty six thousand four hundred fifty seven")
[[1]]
[1] "three forty six thousand four hundred fifty seven"

[[2]]
[1] 346457

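I can already tell you that this won't work for word2num("four hundred thousand fifty"), because it cannot handle consecutive "hundred" and "thousand" terms, but the algorithm can probably be modified. If anyone has improvements or suggestions, feel free to edit this or build on it in your own answer. I just thought this was a fun problem to play with (for a little while).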
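One common way to lift the limitation described above is to keep two running values: the group currently being built and a grand total. The sketch below is not this answer's algorithm. `words_to_number` and its lookup tables are names made up for illustration, and it assumes fully spelled-out English ("four hundred fifty seven"), so shorthand such as "four fifty seven" is not supported.

```r
# Illustrative sketch (hypothetical helper, not part of any package).
words_to_number <- function(phrase) {
  small <- c(zero = 0, one = 1, two = 2, three = 3, four = 4, five = 5,
             six = 6, seven = 7, eight = 8, nine = 9, ten = 10,
             eleven = 11, twelve = 12, thirteen = 13, fourteen = 14,
             fifteen = 15, sixteen = 16, seventeen = 17, eighteen = 18,
             nineteen = 19, twenty = 20, thirty = 30, forty = 40,
             fifty = 50, sixty = 60, seventy = 70, eighty = 80, ninety = 90)
  scales <- c(thousand = 1e3, million = 1e6, billion = 1e9)

  # Normalise: lower-case, treat hyphens as spaces, drop "and" and empties
  words <- strsplit(tolower(gsub("-", " ", phrase)), "\\s+")[[1]]
  words <- words[!words %in% c("", "and")]

  total <- 0    # completed thousand/million/billion groups
  current <- 0  # the group (below 1000) currently being built
  for (w in words) {
    if (w %in% names(small)) {
      current <- current + small[[w]]
    } else if (w == "hundred") {
      current <- max(current, 1) * 100   # a bare "hundred" counts as 100
    } else if (w %in% names(scales)) {
      total <- total + max(current, 1) * scales[[w]]
      current <- 0
    } else {
      return(NA_real_)                   # unknown word: return NA explicitly
    }
  }
  total + current
}

words_to_number("four hundred thousand fifty")             # 400050
words_to_number("twenty thousand three hundred and five")  # 20305
words_to_number("four apples")                             # NA
```

Returning NA for unrecognised words, rather than raising an error, is a design choice that makes the helper easier to apply over a whole column of mixed input.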

Edit: Apparently, Bill Venables has a package called english that may do this even better than the code above.


2
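I tried to see whether the english package can do this. It seems to convert only in the other direction, but maybe I missed something? - Tyler Rinker
It also seems to go wrong when the string contains words that are not numbers: word2num("four apples") - rsylatian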

3
I wrote an R package to do this a few years ago, https://github.com/fsingletonthorn/words_to_numbers. It converts number words into digits and handles magnitudes up to large powers of ten.
devtools::install_github("fsingletonthorn/words_to_numbers")

library(wordstonumbers)

example_input <- "twenty thousand three hundred and five"

words_to_numbers(example_input)

[1] "20305"


It also works for more complex cases, such as the one in the qdap example:
words_to_numbers('I like three hundred forty six thousand four hundred fifty seven ice cream cones.')
[1] "I like 346457 ice cream cones."
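Since the question mentions numbers stored in data frame rows, a converter like the ones in this thread could be applied to a whole column. The sketch below is only an illustration: `add_numeric_column` and `toy_converter` are hypothetical names, and the toy converter stands in for whichever real converter is used. The `as.numeric()` call is there because a converter may return a string, as the output shown above does.

```r
# Hypothetical stand-in converter; swap in the converter you actually use.
toy_converter <- function(x) {
  lookup <- c(one = 1, two = 2, three = 3)
  unname(lookup[tolower(x)])   # NA for words outside the tiny lookup table
}

# Add a numeric column "<column>_num" next to a column of number words.
add_numeric_column <- function(df, column, converter) {
  values <- vapply(as.character(df[[column]]),
                   function(x) as.numeric(converter(x)),
                   numeric(1), USE.NAMES = FALSE)
  df[[paste0(column, "_num")]] <- values
  df
}

df <- data.frame(amount = c("one", "three", "four"), stringsAsFactors = FALSE)
result <- add_numeric_column(df, "amount", toy_converter)
result$amount_num  # 1 3 NA
```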

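When I try to install the package I get an error saying "namespace 'rlang' 1.0.2 is already loaded, but >= 1.0.3 is required." I have tried uninstalling and reinstalling rlang but still get the same error. Any suggestions? - Rasputin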
1
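@Rasputin ~ I can't reproduce your error, but it sounds like rlang was not successfully updated to v. >= 1.0.3. I'd suggest closing R and opening a fresh session (i.e., with no objects in memory), running install.packages("rlang"), checking that it updated successfully, restarting R once rlang >= 1.0.3 is installed, and then seeing whether the error still occurs when installing words_to_numbers. - FelixST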
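The "check that it updated successfully" step could look something like the sketch below. `has_min_version` is a hypothetical helper written for this illustration. It relies on `system.file()` and `utils::packageVersion()`, which read the installed package's metadata without loading its namespace, so an old copy does not get pulled into the fresh session.

```r
# Hypothetical helper: is `pkg` installed at version `min_version` or newer?
has_min_version <- function(pkg, min_version) {
  # system.file() returns "" when the package is not installed at all
  if (!nzchar(system.file(package = pkg))) return(FALSE)
  utils::packageVersion(pkg) >= min_version
}

# In a fresh session, after install.packages("rlang"):
has_min_version("rlang", "1.0.3")
```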

-2

Here is what I think is a better solution.

library(stringdist)
library(gdata)
#Check whether a single word is within `dist` edits of a number word
isNumericWord=function(string, dist=1, method="dl"){
  nums=c("zero","one","two","three","four","five","six","seven","eight","nine",
         "ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen","seventeen","eighteen","nineteen",
         "twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety",
         "hundred","thousand","million","billion","trillion")
  return(any(stringdist(tolower(string),nums,method=method)<=dist))
}
numberTypes=function(string, dist=1, method="dl"){
  nums=c("zero","one","two","three","four","five","six","seven","eight","nine",
         "ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen","seventeen","eighteen","nineteen",
         "twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety",
         "hundred","thousand","million","billion","trillion")
  string=gsub("[[:punct:]]"," ",string)
  wrdsplit=strsplit(string,split=" ")[[1]]
  wrdsplit=wrdsplit[wrdsplit!=""]
  #Handle number types
  wrdsplit=ifelse(stringdist("first",tolower(wrdsplit),method=method)<=dist,"one st",wrdsplit)
  wrdsplit=ifelse(stringdist("second",tolower(wrdsplit),method=method)<=dist,"two nd",wrdsplit)
  wrdsplit=ifelse(stringdist("third",tolower(wrdsplit),method=method)<=dist &
                    tolower(substr(wrdsplit,nchar(wrdsplit),nchar(wrdsplit)))!="y","three rd",wrdsplit)
  wrdsplit=ifelse(stringdist("fourth",tolower(wrdsplit),method=method)<=dist & 
                    tolower(substr(wrdsplit,nchar(wrdsplit),nchar(wrdsplit)))!="y","four th",wrdsplit)
  wrdsplit=ifelse(stringdist("fifth",tolower(wrdsplit),method=method)<=dist & 
                    tolower(substr(wrdsplit,nchar(wrdsplit),nchar(wrdsplit)))!="y","five th",wrdsplit)
  wrdsplit=ifelse(stringdist("sixth",tolower(wrdsplit),method=method)<=dist & 
                    tolower(substr(wrdsplit,nchar(wrdsplit),nchar(wrdsplit)))!="y","six th",wrdsplit)
  wrdsplit=ifelse(stringdist("seventh",tolower(wrdsplit),method=method)<=dist &
                    tolower(substr(wrdsplit,nchar(wrdsplit),nchar(wrdsplit)))!="y","seven th",wrdsplit)
  wrdsplit=ifelse(stringdist("eighth",tolower(wrdsplit),method=method)<=dist &
                    tolower(substr(wrdsplit,nchar(wrdsplit),nchar(wrdsplit)))!="y","eight th",wrdsplit)
  wrdsplit=ifelse(stringdist("ninth",tolower(wrdsplit),method=method)<=dist &
                    tolower(substr(wrdsplit,nchar(wrdsplit),nchar(wrdsplit)))!="y","nine th",wrdsplit)
  wrdsplit=ifelse(stringdist("tenth",tolower(wrdsplit),method=method)<=dist,"ten th",wrdsplit)
  wrdsplit=ifelse(stringdist("twentieth",tolower(wrdsplit),method=method)<=dist,"twenty th",wrdsplit)
  wrdsplit=ifelse(stringdist("thirtieth",tolower(wrdsplit),method=method)<=dist,"thirty th",wrdsplit)
  wrdsplit=ifelse(stringdist("fortieth",tolower(wrdsplit),method=method)<=dist,"forty th",wrdsplit)
  wrdsplit=ifelse(stringdist("fiftieth",tolower(wrdsplit),method=method)<=dist,"fifty th",wrdsplit)
  wrdsplit=ifelse(stringdist("sixtieth",tolower(wrdsplit),method=method)<=dist,"sixty th",wrdsplit)
  wrdsplit=ifelse(stringdist("seventieth",tolower(wrdsplit),method=method)<=dist,"seventy th",wrdsplit)
  wrdsplit=ifelse(stringdist("eightieth",tolower(wrdsplit),method=method)<=dist,"eighty th",wrdsplit)
  wrdsplit=ifelse(stringdist("ninetieth",tolower(wrdsplit),method=method)<=dist,"ninety th",wrdsplit)
  #Handle other number words that end in "th"
  if(length(wrdsplit)>0){
    for(i in 1:length(wrdsplit)){
      substr_end=substr(wrdsplit[i],(nchar(wrdsplit[i])-1),nchar(wrdsplit[i]))
      substr_beg=substr(wrdsplit[i],1,(nchar(wrdsplit[i])-2))
      if(substr_end=="th" & nchar(wrdsplit[i])!=2 & any(stringdist(tolower(substr_beg),nums,method=method)<=dist)){
        wrdsplit[i]=paste(substr_beg, substr_end,sep=" ")
      }
    }
    return(gsub("  "," ",paste(wrdsplit,collapse=" ")))
  }else{
    return("")
  }
}

#Convert number words to digits
Word2Num=function(string, dist=1, method="dl"){
  original=string
  #Define numbers
  one_digits = list(zero=0, one=1, two=2, three=3, four=4, five=5,
                    six=6, seven=7, eight=8, nine=9)
  teens = list(eleven=11, twelve=12, thirteen=13, fourteen=14, fifteen=15,
               sixteen=16, seventeen=17, eighteen=18, nineteen=19)
  ten_digits = list(ten=10, twenty=20, thirty=30, forty=40, fifty=50,
                    sixty=60, seventy=70, eighty=80, ninety=90)
  large_digits = list(hundred=100, thousand=1000, million=1e6, billion=1e9, trillion=1e12)
  double_digits = c(teens,ten_digits)

  #Split the string into words
  string=gsub("-"," ",gsub(" & ", " and ",string,ignore.case=T))
  string=numberTypes(string)
  wrdsplit=strsplit(tolower(string)," ")[[1]]
  wrdsplit=wrdsplit[wrdsplit!=""]
  isNumber=apply(data.frame(wrdsplit),1,isNumericWord)

  #Find groups of numbers
  if(exists("groups")){
    suppressWarnings(rm(groups))
  }
  i=1
  while(i <= length(wrdsplit)){
    if(isNumber[i]==T){
      if(!exists("groups")){
        groups=list(wrdsplit[i])
      }else if(exists("groups")){
        groups=c(groups, wrdsplit[i])
      }
      for(j in (i+1):length(wrdsplit)){
        if(isNumber[j]){
          groups[[length(groups)]]=c(groups[[length(groups)]],wrdsplit[j])
          i=j+1
        }else{
          i=i+1
          break
        }
      }
    }else{
      i=i+1
    }
  }

  #Convert numeric words to numbers
  if(exists("groups")){
    groupNums=groups
    for(j in 1:length(groups)){
      for(i in 1:length(groups[[j]])){
        #If word is a single digit number
        if(any(stringdist(groups[[j]][i],names(one_digits),method=method)<=dist & 
               tolower(substr(groups[[j]][i],nchar(groups[[j]][i]),nchar(groups[[j]][i])))!="y")){
          #If word is a single digit number
          groupNums[[j]][i]=one_digits[stringdist(groups[[j]][i],names(one_digits),method=method)<=dist][[1]]
        }else if(any(stringdist(groups[[j]][i],names(double_digits),method=method)<=dist)){
          #If word is a double digit number
          groupNums[[j]][i]=double_digits[stringdist(groups[[j]][i],names(double_digits),method=method)<=dist][[1]]
        }else if(any(stringdist(groups[[j]][i],names(large_digits),method=method)<=dist)){
          #If word is a large digit number
          groupNums[[j]][i]=large_digits[stringdist(groups[[j]][i],names(large_digits),method=method)<=dist][[1]]
        }
      }
    }

    #Convert the separated numbers to a single number
    defscipen=options("scipen")[[1]]
    options(scipen=999)
    for(i in 1:length(groups)){
      if(length(groupNums[[i]])==1){
        groupNums[[i]]=as.numeric(groupNums[[i]][1])
      }else{
        while(length(groupNums[[i]])>=2){
          if(nchar(groupNums[[i]][2])>nchar(groupNums[[i]][1])){
            #If the next word has more digits than the current word, multiply them
            temp=as.numeric(groupNums[[i]][1])*as.numeric(groupNums[[i]][2])
          }else if(nchar(groupNums[[i]][2])<nchar(groupNums[[i]][1])){
            #if the next word has less digits than the current word, add them
            temp=as.numeric(groupNums[[i]][1])+as.numeric(groupNums[[i]][2])
          }
          #Combine the results
          if(length(groupNums[[i]])>2){
            groupNums[[i]]=c(temp, groupNums[[i]][3:length(groupNums[[i]])])
          }else{
            groupNums[[i]]=temp
          }
        }
      }
    }
    #Recreate the original string
    groupNums=lapply(groupNums, as.character)
    options(scipen=defscipen)
    for(i in 1:length(groups)){
      wrdsplit[which(wrdsplit==groups[[i]][1])]=groupNums[[i]][1]
      if(length(groups[[i]]>1)){
        wrdsplit[which(wrdsplit==groups[[i]][2:length(groups)])]=""
      }
    }
    #Combine numbers with their endings
    wrdsplit=wrdsplit[wrdsplit!=""]
    if(any(wrdsplit[which(wrdsplit %in% unlist(groupNums))+1] %in% c("rd","th","st","nd"))){
      locs=which(wrdsplit %in% unlist(groupNums))
      for(i in length(locs):1){
        wrdsplit[locs[i]]=paste(wrdsplit[c(locs[i],(locs[i]+1))],collapse="")
        wrdsplit=wrdsplit[-(locs[i]+1)]
      }
    }
    return(trim(paste(wrdsplit,collapse=" ")))
  }else{
    return(original)
  }
}

Unfortunately, this code does not work. Here are some tests (after running the code above):
> isNumericWord("one hundred")
[1] FALSE
> Word2Num("one hundred")
Error in groups[[j]][i] : object of type 'closure' is not subsettable
> isNumericWord("100")
[1] FALSE
> Word2Num("five thousand")
Error in groups[[j]][i] : object of type 'closure' is not subsettable
- Rasputin
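One plausible cause of the 'closure' error reported above (an assumption, not confirmed by the answer's author) is that `exists("groups")` also searches enclosing environments by default. If any attached package or the workspace defines a function called `groups`, the check succeeds and the code then tries to subset that function. The sketch below reproduces the lookup behaviour; the `groups` function in it is a hypothetical stand-in for such an object.

```r
# Hypothetical stand-in for a function named `groups` that an attached
# package or the global workspace might define.
groups <- function(x) x

# Default lookup: also searches enclosing environments, so it finds the
# function above even though the caller has no local `groups`.
find_inherited <- function() exists("groups")

# Restricted lookup: only the calling function's own frame is checked.
find_local <- function() exists("groups", inherits = FALSE)

find_inherited()  # TRUE
find_local()      # FALSE
```

Under that assumption, initialising the variable explicitly inside the function (for example to an empty list) or passing `inherits = FALSE` would avoid picking up the unrelated object.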
