How to split text into two meaningful words in R

Question

3

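This is the text in my data frame df, which has a text column named 'problem_note_text':

Issue: note dispenser failure / performed checks / dispenser failure / asked the store to take out the note dispenser and reinstall it / still shows an error message saying the front door is not closed / so the customer experience (CE) contact is needed: Olivia taber 01159063390 / 7am-11pm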

library(stringr)  # str_replace_all
library(tm)       # removeNumbers, removeWords, stopwords
library(qdap)     # all_words

df$problem_note_text <- tolower(df$problem_note_text)
df$problem_note_text <- tm::removeNumbers(df$problem_note_text)
df$problem_note_text <- str_replace_all(df$problem_note_text, "  ", " ") # replace double spaces with single space
df$problem_note_text <- str_replace_all(df$problem_note_text, pattern = "[[:punct:]]", " ")
df$problem_note_text <- tm::removeWords(x = df$problem_note_text, stopwords(kind = "english"))
Words <- all_words(df$problem_note_text, begins.with = NULL)

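I now have a data frame containing a list of words, but it includes words like "Failureperformed" that need to be split into two meaningful words, "Failure" and "performed". How can I do this? The words data frame also contains words such as "im" and "h" that make no sense and need to be removed. I don't know how to achieve this.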

- Shweta Kamble

3

Without a pattern, this can't be done. - akrun

5

For a word like nowhere, I would read it as a single word meaning "no place", rather than break it into "no" and "where" or "now" and "here". - nrussell

Can you share some of your data? If "sensor advised" is two words in your documents, you could try changing your preprocessing so that the space isn't lost. - Steve Bronder

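My guess is that your raw data uses a hyphen as a separator (e.g., sensor-advised). If you can share some of the data that causes the problem (a simple search should find the original words responsible), we can guide you better. The following qdap document can help with debugging and cleaning text to isolate the problem: http://cran.r-project.org/web/packages/qdap/vignettes/cleaning_and_debugging.pdf - Tyler Rinker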

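I agree with Steve_Corrin that better tokenization would probably solve this, without needing lookups to disambiguate splits after the words have already been joined. Try installing the dev branch of quanteda: devtools::install_github("kbenoit/quantedaData"), and then if you use tokenize(df$problem_note_text, removePunct = TRUE), it should separate "sensor-advised", or any two words separated by a non-space, non-word character (other than "_"). - Ken Benoit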
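
As a rough base-R illustration of the tokenization idea in the comments above (this is not quanteda's implementation), the sketch below splits on any run of characters that are neither letters, digits nor "_", so hyphen- or slash-separated words stay apart instead of being fused. The helper name `split_on_nonword` and the sample string are made up for illustration.

```r
# Hypothetical helper: split each string on runs of non-word characters
# (anything except letters, digits and "_"), then drop empty tokens.
split_on_nonword <- function(x) {
  tokens <- strsplit(tolower(x), "[^[:alnum:]_]+")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

# Returns a list with one character vector of tokens per input string.
split_on_nonword("sensor-advised/door open")
```

For the sample string this gives the four tokens sensor, advised, door and open, so no lookup-based splitting is needed afterwards.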

1 Answer

- josliber · Accepted Answer

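Given a list of English words, you can do this quite simply by looking up every possible split of the word in the list. I'll use the first Google hit I found for my word list, which contains about 70k lower-case words: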

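# Word list of lower-case English words; read.table puts them in column V1
wl <- read.table("http://www-personal.umich.edu/~jlawler/wordlist")$V1

# Return every split of x into two parts that both appear in wl
check.word <- function(x, wl) {
  x <- tolower(x)
  nc <- nchar(x)
  # 2 x (nc-1) matrix: column y holds the first y characters and the remainder
  parts <- sapply(1:(nc-1), function(y) c(substr(x, 1, y), substr(x, y+1, nc)))
  # keep only the columns where both halves are known words
  parts[,parts[1,] %in% wl & parts[2,] %in% wl]
}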

Sometimes this works:

check.word("screenunable", wl)
# [1] "screen" "unable"
check.word("nowhere", wl)
#      [,1]    [,2]  
# [1,] "no"    "now" 
# [2,] "where" "here"

But it also sometimes fails when the relevant words aren't in the word list (in this case "sensor" is missing):

check.word("sensoradvise", wl)
#     
# [1,]
# [2,]
"sensor" %in% wl
# [1] FALSE
"advise" %in% wl
# [1] TRUE
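
One possible way to work around a missing entry, and to drop fragments such as "im" and "h" from the question, is to add domain terms to the word list and discard tokens that are still unknown. The sketch below is self-contained and rests on several assumptions: `small_wl` is a tiny made-up word list standing in for the downloaded one, `domain_terms` is an assumed set of extra vocabulary, and `check.word2` and `split_or_keep` are hypothetical helpers, not part of any package.

```r
# Tiny made-up word list standing in for the ~70k-word list used above.
small_wl <- c("screen", "unable", "advise", "failure", "performed", "door")
# Assumed domain vocabulary that the general list lacks.
domain_terms <- c("sensor", "dispenser")
wl2 <- union(small_wl, domain_terms)

# Same lookup idea as check.word, but it always returns a 2-row matrix
# (one column per valid split), even for zero or one match.
check.word2 <- function(x, wl) {
  x <- tolower(x)
  nc <- nchar(x)
  if (nc < 2) return(matrix(character(0), nrow = 2, ncol = 0))
  cuts <- seq_len(nc - 1)
  parts <- rbind(substring(x, 1, cuts), substring(x, cuts + 1, nc))
  parts[, parts[1, ] %in% wl & parts[2, ] %in% wl, drop = FALSE]
}

# Keep known words, split words with exactly one valid split,
# and drop everything else (unknown fragments and ambiguous splits).
split_or_keep <- function(words, wl) {
  out <- lapply(words, function(w) {
    w <- tolower(w)
    if (w %in% wl) return(w)
    sp <- check.word2(w, wl)
    if (ncol(sp) == 1) return(sp[, 1])
    character(0)
  })
  unlist(out)
}

split_or_keep(c("sensoradvise", "Failureperformed", "door", "im", "h"), wl2)
```

With these made-up lists, "sensoradvise" and "Failureperformed" are each split in two, "door" is kept, and "im" and "h" are dropped. Dropping words with several valid splits is a design choice made here for simplicity; a word like "nowhere" is kept whole as long as it is itself in the word list.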