How do I remove non-alphabetic characters and convert all letters to lowercase in R?

Question

4

Given the following string:

"I may opt for a yam for Amy, May, and Tommy."

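How can I, in R, remove the non-alphabetic characters, convert all letters to lowercase, and sort the letters within each word?

I am also trying to sort the words in the sentence and remove duplicates.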

- Yanyan

3

Can you show what you have tried so far? - zero323

1

Can you provide an example string and the expected output? To convert to lowercase, just use tolower. - Molx

3

Sort the letters within each word. - hrbrmstr

4 Answers

5

You can use stringi.

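library(stringi)
txt <- "I may opt for a yam for Amy, May, and Tommy."
# Extract the words, lowercase them, sort them, and drop duplicates
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE))))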

Which gives:

## [1] "a"     "amy"   "and"   "for"   "i"     "may"   "opt"   "tommy" "yam"

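Update

As @DavidArenburg mentioned, I overlooked the "sort the letters within each word" part of your question. You did not provide the desired output, and no immediate application comes to mind, but assuming you want to identify which words have a matching counterpart (a string distance of 0):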

library(stringdist)  # provides stringdistmatrix()
library(magrittr)    # provides the %>% pipe
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>%
  stringdistmatrix(., ., useNames = "strings", method = "qgram") %>%

#       a amy and for i may opt tommy yam
# a     0   2   2   4 2   2   4     6   2
# amy   2   0   4   6 4   0   6     4   0
# and   2   4   0   6 4   4   6     8   4
# for   4   6   6   0 4   6   4     6   6
# i     2   4   4   4 0   4   4     6   4
# may   2   0   4   6 4   0   6     4   0
# opt   4   6   6   4 4   6   0     4   6
# tommy 6   4   8   6 6   4   4     0   4
# yam   2   0   4   6 4   0   6     4   0

  apply(., 1, function(x) sum(x == 0, na.rm=TRUE)) 

# a   amy   and   for     i   may   opt tommy   yam 
# 1     3     1     1     1     3     1     1     3

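Every word has a 0 against itself, so words with more than one 0 in their row (e.g. "amy", "may", "yam") have a scrambled counterpart: with the default q = 1, the q-gram distance only compares letter counts, so anagrams are at distance 0.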
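
The same grouping can also be sketched in base R by comparing sorted-letter signatures instead of computing a distance matrix. This is only an illustrative alternative, not the approach used above; the names words, signature, groups and anagrams are made up for the example, and it assumes plain ASCII letters.

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Keep letters and spaces only, lowercase, split on spaces, drop duplicates
words <- unique(strsplit(tolower(gsub("[^A-Za-z ]", "", txt)), " +")[[1]])

# Signature of a word: its letters in sorted order ("may" -> "amy")
signature <- vapply(
  strsplit(words, ""),
  function(chars) paste(sort(chars), collapse = ""),
  character(1)
)

# Group words by signature and keep only groups with more than one member
groups <- split(words, signature)
anagrams <- Filter(function(g) length(g) > 1, groups)
anagrams
# $amy
# [1] "may" "yam" "amy"
```

Words that share a signature are anagrams of one another, which is the same set that shows more than one 0 per row in the distance matrix.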

- Steven Beaupré

1

I tend to use stringr these days, since it uses stringi under the hood, but the stri_extract_all_words function looks really handy. I may need to start using stringi again. - hrbrmstr

1

Yes. stringr is simpler, but I find stringi more flexible. - Steven Beaupré

@hrbrmstr I think you all overlooked the "sort the letters within each word" requirement. - David Arenburg

1

But what does that even mean? - hrbrmstr

1

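@DavidArenburg, the OP did ask to sort the letters within each word. That does not make sense to me. The post is a bit sparse. I think the question would be clearer if the OP provided the desired output, since what they are asking for does not seem to have an obvious direct application. - Tyler Rinker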

4

str <- "I may opt for a yam for Amy, May, and Tommy."

## Clean the words (just keep letters and convert to lowercase)
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]]

## split the words into characters and sort them
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, ""))))

## Join the sorted letters back together
sapply(sortedWords, paste, collapse="")

# i     may     opt     for       a     yam     for     amy     may     and 
# "i"   "amy"   "opt"   "for"     "a"   "amy"   "for"   "amy"   "amy"   "adn" 
# tommy 
# "mmoty" 

## If you want to convert result back to string
do.call(paste, lapply(sortedWords, paste, collapse=""))
# [1] "i amy opt for a amy for amy amy adn mmoty"
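
A slightly more defensive sketch of the same steps, in case it is useful: it uses vapply so the result is always a plain character vector, and paste(collapse = " ") instead of do.call on a named list, so a word that happens to be called "sep" or "collapse" cannot be mistaken for an argument of paste. The helper name sort_letters is hypothetical, and the sketch assumes ASCII letters only.

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Keep letters and spaces only, lowercase, then split on runs of spaces
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", txt)), " +")[[1]]

# Hypothetical helper: sort the letters of a single word
sort_letters <- function(word) {
  paste(sort(strsplit(word, "")[[1]]), collapse = "")
}

# vapply guarantees one string per word; USE.NAMES = FALSE drops the names
sorted <- vapply(words, sort_letters, character(1), USE.NAMES = FALSE)

# Join the per-word results back into a single string
result <- paste(sorted, collapse = " ")
result
# [1] "i amy opt for a amy for amy amy adn mmoty"
```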

- Rorschach

4

The qdap package, which I maintain, has a bag_o_words function that is well suited to this task:

txt <- "I may opt for a yam for Amy, May, and Tommy."

library(qdap)

unique(sort(bag_o_words(txt)))

## [1] "a"     "amy"   "and"   "for"   "i"     "may"   "opt"   "tommy" "yam"

- Tyler Rinker

- hrbrmstr · Accepted Answer

stringr lets you work with all character sets in R at C speed, and magrittr lets you use the pipe idiom, which suits your needs well:

library(stringr)
library(magrittr)

txt <- "I may opt for a yam for Amy, May, and Tommy."

txt %>% 
  str_to_lower %>%                                            # lowercase
  str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>%    # only alpha
  str_replace_all("[[:space:]]+", " ") %>%                    # single spaces
  str_split(" ") %>%                                          # tokenize
  extract2(1) %>%                                             # str_split returns a list
  sort %>%                                                    # sort
  unique                                                      # unique words

  ## [1] "a"     "amy"   "and"   "for"   "i"     "may"   "opt"   "tommy" "yam"
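
For readers without stringr or magrittr installed, the same steps can be approximated in base R. This is a rough sketch that mirrors the pipeline above rather than a drop-in replacement: it assumes ASCII input, and the intermediate names cleaned, tokens and unique_words are made up for the example.

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Lowercase, then strip punctuation, digits and control characters
cleaned <- gsub("[[:punct:][:digit:][:cntrl:]]", "", tolower(txt))

# Collapse runs of whitespace into single spaces
cleaned <- gsub("[[:space:]]+", " ", cleaned)

# Tokenize on single spaces; strsplit returns a list, so take element 1
tokens <- strsplit(cleaned, " ", fixed = TRUE)[[1]]

# Sort the words and drop duplicates
unique_words <- unique(sort(tokens))
unique_words
# [1] "a"     "amy"   "and"   "for"   "i"     "may"   "opt"   "tommy" "yam"
```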