Remove elements from a vector that are substrings of another element


Is there a better way to do this? I want to remove from this vector every string that occurs as a substring of another element.

words = c("please can you", 
  "please can", 
  "can you", 
  "how did you", 
  "did you",
  "have you")
> words
[1] "please can you" "please can"     "can you"        "how did you"    "did you"        "have you"

library(data.table)
library(stringr)
dt = setDT(expand.grid(word1 = words, word2 = words, stringsAsFactors = FALSE))
dt[, found := str_detect(word1, word2)]
setdiff(words, dt[found == TRUE & word1 != word2, word2])
[1] "please can you" "how did you"    "have you" 

This works, but it feels verbose, and I would like to know of a more elegant solution.

CJ is data.table's faster equivalent of expand.grid. - Rorschach
Just adding some information for anyone following along: CJ is much faster. I took 12431 rows averaging 15.69 words per row, 195065 words in total, and ran system.time(dt <- setDT(expand.grid(word1 = words, word2 = words, stringsAsFactors = FALSE))), which finished in user system elapsed 8.414 3.387 13.854, whereas system.time(dt1 <- CJ(words,words,unique = TRUE)) finished in user system elapsed 0.932 0.365 1.320. An order of magnitude difference. - Shawn Mehan
Great, thanks for running the benchmark. - Akhil Nair
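
As a rough sketch of what the comments above suggest, the question's approach could be rewritten with CJ in place of expand.grid. The column names word1 and word2 are carried over from the question. Wrapping the pattern in fixed() is an addition not present in the original code: it is assumed here that literal matching is wanted, so that strings containing regex metacharacters are not misread.

```r
library(data.table)
library(stringr)

words <- c("please can you", "please can", "can you",
           "how did you", "did you", "have you")

# CJ() builds the cross join directly as a data.table,
# so no setDT(expand.grid(...)) round trip is needed.
dt <- CJ(word1 = words, word2 = words, unique = TRUE)

# fixed() (an assumption, see above) makes str_detect treat word2
# as a literal string rather than a regular expression.
dt[, found := str_detect(word1, fixed(word2))]

# Any word2 found inside a different word1 is a substring of another element.
contained <- dt[found & word1 != word2, unique(word2)]

kept <- setdiff(words, contained)
kept
# [1] "please can you" "how did you"    "have you"
```

Note that CJ sorts its output by default, which does not matter here because setdiff keeps the order of words.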
1 Answer


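Search for each element of words within words, and keep the elements that are found only once (that is, those that match only themselves):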

words[colSums(sapply(words, grepl, words, fixed = TRUE)) == 1]

giving:

[1] "please can you" "how did you"    "have you"   
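
To make the one-liner easier to follow, here is the same computation split into steps. The intermediate names hits and counts are introduced only for illustration. The final part is an assumption about the desired behaviour rather than part of the answer: if words contained exact duplicates, each copy would match the other and both would be dropped, so deduplicating first with unique() may be what is wanted.

```r
words <- c("please can you", "please can", "can you",
           "how did you", "did you", "have you")

# hits[i, j] is TRUE when words[j] occurs literally inside words[i];
# each column corresponds to one search pattern.
hits <- sapply(words, grepl, words, fixed = TRUE)

# Number of elements that contain each pattern. Every string contains
# itself, so a count of 1 means "contained in no other element".
counts <- colSums(hits)

result <- words[counts == 1]
result
# [1] "please can you" "how did you"    "have you"

# Assumed handling of duplicates: deduplicate before counting.
dup_words <- c(words, "have you")
u <- unique(dup_words)
result_dedup <- u[colSums(sapply(u, grepl, u, fixed = TRUE)) == 1]
```

Like the cross-join approach in the question, this compares every pair of elements, so the work grows quadratically with the length of words.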

Awesome, thanks a lot! - Akhil Nair
