Is there a better way to do this? I want to remove from this vector every string that occurs as a substring of another element.
words = c("please can you",
"please can",
"can you",
"how did you",
"did you",
"have you")
> words
[1] "please can you" "please can" "can you" "how did you" "did you" "have you"
library(data.table)
library(stringr)
dt = setDT(expand.grid(word1 = words, word2 = words, stringsAsFactors = FALSE))
dt[, found := str_detect(word1, word2)]
setdiff(words, dt[found == TRUE & word1 != word2, word2])
[1] "please can you" "how did you" "have you"
This works, but it feels verbose, and I would like to know whether there is a more elegant solution.
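For comparison, here is one possible shorter formulation in base R. It is a sketch rather than a definitive answer: the helper name drop_substrings is hypothetical, and it assumes that literal (non-regex) matching is wanted and that exact duplicates can be collapsed first.

```r
# Hypothetical helper, not from the original post: keep only the strings
# that are not contained in any other element of the vector.
drop_substrings <- function(x) {
  x <- unique(x)  # assumption: exact duplicates collapse to a single entry
  is_sub <- vapply(
    seq_along(x),
    # fixed = TRUE matches x[i] literally instead of as a regular expression
    function(i) any(grepl(x[i], x[-i], fixed = TRUE)),
    logical(1)
  )
  x[!is_sub]
}

words <- c("please can you", "please can", "can you",
           "how did you", "did you", "have you")
drop_substrings(words)
# [1] "please can you" "how did you"    "have you"
```

This still compares every string against every other one, so the amount of work grows quadratically with the length of the vector, just as with the expand.grid approach; it only avoids materialising the full grid.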
CJ is data.table's faster equivalent of expand.grid. - Rorschach
CJ is far faster. I took 12431 rows averaging 15.69 words per row, 195065 words in total, and ran them through system.time(dt <- setDT(expand.grid(word1 = words, word2 = words, stringsAsFactors = FALSE))), which finished in user system elapsed 8.414 3.387 13.854, whereas system.time(dt1 <- CJ(words,words,unique = TRUE)) finished in user system elapsed 0.932 0.365 1.320. An order of magnitude difference. - Shawn Mehan
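The suggestion in these comments could be applied to the code from the question roughly as follows. This is a minimal sketch: naming the CJ arguments word1 and word2 and wrapping the pattern in fixed() are additions made here, not part of the original comments.

```r
library(data.table)
library(stringr)

words <- c("please can you", "please can", "can you",
           "how did you", "did you", "have you")

# CJ() builds the cross join directly as a data.table; naming the
# arguments names the resulting columns (an addition for readability).
dt <- CJ(word1 = words, word2 = words, unique = TRUE)

# fixed() is an assumption made here: it matches word2 literally, so
# regex metacharacters in the strings are not interpreted as patterns.
dt[, found := str_detect(word1, fixed(word2))]

# Drop every string that was found inside a different string.
result <- setdiff(words, dt[found == TRUE & word1 != word2, word2])
result
# [1] "please can you" "how did you"    "have you"
```

CJ sorts its output by default, but because setdiff takes its order from words, the result comes back in the original order.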