Extracting character-level n-grams from text in R

Question

3

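I have a data frame containing text, and I want to extract the character-level bigrams (n = 2) of each text in R, for example "st", "ac", "ck".

I also want to count how often each character-level bigram occurs in the text.

Data: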

df$text

[1] "hy my name is"
[2] "stackover flow is great"
[3] "how are you"

- Ana Wilmer

2 Answers

3

In addition to Allan's answer, you can use the qgrams function from the stringdist package, combined with gsub to remove the spaces.

library(stringdist)
qgrams(gsub(" ", "", df$text), q = 2)

   hy ym yn yo my na st ta ve wi wa ov rf sg ow re ou me is ko lo am ei er fl gr ho ey ck ea at ar ac
V1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  1  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1
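
The gsub step matters because, in a plain sliding window, a space is a character like any other. The following base R sketch (the helper `sliding_bigrams` is a made-up name for illustration, not a stringdist function) shows what such a window yields with and without the spaces. Note that once the spaces are gone, pairs that straddle two words, such as "ym", are produced as well:

```r
# Hypothetical helper: every consecutive pair of characters in one string
sliding_bigrams <- function(s) {
  starts <- seq_len(nchar(s) - 1)
  substring(s, starts, starts + 1)
}

text <- "hy my"

# Spaces kept: some pairs contain a space
with_spaces <- sliding_bigrams(text)
with_spaces
#> [1] "hy" "y " " m" "my"

# Spaces removed first: no space pairs, but "ym" spans two words
without_spaces <- sliding_bigrams(gsub(" ", "", text, fixed = TRUE))
without_spaces
#> [1] "hy" "ym" "my"
```

If bigrams that cross word boundaries are not wanted, splitting the text into words first avoids them.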

- phiver

- Allan Cameron · Accepted Answer

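I'm not entirely sure what output you expect. I would have thought the bigrams of the word "stack" should be "st", "ta", "ac" and "ck", since this captures every consecutive pair of characters.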

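For example, if you wanted to know how many times the bigram "th" occurs in the word "brothers" and you split the word into the bigrams "br", "ot", "he" and "rs", you would get the wrong answer of 0.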
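
To make the difference concrete, here is a small base R sketch that splits "brothers" both ways and counts "th". The variable names are illustrative only and are not used elsewhere in this answer:

```r
word <- "brothers"
n <- nchar(word)

# Non-overlapping split: characters 1-2, 3-4, 5-6, 7-8
starts_disjoint <- seq(1, n - 1, by = 2)
disjoint <- substring(word, starts_disjoint, starts_disjoint + 1)
disjoint
#> [1] "br" "ot" "he" "rs"

# Overlapping (sliding-window) split: every consecutive pair of characters
starts_sliding <- seq_len(n - 1)
sliding <- substring(word, starts_sliding, starts_sliding + 1)
sliding
#> [1] "br" "ro" "ot" "th" "he" "er" "rs"

sum(disjoint == "th")
#> [1] 0
sum(sliding == "th")
#> [1] 1
```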

You can get all the bigrams by writing a few small functions:

# This function takes a vector of single characters and creates all the bigrams
# within that vector. For example "s", "t", "a", "c", "k" becomes 
# "st", "ta", "ac", and "ck"

pair_chars <- function(char_vec) {
  all_pairs <- paste0(char_vec[-length(char_vec)], char_vec[-1])
  return(as.vector(all_pairs[nchar(all_pairs) == 2]))
}

# This function splits a single word into a character vector and gets its bigrams

word_bigrams <- function(words){
  unlist(lapply(strsplit(words, ""), pair_chars))
}

# This function splits a string or vector of strings into words and gets their bigrams

string_bigrams <- function(strings){
  unlist(lapply(strsplit(strings, " "), word_bigrams))
}

Now we can test this on your example:

df <- data.frame(text = c("hy my name is", "stackover flow is great", 
                          "how are you"), stringsAsFactors = FALSE)

string_bigrams(df$text)
#>  [1] "hy" "my" "na" "am" "me" "is" "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "fl"
#> [16] "lo" "ow" "is" "gr" "re" "ea" "at" "ho" "ow" "ar" "re" "yo" "ou"

If you want to count the occurrences, just use table:

table(string_bigrams(df$text))

#> ac am ar at ck ea er fl gr ho hy is ko lo me my na ou ov ow re st ta ve yo 
#>  1  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1  1  1  1  2  2  1  1  1  1
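
The question title mentions n-grams in general, and the question asks about each text separately. As a rough sketch of both extensions, assuming n-grams should stay inside word boundaries as above, the sliding window can be written with substring. The helper name `word_ngrams` is hypothetical and is not part of any package:

```r
df <- data.frame(text = c("hy my name is", "stackover flow is great",
                          "how are you"), stringsAsFactors = FALSE)

# Hypothetical helper: character n-grams found inside each word of `strings`
word_ngrams <- function(strings, n = 2) {
  words <- unlist(strsplit(strings, " ", fixed = TRUE))
  words <- words[nchar(words) >= n]          # shorter words contain no n-gram
  as.character(unlist(lapply(words, function(w) {
    starts <- seq_len(nchar(w) - n + 1)
    substring(w, starts, starts + n - 1)
  })))
}

# Trigrams instead of bigrams
word_ngrams("stackover flow", n = 3)
#> [1] "sta" "tac" "ack" "cko" "kov" "ove" "ver" "flo" "low"

# One bigram frequency table per row of df, named by the text it came from
per_text <- lapply(df$text, function(s) table(word_ngrams(s, n = 2)))
names(per_text) <- df$text
per_text[["hy my name is"]]
```

Keeping one table per row makes it possible to compare texts, whereas a single call on the whole column pools all rows into one count.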

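However, if you are going to do a lot of text mining, you should look at dedicated R packages such as stringi, stringr, tm and quanteda, which help with these basic tasks.

For example, all the base R functions I wrote above can be replaced with the quanteda package, as shown here. Note that, as its output shows, this version also pairs characters across word and string boundaries (for example "ym", "ss" and "th"), so the result is not identical to the one above: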

library(quanteda)
char_ngrams(unlist(tokens(df$text, "character")), concatenator = "")
#>  [1] "hy" "ym" "my" "yn" "na" "am" "me" "ei" "is" "ss" "st" "ta" "ac" "ck" 
#> [15] "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow" "wi" "is" "sg" "gr" "re" "ea"
#> [29] "at" "th" "ho" "ow" "wa" "ar" "re" "ey" "yo" "ou"

Created on 2020-06-13 by the reprex package (v0.3.0)