Computing character frequencies from a vector in R

Question

5

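I have an e-book text file named Frankenstein.txt, and I want to know how many times each letter occurs in the novel.

My setup:

I imported the text file and got a character vector, character_array, like this: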

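# filesize must exist before readChar() is called; file.info() reports the size in bytes
filesize <- file.info("Frankenstein.txt")$size
string <- readChar("Frankenstein.txt", filesize)
character_array <- unlist(strsplit(string, ""))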

character_array gives me something like this:

 "F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" "\r", ...

My goal:

I want a count of how many times each character occurs in the text file. In other words, I want a count for each unique character in unique(character_array):

 [1] "F"  "r"  "a"  "n"  "k"  "e"  "s"  "t"  "i"  "\r" "\n" "b"  "y"  "M" 
 [15] " "  "W"  "o"  "l"  "c"  "f"  "("  "G"  "d"  "w"  ")"  "S"  "h"  "C" 
 [29] "O"  "N"  "T"  "E"  "L"  "1"  "2"  "3"  "4"  "p"  "5"  "6"  "7"  "8" 
 [43] "9"  "0"  "_"  "."  "v"  ","  "g"  "P"  "u"  "D"  "—"  "Y"  "j"  "m" 
 [57] "I"  "z"  "?"  ";"  "x"  "q"  "B"  "U"  "’"  "H"  "-"  "A"  "!"  ":" 
 [71] "R"  "J"  "“"  "”"  "æ"  "V"  "K"  "["  "]"  "‘"  "ê"  "ô"  "é"  "è"

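My attempt: when I call plot(as.factor(character_array)), I get a nice plot that shows me the result I want visually. However, I need the exact value for each character. I would like something like a 2D array: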

    [,1]   [,2] [,3] [,4] ... 
[1,] "a"    "A"  "b"  "B" ...
[2,] "1202" "50" "12" "9" ...

- Paul Trimor

4

Eventually you can use the table() function. - jogo

1

Try summary(as.factor(character_array)). - Rohit

Thanks @Rohit, that is exactly what I was looking for. Plain and simple, silly me haha. - Paul Trimor
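
A minimal sketch of what the two comments above suggest, run on a short made-up sample rather than the whole novel. The sample string and the names `counts` and `counts_matrix` are assumptions for illustration only. `table()` counts each distinct character, and `rbind()` arranges the characters and their counts in the two-row layout asked for in the question:

```r
# Hypothetical sample standing in for the full text of the novel
character_array <- unlist(strsplit("Frankenstein", ""))

# table() returns the number of occurrences of each distinct character
counts <- table(character_array)

# Arrange the characters and their counts as a 2-row character matrix
counts_matrix <- rbind(names(counts), as.vector(counts))

print(counts)
print(counts_matrix)
```

Because `rbind()` mixes characters and integers, the counts in `counts_matrix` are stored as strings; keep `counts` itself if you need them as numbers.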

2 Answers

0

Using gutenbergr, tidytext, and dplyr, you can do the following:

library(gutenbergr)
library(tidytext)
library(dplyr)

frank <- gutenberg_download(c(84), meta_fields = "title")

Removing unnecessary characters such as . [ ] etc.:

frank %>% 
  unnest_tokens(chars, text, "characters") %>% 
  group_by(chars) %>% 
  summarise(n = n()) %>% 
  t() #transpose to get in order of OP
      [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]    [,10]   [,11]   [,12]   [,13]   [,14]   [,15]   [,16]  
chars "0"     "1"     "2"     "3"     "4"     "5"     "6"     "7"     "8"     "9"     "a"     "b"     "c"     "d"     "e"     "f"    
n     "    2" "   35" "   15" "    6" "    4" "    4" "    3" "   16" "    5" "    4" "25733" " 4749" " 8644" "16327" "44210" " 8341"
      [,17]   [,18]   [,19]   [,20]   [,21]   [,22]   [,23]   [,24]   [,25]   [,26]   [,27]   [,28]   [,29]   [,30]   [,31]   [,32]  
chars "g"     "h"     "i"     "j"     "k"     "l"     "m"     "n"     "o"     "p"     "q"     "r"     "s"     "t"     "u"     "v"    
n     " 5564" "19194" "23483" "  413" " 1617" "12239" "10237" "23306" "23886" " 5672" "  313" "19647" "20380" "28835" " 9897" " 3717"
      [,33]   [,34]   [,35]   [,36]  
chars "w"     "x"     "y"     "z"    
n     " 7364" "  649" " 7578" "  239"

If you want to keep those characters, the code would look like this:

frank %>% 
  unnest_tokens(chars, text, stringr::str_split, pattern = "") %>% 
  group_by(chars) %>% 
  summarise(n = n()) %>% 
  t() #transpose to get in order of OP

      [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]    [,10]   [,11]   [,12]   [,13]   [,14]   [,15]   [,16]  
chars "'"     "-"     " "     "!"     "\""    "("     ")"     ","     "."     ":"     ";"     "?"     "["     "]"     "_"     "0"    
n     "  221" "  370" "71202" "  238" "  774" "   16" "   16" " 4945" " 2904" "   48" "  970" "  220" "    3" "    3" "    2" "    2"
      [,17]   [,18]   [,19]   [,20]   [,21]   [,22]   [,23]   [,24]   [,25]   [,26]   [,27]   [,28]   [,29]   [,30]   [,31]   [,32]  
chars "1"     "2"     "3"     "4"     "5"     "6"     "7"     "8"     "9"     "a"     "b"     "c"     "d"     "e"     "f"     "g"    
n     "   35" "   15" "    6" "    4" "    4" "    3" "   16" "    5" "    4" "25733" " 4749" " 8644" "16327" "44210" " 8341" " 5564"
      [,33]   [,34]   [,35]   [,36]   [,37]   [,38]   [,39]   [,40]   [,41]   [,42]   [,43]   [,44]   [,45]   [,46]   [,47]   [,48]  
chars "h"     "i"     "j"     "k"     "l"     "m"     "n"     "o"     "p"     "q"     "r"     "s"     "t"     "u"     "v"     "w"    
n     "19194" "23483" "  413" " 1617" "12239" "10237" "23306" "23886" " 5672" "  313" "19647" "20380" "28835" " 9897" " 3717" " 7364"
      [,49]   [,50]   [,51]  
chars "x"     "y"     "z"    
n     "  649" " 7578" "  239"

- phiver


- lefft · Accepted Answer

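The magrittr::%>% pipe is a nice way to build this kind of text-processing pipeline. Here is one approach, assuming your text is in "frank.txt" (see the bottom for an explanation of each step):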

library(magrittr)

# read the text in 
frank_txt <- readLines("frank.txt")

# then send the text down this pipeline:
frank_txt %>% 
  paste(collapse="") %>% 
  strsplit(split="") %>% unlist %>% 
  `[`(!. %in% c("", " ", ".", ",")) %>% 
  table %>% 
  barplot

Note that you can stop at table() and assign the result to a variable, which you can then manipulate however you like, e.g. by plotting it:

char_counts <- frank_txt %>% paste(collapse="") %>% 
  strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>%
  table

barplot(char_counts)

You can also convert the table to a data frame, which makes it easier to manipulate or plot later:

counts_df <- data.frame(
  char = names(char_counts), 
  count = as.numeric(char_counts), 
  stringsAsFactors=FALSE)

head(counts_df)
## char count
##   a    13
##   b     2
##   c     7
##   d     5
##   e    24
##   f     6
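
As a sketch of the kind of manipulation the data frame makes easy, the counts can be sorted by frequency and turned into shares of the total. The sample string and the `share` column are assumptions for illustration; with the real data, `char_counts` would come from the pipeline above:

```r
# Hypothetical sample counts standing in for the real char_counts table
char_counts <- table(unlist(strsplit("frankenstein", "")))

counts_df <- data.frame(
  char = names(char_counts),
  count = as.numeric(char_counts),
  stringsAsFactors = FALSE)

# Sort from most to least frequent
counts_df <- counts_df[order(counts_df$count, decreasing = TRUE), ]

# Add each character's share of the total
counts_df$share <- counts_df$count / sum(counts_df$count)

head(counts_df, 3)
```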

Explanation of each step: here is the full pipeline, with each step explained:

# going to send this text down a pipeline:
frank_txt %>% 
  # combine lines into a single string (makes things easier downstream)
  paste(collapse="") %>% 
  # tokenize by character (strsplit returns a list, so unlist it)
  strsplit(split="") %>% unlist %>% 
  # remove instances of characters you don't care about
  `[`(!. %in% c("", " ", ".", ",")) %>% 
  # make a frequency table of the characters
  table %>% 
  # then plot them
  barplot

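Note that this is exactly equivalent to the following monstrous code. The forward pipe %>% simply applies the function on its right to the value on its left (and . is a pronoun referring to the left-hand value; see the intro vignette).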

barplot(table(
  unlist(strsplit(paste(frank_txt, collapse=""), split=""))[
    !unlist(strsplit(paste(frank_txt, collapse=""), split="")) %in% 
      c(""," ",".",",")]))
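
The nested call has to repeat the `strsplit()` expression because the filter needs the character vector twice. A possible middle ground without the pipe is to store the intermediate vector once. This is only a sketch: the two sample lines and the name `chars` are assumptions for illustration, standing in for the real file contents.

```r
# Hypothetical two-line sample standing in for readLines("frank.txt")
frank_txt <- c("You will rejoice to hear", "that no disaster has accompanied")

# Combine the lines, then split into single characters
chars <- unlist(strsplit(paste(frank_txt, collapse = ""), split = ""))

# Drop the characters we do not care about
chars <- chars[!chars %in% c("", " ", ".", ",")]

# Count what is left; barplot(char_counts) would plot it
char_counts <- table(chars)
print(char_counts)
```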