Count the number of times each letter appears in a vector of strings

Question

3

I would like to produce a vector containing the total count of each of the 26 letters of the alphabet as they appear in the vector a.

a <- c("aabead", "dadfhhsa")

For example, in this vector a would equal 5, b would equal 1, d would equal 3, z would equal 0, x would equal 0, and so on.

- luke123

2 Answers

6

You can do it this way, using R's built-in letters vector.

 > sapply(letters, function(x) x<-sum(x==unlist(strsplit(a,""))))
a b c d e f g h i j k l m n o p q r s t u v w x y z 
5 1 0 3 1 1 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

- Diego Jimeno

- A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

You just need the functions table and strsplit, with a little help from unlist:

table(unlist(strsplit(a, ""), use.names=FALSE))
#
# a b d e f h s 
# 5 1 3 1 1 2 1

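strsplit "explodes" each string into its individual letters. It creates a list with one item for each string in the vector a.
Because the output of strsplit is a list, you need to unlist it before tabulating. use.names = FALSE just speeds up unlist.
table, as you have probably guessed by now, is what does the tabulating.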
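
To make the three steps concrete, here is a minimal sketch that runs them one at a time on the question's vector. The intermediate names pieces, chars and counts are only for illustration and are not part of the original answer.

```r
a <- c("aabead", "dadfhhsa")

# Step 1: split each string into single characters.
# The result is a list with one character vector per input string.
pieces <- strsplit(a, "")
length(pieces)   # 2, one element per string in a

# Step 2: flatten the list into one character vector.
chars <- unlist(pieces, use.names = FALSE)
length(chars)    # 14, i.e. 6 + 8 characters

# Step 3: count how often each distinct character occurs.
counts <- table(chars)
counts[["a"]]    # 5
```

Only letters that actually occur show up in counts, which is why the output above has no entries for c, g, z and so on.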

If you really want the zero values, you also need to throw a factor in there, with some help from the built-in letters constant:

table(factor(unlist(strsplit(a, ""), use.names=FALSE), levels=letters))
#
# a b c d e f g h i j k l m n o p q r s t u v w x y z 
# 5 1 0 3 1 1 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
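
The question asks for a vector rather than a table. As a sketch, assuming a plain named integer vector is what is wanted, the table can be converted as shown here; chars, f and counts_vec are illustrative names, not part of the original answer.

```r
a <- c("aabead", "dadfhhsa")
chars <- unlist(strsplit(a, ""), use.names = FALSE)

# Fixing the levels to all 26 letters is what produces the zero counts:
# table() reports every level, whether or not it occurs in the data.
f <- factor(chars, levels = letters)
counts <- table(f)

# Drop the table class and keep the letter names (assumed desired format).
counts_vec <- setNames(as.vector(counts), names(counts))
counts_vec[c("a", "d", "z")]
```

Any character that is not in letters (uppercase letters, digits, punctuation) becomes NA in the factor and is left out of the counts by default, so mixed-case input may need tolower(a) first.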

Update

When a problem requires iterating over a large number of values, it is important to think about how you approach it.

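For example, in the accepted answer, unlist(strsplit(...)) is called 26 times: once for each letter. If you split and unlist the values first and only then use sapply, you get a significant performance improvement. Compare the timings of fun1a and fun1b below.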

For reference, I have also benchmarked my factor-based solution and an alternative that uses tabulate. In the timings below, both are faster than either sapply version.

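library(stringi)
set.seed(1)
n <- 100000
# 100,000 random lowercase strings, each 1 to 10 characters long
a <- stri_rand_strings(n, sample(10, n, TRUE), "[a-z]")

# fun1a: the sapply approach as posted; splits and unlists once per letter
fun1a <- function() sapply(letters, function(x) x<-sum(x==unlist(strsplit(a,""))))
# fun1b: same idea, but splits and unlists only once, up front
fun1b <- function() {
  temp <- unlist(strsplit(a, ""))
  sapply(letters, function(x) {
    sum(x == temp)
  })
}
# fun2: table() on a factor whose levels are all 26 letters
fun2 <- function() table(factor(unlist(strsplit(a, "", TRUE), use.names=FALSE), levels=letters))
# fun3: tabulate() on the factor's integer codes, then name the counts
fun3 <- function() {
  `names<-`(tabulate(
    factor(unlist(strsplit(a, "", TRUE), use.names = FALSE),
           letters), nbins = 26), letters)
}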

library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), fun3(), times = 10)
# Unit: milliseconds
#     expr        min         lq       mean     median         uq        max neval
#  fun1a() 1025.45449 1177.90226 1189.49551 1190.11137 1238.66071 1352.05645    10
#  fun1b()  102.94881  114.08700  115.14852  115.87184  119.06776  124.64735    10
#   fun2()   53.46341   58.67832   67.50402   68.94933   70.71005   95.10771    10
#   fun3()   46.65357   49.79365   51.68536   51.55922   54.36390   57.07582    10
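
A benchmark is only meaningful if the candidates return the same counts. The following is a minimal sketch of such a check, run on the small vector from the question rather than on the random benchmark data so that it needs no extra packages. The names by_sapply, by_table and by_tabulate are illustrative, and setNames is used in place of `names<-` purely for readability.

```r
a <- c("aabead", "dadfhhsa")

# Split and flatten once, as fun1b, fun2 and fun3 do.
chars <- unlist(strsplit(a, "", fixed = TRUE), use.names = FALSE)

# One count per letter via an explicit comparison (the fun1b idea).
by_sapply <- sapply(letters, function(x) sum(x == chars))

# table() on a factor with all 26 levels (the fun2 idea).
by_table <- table(factor(chars, levels = letters))

# tabulate() counts the factor's integer codes 1..26 (the fun3 idea).
by_tabulate <- setNames(
  tabulate(factor(chars, levels = letters), nbins = 26),
  letters
)

# All three should agree letter by letter.
stopifnot(
  all(by_sapply == as.vector(by_table)),
  all(by_sapply == by_tabulate),
  identical(names(by_sapply), names(by_tabulate))
)
```

tabulate works here because a factor stores its values as integer codes that index into its levels, so counting the codes 1 to 26 is the same as counting the letters a to z.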