R: summarise character counts

3
library(data.table)
DATA = data.table(STUDENT = c(1,2,3,4),
    DOG_1 = c("a","e","a","c"),
    DOG_2 = c("a","e","d","b"),
    DOG_3 = c("a","d","b","c"),
    CAT_1 = c("c","a","d","c"),
    CAT_2 = c("c","d","a","b"),
    MOUSE_1 = c("d","b","e","b"),
    MOUSE_2 = c("c","a","b","e"),
    MOUSE_3 = c("a","b","b","e"),
    MOUSE_4 = c("b","c","a","d"))

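This is what my data looks like. I would like to get a new data table with one row per STUDENT and one column each for DOG, CAT and MOUSE (the original post showed the desired output as an image, which is not reproduced here),

where 'a' equals 1; 'b' equals 2; 'c' equals 3; 'd' equals 4; 'e' equals 5. For example, to get the value 3 for STUDENT 1's DOG, convert each DOG letter to its number and add them up: a + a + a = 1 + 1 + 1 = 3.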
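
To make the arithmetic concrete, here is a minimal base R sketch of the conversion for a single cell of the desired output. The names keyval, dog_student1 and dog_total are illustrative only and are not part of the question.

```r
# Lookup table as a named integer vector: a = 1, b = 2, ..., e = 5
keyval <- setNames(1:5, letters[1:5])

# STUDENT 1's three DOG answers (DOG_1, DOG_2, DOG_3), taken from DATA above
dog_student1 <- c("a", "a", "a")

# Index the lookup by the letters, then add up the resulting numbers
dog_total <- sum(keyval[dog_student1])
print(dog_total)  # 3
```

The same lookup applied to STUDENT 2's DOG answers ("e", "e", "d") gives 5 + 5 + 4 = 14.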

3 Answers

2
For the sake of completeness, here are two approaches that use the new measure() function in the call to melt() (available with data.table version 1.14.1).

1. Melt, update join with a lookup table created on the fly, reshape

melt(DATA, measure.vars = measure(animal, rn, pattern = "(\\w+)_(\\d)"), value.name = "code")[
  .(code = letters[1:5], value = 1:5), on = "code", value := i.value][
    , dcast(.SD, STUDENT ~ animal, sum, value.var = "value")]
   STUDENT CAT DOG MOUSE
1:       1   6   3    10
2:       2   5  14     8
3:       3   5   7    10
4:       4   5   8    16

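2. Melt and sum the factor levels

When the letters a to e are converted to a factor, the corresponding factor levels get the integer codes 1 to 5.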

library(magrittr) # piping used to improve readability
melt(DATA, measure.vars = measure(value.name, rn, pattern = "(\\w+)_(\\d)"))[, rn := NULL][
  , lapply(.SD, \(x) factor(x, levels = letters[1:5]) %>% as.integer() %>% sum(na.rm = TRUE)), 
  by = STUDENT]
   STUDENT DOG CAT MOUSE
1:       1   3   6    10
2:       2  14   5     8
3:       3   7   5    10
4:       4   8   5    16
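
The factor trick in approach 2 depends on passing levels = letters[1:5] explicitly. The following base R sketch (the variable names are made up for illustration) shows the difference between fixed and default levels:

```r
x <- c("c", "e", "a")

# With the full a-e level set, each letter maps to its alphabet position
codes_fixed <- as.integer(factor(x, levels = letters[1:5]))
print(codes_fixed)    # 3 5 1

# With default levels, only the letters present are used (sorted: a, c, e),
# so the codes no longer correspond to a = 1, ..., e = 5
codes_default <- as.integer(factor(x))
print(codes_default)  # 2 3 1
```

Without the explicit levels, a column that happens to lack one of the letters would be summed with the wrong weights.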

2
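If we want a data.table solution, convert 'DATA' to 'long' format (with melt) by specifying patterns derived from the column names. Then, grouped by 'STUDENT', loop over the columns specified in .SDcols, replace the letters with their integer values by matching against a named vector ('keyval'), and take the sum.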
library(data.table)
nm1 <- unique(sub("_\\d+$", "", names(DATA)[-1]))      
dt1 <- melt(DATA, id.var = 'STUDENT', 
   measure = patterns(nm1), value.name = nm1)
keyval <- setNames(1:5, letters[1:5])

dt1[, lapply(.SD, function(x) sum(keyval[x], 
           na.rm = TRUE)), by = STUDENT, .SDcols = nm1]

-output

#   STUDENT DOG CAT MOUSE
#1:       1   3   6    10
#2:       2  14   5     8
#3:       3   7   5    10
#4:       4   8   5    16
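
The na.rm = TRUE in the call above is presumably there because the animals have different numbers of columns (3 for DOG, 2 for CAT, 4 for MOUSE), so the long table has to be padded with NA where an animal has no further column. A small base R sketch of what happens to such a padded value during the lookup, with the padded vector written out by hand as an assumption:

```r
keyval <- setNames(1:5, letters[1:5])

# STUDENT 1's CAT answers plus one NA slot, standing in for the padding
cat_student1 <- c("c", "c", NA)

# Looking up NA in a named vector yields NA
looked_up <- keyval[cat_student1]

total_with_na <- sum(looked_up)                   # NA
total_without_na <- sum(looked_up, na.rm = TRUE)  # 6
print(total_without_na)
```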

There is a similar option in the tidyverse (it reuses nm1 and keyval from above).
library(dplyr)
library(tidyr)
DATA %>% 
    pivot_longer(cols = -STUDENT, names_to = c('.value', 'grp'), 
          names_sep='_') %>%
    group_by(STUDENT) %>% 
    summarise(across(all_of(nm1), ~ sum(keyval[.], na.rm = TRUE)))
# A tibble: 4 x 4
#  STUDENT   DOG   CAT MOUSE
#    <dbl> <int> <int> <int>
#1       1     3     6    10
#2       2    14     5     8
#3       3     7     5    10
#4       4     8     5    16

1

Another data.table option using melt + dcast.

dcast(
  melt(DATA, id.var = "STUDENT")[
    ,
    c("variable", "value") := .(gsub("_.*", "", variable),
      value = setNames(1:5, c("a", "b", "c", "d", "e"))[value]
    )
  ], STUDENT ~ variable, sum
)

gives

   STUDENT CAT DOG MOUSE
1:       1   6   3    10
2:       2   5  14     8
3:       3   5   7    10
4:       4   5   8    16
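
For comparison, the same three steps (strip the numeric suffix, map letters to numbers, sum per student and animal) can be mirrored without any packages. This is only a base R sketch of the idea, not the data.table code from this answer; the names keyval, animal, vals, totals and result are illustrative.

```r
DATA <- data.frame(
  STUDENT = c(1, 2, 3, 4),
  DOG_1 = c("a", "e", "a", "c"), DOG_2 = c("a", "e", "d", "b"),
  DOG_3 = c("a", "d", "b", "c"),
  CAT_1 = c("c", "a", "d", "c"), CAT_2 = c("c", "d", "a", "b"),
  MOUSE_1 = c("d", "b", "e", "b"), MOUSE_2 = c("c", "a", "b", "e"),
  MOUSE_3 = c("a", "b", "b", "e"), MOUSE_4 = c("b", "c", "a", "d"),
  stringsAsFactors = FALSE
)

# Letter-to-number lookup: a = 1, ..., e = 5
keyval <- setNames(1:5, letters[1:5])

# Animal name of each answer column: "DOG_1" -> "DOG", and so on
animal <- sub("_.*", "", names(DATA)[-1])

# Convert all letters at once; rebuild a students x columns integer matrix
vals <- matrix(keyval[unlist(DATA[-1], use.names = FALSE)], nrow = nrow(DATA))

# rowsum() sums rows by group, so transpose to group the answer columns
totals <- t(rowsum(t(vals), group = animal))

result <- data.frame(STUDENT = DATA$STUDENT, totals)
print(result)
```

As with dcast, the animal columns come out in alphabetical order (CAT, DOG, MOUSE).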
