Count column values by row, avoiding melt/gather

3

I am working with a dataframe like this one:

   idno      08:00      08:05      08:10    08:15    08:20    08:25
1     1   Domestic   Domestic   Domestic Domestic Domestic Domestic
2     2    Leisure    Leisure    Leisure  Leisure  Leisure  Leisure
3     3        Eat        Eat        Eat      Eat      Eat      Eat
4     4       Paid       Paid       Paid     Paid     Paid     Paid
5     5      Sleep      Sleep      Sleep    Sleep    Sleep    Sleep
6     6        Eat        Eat        Eat  Missing  Missing  Missing
7     7      Sleep      Sleep      Sleep    Sleep    Sleep    Sleep
8     8       Paid       Paid       Paid     Paid     Paid     Paid
9     9      Sleep      Sleep      Sleep    Sleep    Sleep    Sleep
10   10 Child Care Child Care Child Care   Travel   Travel   Travel

What I want is to summarise this dataframe into the following form:
Desired output
       idno `Child Care` Domestic   Eat Leisure Missing  Paid Sleep Travel
*  <int>        <dbl>    <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>
1      1            0        6     0       0       0     0     0      0
2      2            0        0     0       6       0     0     0      0
3      3            0        0     6       0       0     0     0      0
4      4            0        0     0       0       0     6     0      0
5      5            0        0     0       0       0     0     6      0
6      6            0        0     3       0       3     0     0      0
7      7            0        0     0       0       0     0     6      0
8      8            0        0     0       0       0     6     0      0
9      9            0        0     0       0       0     0     6      0
10    10            3        0     0       0       0     0     0      3

What I usually do is this:

library(reshape2); library(dplyr); library(tidyr)  # assumed sources of melt(), count(), spread()
melt(df, id.vars = 'idno') %>% count(idno, value) %>% spread(value, n, 0)

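However, I wonder whether there is a more direct way to do this. My problem is that I am working with a very large dataset, and calling melt, then count, then spread can be a bit slow.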

Is there a direct way to count the column values for each row (the distribution of the variable), preferably using data.table?

setDT(df)[,.N,by=] # 

Can the columns be counted per row with something like "by"?
df = structure(list(idno = 1:10, `08:00` = c("Domestic", "Leisure", 
"Eat", "Paid", "Sleep", "Eat", "Sleep", "Paid", "Sleep", "Child Care"
), `08:05` = c("Domestic", "Leisure", "Eat", "Paid", "Sleep", 
"Eat", "Sleep", "Paid", "Sleep", "Child Care"), `08:10` = c("Domestic", 
"Leisure", "Eat", "Paid", "Sleep", "Eat", "Sleep", "Paid", "Sleep", 
"Child Care"), `08:15` = c("Domestic", "Leisure", "Eat", "Paid", 
"Sleep", "Missing", "Sleep", "Paid", "Sleep", "Travel"), `08:20` =    c("Domestic", 
"Leisure", "Eat", "Paid", "Sleep", "Missing", "Sleep", "Paid", 
"Sleep", "Travel"), `08:25` = c("Domestic", "Leisure", "Eat", 
"Paid", "Sleep", "Missing", "Sleep", "Paid", "Sleep", "Travel"
)), .Names = c("idno", "08:00", "08:05", "08:10", "08:15", "08:20", 
"08:25"), row.names = c(NA, 10L), class = "data.frame")

1
library(data.table); dcast(melt(setDT(df), id = 1), idno ~ value) - Jaap
That is still an interesting solution, thanks @Jaap. - giac
1
An alternative (based on the duplicate target): xtabs(~ idno + value, data.table::melt(df, id = 1)) - Jaap
1
Another option: library(dplyr); library(tidyr); df %>% gather(key, value, -1) %>% group_by(idno, value) %>% tally %>% spread(key = value, value = n, fill = 0) - Jaap
1
Thanks for all the alternatives. They keep me on my toes! Thanks. - giac
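
The xtabs idea from the comments can also be sketched in base R alone, which may help when neither data.table nor reshape2 is available. This is only an illustration under stated assumptions: df is rebuilt here as a plain data.frame from the question's data, and the helper names acts, late, long and tab are made up for this example. The columns are still flattened into long form (via unlist), so this is not guaranteed to be faster on large data.

```r
# Rebuild the question's data as a plain data.frame (not a data.table)
acts <- c("Domestic", "Leisure", "Eat", "Paid", "Sleep",
          "Eat", "Sleep", "Paid", "Sleep", "Child Care")
late <- c("Domestic", "Leisure", "Eat", "Paid", "Sleep",
          "Missing", "Sleep", "Paid", "Sleep", "Travel")
df <- data.frame(idno = 1:10,
                 `08:00` = acts, `08:05` = acts, `08:10` = acts,
                 `08:15` = late, `08:20` = late, `08:25` = late,
                 check.names = FALSE, stringsAsFactors = FALSE)

# unlist() stacks the time columns one after another, so idno has to be
# repeated once per time column to stay aligned with the stacked values
long <- data.frame(idno  = rep(df$idno, times = ncol(df) - 1),
                   value = unlist(df[-1], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Cross-tabulate id against activity: one row per idno, one column per activity
tab <- xtabs(~ idno + value, data = long)

# Optional: convert the contingency table to a wide data.frame
wide <- as.data.frame.matrix(tab)
```

Note that df[-1] drops the first column only for a data.frame; after setDT(df) the same expression would drop the first row instead, so convert back with as.data.frame() first in that case.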
1 Answer

4
You can try mtabulate from qdapTools:
library(qdapTools)

mtabulate(split(df[-1], seq(nrow(df))))

#   Child Care Domestic Eat Leisure Missing Paid Sleep Travel
#1           0        6   0       0       0    0     0      0
#2           0        0   0       6       0    0     0      0
#3           0        0   6       0       0    0     0      0
#4           0        0   0       0       0    6     0      0
#5           0        0   0       0       0    0     6      0
#6           0        0   3       0       3    0     0      0
#7           0        0   0       0       0    0     6      0
#8           0        0   0       0       0    6     0      0
#9           0        0   0       0       0    0     6      0
#10          3        0   0       0       0    0     0      3
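
If installing qdapTools is not an option, a similar table can be sketched in base R by comparing the whole block of time columns against each activity and summing by row, with no reshaping to long form. This is a rough sketch rather than a benchmarked solution: df is rebuilt here as a plain data.frame from the question's data, and the names acts, late, m, vals, counts and res are assumptions made for this example.

```r
# Rebuild the question's data as a plain data.frame (not a data.table)
acts <- c("Domestic", "Leisure", "Eat", "Paid", "Sleep",
          "Eat", "Sleep", "Paid", "Sleep", "Child Care")
late <- c("Domestic", "Leisure", "Eat", "Paid", "Sleep",
          "Missing", "Sleep", "Paid", "Sleep", "Travel")
df <- data.frame(idno = 1:10,
                 `08:00` = acts, `08:05` = acts, `08:10` = acts,
                 `08:15` = late, `08:20` = late, `08:25` = late,
                 check.names = FALSE, stringsAsFactors = FALSE)

# Character matrix holding only the time columns (idno dropped)
m <- as.matrix(df[-1])

# Every distinct activity, sorted so the column order is predictable
vals <- sort(unique(as.vector(m)))

# For each activity, count how many time slots in each row match it
counts <- vapply(vals, function(v) rowSums(m == v), numeric(nrow(m)))

# Attach the ids; check.names = FALSE keeps "Child Care" with its space
res <- data.frame(idno = df$idno, counts, check.names = FALSE)
```

This loops over the distinct activities rather than over rows, so it should suit data with many rows and few categories. If a cell can be NA, rowSums() would need na.rm = TRUE.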

Very interesting, thank you. - giac

Content provided by Stack Overflow.