Group by a column and compute the mean and standard deviation of every other column in R

Question

3

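How can I group by one column in R and then compute the mean and standard deviation of every other column?

Take the well-known iris data set as an example. I would like to group by species and then compute the mean and standard deviation of the petal and sepal length and width measurements. I know this has to do with split-apply-combine, but I am not sure how to proceed.

My attempt: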

require(plyr)

x <- ddply(iris, .(Species), summarise,
    Sepal.Length.Mean = mean(Sepal.Length),
    Sepal.Length.Sd = sd(Sepal.Length),
    Sepal.Width.Mean = mean(Sepal.Width),
    Sepal.Width.Sd = sd(Sepal.Width),
    Petal.Length.Mean = mean(Petal.Length),
    Petal.Length.Sd = sd(Petal.Length),
    Petal.Width.Mean = mean(Petal.Width),
    Petal.Width.Sd = sd(Petal.Width))

     Species Sepal.Length.Mean Sepal.Length.Sd Sepal.Width.Mean Sepal.Width.Sd
1     setosa             5.006       0.3524897            3.428      0.3790644
2 versicolor             5.936       0.5161711            2.770      0.3137983
3  virginica             6.588       0.6358796            2.974      0.3224966
  Petal.Length.Mean Petal.Length.Sd Petal.Width.Mean Petal.Width.Sd
1             1.462       0.1736640            0.246      0.1053856
2             4.260       0.4699110            1.326      0.1977527
3             5.552       0.5518947            2.026      0.2746501

Desired output:

z <- data.frame(setosa = c(5.006, 0.3524897, 3.428, 0.3790644, 1.462, 0.1736640, 0.246, 0.1053856),
                versicolor = c(5.936, 0.5161711, 2.770, 0.3137983, 4.260, 0.4699110, 1.326, 0.1977527),
                virginica = c(6.588, 0.6358796, 2.974, 0.3224966, 5.552, 0.5518947, 2.026, 0.2746501))
rownames(z) <- c('Sepal.Length.Mean', 'Sepal.Length.Sd', 'Sepal.Width.Mean', 'Sepal.Width.Sd',
                 'Petal.Length.Mean', 'Petal.Length.Sd', 'Petal.Width.Mean', 'Petal.Width.Sd')

                     setosa versicolor virginica
Sepal.Length.Mean 5.0060000  5.9360000 6.5880000
Sepal.Length.Sd   0.3524897  0.5161711 0.6358796
Sepal.Width.Mean  3.4280000  2.7700000 2.9740000
Sepal.Width.Sd    0.3790644  0.3137983 0.3224966
Petal.Length.Mean 1.4620000  4.2600000 5.5520000
Petal.Length.Sd   0.1736640  0.4699110 0.5518947
Petal.Width.Mean  0.2460000  1.3260000 2.0260000
Petal.Width.Sd    0.1053856  0.1977527 0.2746501

- I Like to Code

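I think the value in the cell "setosa"/"Sepal.Length.Mean" should be 5.006, not the 0.5006 shown in the "desired output" (it looks like a typo). If nobody objects, I will edit the question to fix this. - R Yoda

4 Answers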

3

Here is the traditional plyr approach. It uses colwise to compute the summary statistics on all columns.

means <- ddply(iris, .(Species), colwise(mean))
sds <- ddply(iris, .(Species), colwise(sd))
merge(means, sds, by = "Species", suffixes = c(".mean", ".sd"))
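
For readers without plyr, the same idea (compute the means, compute the standard deviations, merge on Species) can be sketched in base R. This sketch swaps in aggregate() for ddply/colwise, so it is a substitute technique rather than the method used in the answer above. The names wide and out are illustrative. The final t() step turns the one-row-per-species table into the one-column-per-species layout asked for in the question.

```r
# Base R sketch: aggregate() stands in for ddply(..., colwise(...))
means <- aggregate(. ~ Species, data = iris, FUN = mean)
sds   <- aggregate(. ~ Species, data = iris, FUN = sd)

# One row per species; clashing column names get the suffixes .mean / .sd
wide <- merge(means, sds, by = "Species", suffixes = c(".mean", ".sd"))

# Transpose the numeric columns so that each species becomes a column
out <- t(wide[-1])
colnames(out) <- as.character(wide$Species)
print(out)
```

The result is a numeric matrix whose row names are the statistic labels. Wrap it in as.data.frame() if a data frame is needed.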

- Richie Cotton

1

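If you want to use data.table for performance reasons, you could try this (don't be scared: there are more comments than code ;-) I have tried to optimize all the performance-critical points.)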

library(data.table)
dt <- as.data.table(iris)

# Helper function similar to "colwise" of package "plyr":
# Apply a function "func" to each column of the data.table "data"
# and append the "suffix" string to the result column name.
colwise.dt <- function( data, func, suffix )
{
  result <- lapply(data, func)                                      # apply the function to each column of the data table
  setDT(result)                                                     # convert the result list into a data table efficiently ("by ref")
  setnames(result, names(result), paste0(names(result), suffix))    # append suffix to each column name efficiently ("by ref"). "setnames" requires a data.table
}

wide.result <- dt[, c(colwise.dt(.SD, mean, ".mean"), colwise.dt(.SD, sd, ".sd")), by=.(Species)]
# Note: .SD is a data.table containing the subset of dt's data for each group (Species), excluding any columns used in "by" (here: Species column)

# Now transpose the result
long.result <- melt(wide.result, id.vars="Species")

# Now transform into one column per group
final.result <- dcast(long.result, variable ~ Species)

wide.result is:

      Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean Sepal.Length.sd Sepal.Width.sd Petal.Length.sd Petal.Width.sd
1:     setosa             5.006            3.428             1.462            0.246       0.3524897      0.3790644       0.1736640      0.1053856
2: versicolor             5.936            2.770             4.260            1.326       0.5161711      0.3137983       0.4699110      0.1977527
3:  virginica             6.588            2.974             5.552            2.026       0.6358796      0.3224966       0.5518947      0.2746501

long.result is:

       Species          variable     value
 1:     setosa Sepal.Length.mean 5.0060000
 2: versicolor Sepal.Length.mean 5.9360000
 3:  virginica Sepal.Length.mean 6.5880000
 4:     setosa  Sepal.Width.mean 3.4280000
 5: versicolor  Sepal.Width.mean 2.7700000
 6:  virginica  Sepal.Width.mean 2.9740000
 7:     setosa Petal.Length.mean 1.4620000
 8: versicolor Petal.Length.mean 4.2600000
 9:  virginica Petal.Length.mean 5.5520000
10:     setosa  Petal.Width.mean 0.2460000
11: versicolor  Petal.Width.mean 1.3260000
12:  virginica  Petal.Width.mean 2.0260000
13:     setosa   Sepal.Length.sd 0.3524897
14: versicolor   Sepal.Length.sd 0.5161711
15:  virginica   Sepal.Length.sd 0.6358796
16:     setosa    Sepal.Width.sd 0.3790644
17: versicolor    Sepal.Width.sd 0.3137983
18:  virginica    Sepal.Width.sd 0.3224966
19:     setosa   Petal.Length.sd 0.1736640
20: versicolor   Petal.Length.sd 0.4699110
21:  virginica   Petal.Length.sd 0.5518947
22:     setosa    Petal.Width.sd 0.1053856
23: versicolor    Petal.Width.sd 0.1977527
24:  virginica    Petal.Width.sd 0.2746501

final.result is:

            variable    setosa versicolor virginica
1: Sepal.Length.mean 5.0060000  5.9360000 6.5880000
2:  Sepal.Width.mean 3.4280000  2.7700000 2.9740000
3: Petal.Length.mean 1.4620000  4.2600000 5.5520000
4:  Petal.Width.mean 0.2460000  1.3260000 2.0260000
5:   Sepal.Length.sd 0.3524897  0.5161711 0.6358796
6:    Sepal.Width.sd 0.3790644  0.3137983 0.3224966
7:   Petal.Length.sd 0.1736640  0.4699110 0.5518947
8:    Petal.Width.sd 0.1053856  0.1977527 0.2746501

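The only difference from your desired output is that final.result holds the value names in a first column called variable instead of in the row names. This can be fixed by setting the row names from the first column and then deleting that column...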
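
The row-name step described above can be sketched in base R as follows. To stay self-contained, the block uses a stand-in final.result that holds only two rows copied from the output above, and res is an illustrative name. With the real data.table, the as.data.frame() call matters because a data.table does not keep row names.

```r
# Stand-in for final.result: two rows copied from the printed output above
final.result <- data.frame(
  variable   = c("Sepal.Length.mean", "Sepal.Length.sd"),
  setosa     = c(5.006, 0.3524897),
  versicolor = c(5.936, 0.5161711),
  virginica  = c(6.588, 0.6358796)
)

# Convert to a plain data.frame (this also drops the data.table class, if present)
res <- as.data.frame(final.result)

# Move the labels from the first column into the row names, then drop that column
rownames(res) <- as.character(res$variable)
res$variable <- NULL
print(res)
```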

- R Yoda

1

Inspired by the answers, I came up with a solution that uses only dplyr and tidyr functions.

require(tidyr)
require(dplyr)

x <- iris %>%
    gather(var, value, -Species)
print(tbl_df(x))

# Compute the mean and sd for each dimension
x <- x %>%
    group_by(Species, var) %>%
    summarise(mean = mean(value), sd = sd(value)) %>%
    ungroup
print(tbl_df(x))

# Convert the data frame from wide form to long form
x <- x %>%
    gather(stat, value, mean:sd)
print(tbl_df(x))

# Combine the variables "var" and "stat" into a single variable
x <- x %>%
    unite(var, var, stat, sep = '.')
print(tbl_df(x))

# Convert the data frame from long form to wide form
x <- x %>%
    spread(Species, value)
print(tbl_df(x))

- I Like to Code

- akrun · Accepted Answer

We can try using dplyr.

library(dplyr)
res <- iris %>% 
         group_by(Species) %>% 
         summarise_each(funs(mean, sd))
`colnames<-`(t(res[-1]), as.character(res$Species))
#                     setosa versicolor virginica
#Sepal.Length_mean 5.0060000  5.9360000 6.5880000
#Sepal.Width_mean  3.4280000  2.7700000 2.9740000
#Petal.Length_mean 1.4620000  4.2600000 5.5520000
#Petal.Width_mean  0.2460000  1.3260000 2.0260000
#Sepal.Length_sd   0.3524897  0.5161711 0.6358796
#Sepal.Width_sd    0.3790644  0.3137983 0.3224966
#Petal.Length_sd   0.1736640  0.4699110 0.5518947
#Petal.Width_sd    0.1053856  0.1977527 0.2746501

As @Steven Beaupre mentioned in the comments, the output can also be obtained by reshaping with spread.

library(tidyr)
iris %>% 
   group_by(Species) %>% 
   summarise_each(funs(mean, sd)) %>% 
   gather(key, value, -Species) %>% 
   spread(Species, value)
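
In current dplyr releases, summarise_each() and funs() are deprecated, and tidyr now treats gather() and spread() as superseded. A minimal sketch of the same pipeline with the newer verbs follows, assuming dplyr 1.0.0 or later (for across()) and tidyr 1.0.0 or later (for pivot_longer() and pivot_wider()). The column names key and value are carried over from the code above, and res is an illustrative name.

```r
library(dplyr)
library(tidyr)

res <- iris %>%
  group_by(Species) %>%
  # across() applies each function in the named list to every non-grouping
  # column; result columns are named "<column>_<function>" by default
  summarise(across(everything(), list(mean = mean, sd = sd))) %>%
  # long form: one row per (Species, statistic) pair
  pivot_longer(-Species, names_to = "key", values_to = "value") %>%
  # wide form: one column per species
  pivot_wider(names_from = Species, values_from = value)

print(res)
```

The result is a tibble with the statistic labels in the key column, so the row-name conversion still applies if a plain data frame with row names is wanted.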