Sum rows in data.frame or matrix

Question

106

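I have a very large data frame with rows as observations and columns as genetic markers. Using R, I would like to create a new column that contains, for each observation, the sum of a selected set of columns.

If I have 200 columns and 100 rows, I would like to create a new column containing the sum of columns 43 through 167. Those columns contain only 1 or 0. With a new column holding each row's sum, I will be able to sort the individuals by who has the most genetic markers.

I feel it should be something close to: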

data$new=sum(data$[,43:167])

- user483502

7 Answers

49

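The rowSums function (as Greg mentions) will do what you want, but you are mixing subsetting techniques in your code: do not use "$" together with "[]". Your code should look more like: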

data$new <- rowSums( data[,43:167] )
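As a minimal sketch of the whole workflow from the question, assuming a small made-up data frame (the toy data, the column range 2:5 standing in for 43:167, and the name sorted are all illustrative, not from the original post):

```r
# Hypothetical toy data: 5 individuals, 6 binary marker columns (V1..V6)
set.seed(1)
data <- as.data.frame(matrix(rbinom(30, 1, 0.5), nrow = 5))

# Sum columns 2 through 5 for each row (stand-in for 43:167)
data$new <- rowSums(data[, 2:5])

# Sort individuals so that those with the most markers come first
sorted <- data[order(data$new, decreasing = TRUE), ]
sorted
```

The subsetting uses only "[]" on the right-hand side; "$" appears only to name the new column.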

If you want to use a function other than sum, look at ?apply for applying general functions across rows or columns.
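For instance, here is a small sketch of apply on a toy matrix (the matrix m and the result names are made up for illustration). The second argument selects the margin: 1 for rows, 2 for columns.

```r
# Toy matrix: columns are 1:4, 5:8, 9:12
m <- matrix(1:12, ncol = 3)

row_max  <- apply(m, 1, max)   # 9 10 11 12
row_mean <- apply(m, 1, mean)  # 5 6 7 8
col_sum  <- apply(m, 2, sum)   # 10 26 42
```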

- Greg Snow

I'm not sure why I got this error: Error in rowSums(incomeData) : 'x' must be numeric - munmunbb

1

@munmunbb, you get that error because incomeData is not numeric. Use something like str(incomeData) to see what it is, then possibly convert it to a numeric matrix. - Greg Snow
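A sketch of that diagnosis, assuming a made-up incomeData (the real one from the comment is not shown, so the columns here are hypothetical):

```r
# Hypothetical data frame standing in for incomeData
incomeData <- data.frame(
  id    = c("a", "b", "c"),
  wages = c(100, 200, 300),
  bonus = c("10", "20", "30"),  # numbers stored as text
  stringsAsFactors = FALSE
)

# Inspect the column types; rowSums(incomeData) would fail here
# because id and bonus are character columns
str(incomeData)

# Convert the text column, then sum only the numeric columns
incomeData$bonus <- as.numeric(incomeData$bonus)
numeric_cols <- sapply(incomeData, is.numeric)
incomeData$total <- rowSums(incomeData[, numeric_cols])
```

Selecting columns with is.numeric leaves identifier columns such as id out of the sum instead of forcing them to numbers.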

11

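I came here hoping to find a way to get the sum across all columns in a data table and ran into issues implementing the solution above. A way to add a column with the sum across all columns uses the cbind function: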

cbind(data, total = rowSums(data))

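This adds a total column to the data and avoids the alignment issue that arises when using the solution above to sum across all columns (see the post below for a discussion of that issue).

Adding a new column to matrix error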

- seeiespi

1

See also [dplyr::mutate_all](https://dplyr.tidyverse.org/reference/summarise_all.html). - Paul Rougieux

6

For completeness, I will list other methods not mentioned here: different ways of doing the same thing with dplyr syntax, starting from a matrix:

mat = matrix(1:12, ncol = 3)

library(dplyr)

mat %>% as_tibble() %>% 
   mutate(sum = rowSums(across(where(is.numeric))))

# A tibble: 4 x 4
     V1    V2    V3   sum
  <int> <int> <int> <dbl>
1     1     5     9    15
2     2     6    10    18
3     3     7    11    21
4     4     8    12    24

Or with c_across:

mat %>% as_tibble() %>%
  rowwise() %>% 
  mutate(sumrange = sum(c_across(everything()), na.rm = TRUE))

Or select specific columns by column name:

mat %>% as_tibble() %>%
  mutate( 'B1' = V1, B2 = V2) %>% 
  rowwise() %>% 
  mutate(sum_startswithB = 
           sum(c_across(starts_with("B")), na.rm = TRUE))

     V1    V2    V3    B1    B2 sum_startswithB
  <int> <int> <int> <int> <int>           <int>
1     1     5     9     1     5               6
2     2     6    10     2     6               8
3     3     7    11     3     7              10
4     4     8    12     4     8              12

In this case, summing by column index, from the first to the fourth column:

mat %>% as_tibble() %>%
  mutate( 'B1' = V1, B2 = V2) %>%
  rowwise() %>% 
  mutate(SumByIndex = sum(c_across(c(1:4)), na.rm = TRUE))

     V1    V2    V3    B1    B2 SumByIndex
  <int> <int> <int> <int> <int>      <int>
1     1     5     9     1     5         16
2     2     6    10     2     6         20
3     3     7    11     3     7         24
4     4     8    12     4     8         28

Using regular expressions:

mat %>% as_tibble() %>%
  mutate( 'B1' = V1, B2 = V2) %>%
  mutate(sum_V = rowSums(.[grep("V[2-3]", names(.))], na.rm = TRUE),
  sum_B = rowSums(.[grep("B", names(.))], na.rm = TRUE))

     V1    V2    V3    B1    B2 sum_V sum_B
  <int> <int> <int> <int> <int> <dbl> <dbl>
1     1     5     9     1     5    14     6
2     2     6    10     2     6    16     8
3     3     7    11     3     7    18    10
4     4     8    12     4     8    20    12

Using the apply function is handier, because you can choose among sum, mean, max, min, variance, and standard deviation across columns.

mat %>% as_tibble() %>%
  mutate( 'B1' = V1, B2 = V2) %>%
  mutate(sum = select(., V1:B1) %>% apply(1, sum, na.rm=TRUE)) %>%
  mutate(mean = select(., V1:B1) %>% apply(1, mean, na.rm=TRUE)) %>%
  mutate(max = select(., V1:B1) %>% apply(1, max, na.rm=TRUE)) %>%
  mutate(min = select(., V1:B1) %>% apply(1, min, na.rm=TRUE)) %>%  
  mutate(var = select(., V1:B1) %>% apply(1, var, na.rm=TRUE)) %>%
  mutate(sd = select(., V1:B1) %>% apply(1, sd, na.rm=TRUE))

     V1    V2    V3    B1    B2   sum  mean   max   min   var    sd
  <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <dbl>
1     1     5     9     1     5    16     4     9     1  14.7  3.83
2     2     6    10     2     6    20     5    10     2  14.7  3.83
3     3     7    11     3     7    24     6    11     3  14.7  3.83
4     4     8    12     4     8    28     7    12     4  14.7  3.83

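Note: the var and sd values being identical in every row is not an error. The data is generated linearly (1:12), so every row has the same spread; you can verify this by computing the values for the first two rows: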

> sd(c(1,5,9,1))
[1] 3.829708
> sd(c(2,6,10,2))
[1] 3.829708

- rubengavidia0x

You might consider updating this... as of [dplyr 1.1.0](https://dplyr.tidyverse.org/news/index.html#dplyr-110), using `across` this way is deprecated. They introduced `pick` to do tidy selection and return a tibble. - LMc
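A minimal sketch of what that update might look like, assuming dplyr 1.1.0 or later is installed (the tibble df and the name res are made up for illustration):

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where pick() was introduced

# Toy tibble equivalent to the 4 x 3 matrix used in the answer
df <- tibble(V1 = 1:4, V2 = 5:8, V3 = 9:12)

# pick() returns the selected columns as a tibble,
# which rowSums() can consume directly
res <- df %>%
  mutate(sum = rowSums(pick(where(is.numeric))))

res$sum  # 15 18 21 24
```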

3

I will try to back this up with the elapsed time of each method, using an example:

mat = matrix(runif(4e6), ncol = 50)

Comparison between the apply function and the rowSums function:

apply_func <- function(x) {
    apply(x, 1, sum)
}

r_sum <- function(x) {
    rowSums(x)
}

library(microbenchmark)

# Compare the methods (100 runs each, matching neval in the output below)
microbenchmark(
    apply_func = apply_func(mat),
    r_sum = r_sum(mat), times = 100
)

------ Output -- in milliseconds --------

       expr       min        lq      mean    median        uq      max neval
 apply_func 207.84661 260.34475 280.14621 279.18782 294.85119 354.1821   100
      r_sum  10.76534  11.53194  13.00324  12.72792  14.34045  16.9014   100

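You will notice that the mean time of the rowSums function is about 21 times smaller than that of the apply function. If the matrix has many columns, the difference in elapsed time may be even more significant.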

- Hamzah

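The main point is the idea, regardless of which dataset I am working with; what holds for small matrices usually also holds for large benchmarks. - Hamzah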

1

Thanks for the suggestion. I have set times to 100. - Hamzah

1

This may also help, although without question the best option is the rowSums function:

data$new <- Reduce(function(x, y) {
  x + data[, y]
}, init = data[, 43], 44:167)

- Anoushiravan R

1

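You can use the adorn_totals function from the janitor package. Depending on the value you give to the where argument, you can sum across columns or rows.

For example: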

tibble::tibble(
a = 10:20,
b = 55:65,
c = 2010:2020,
d = c(LETTERS[1:11])) %>%
janitor::adorn_totals(where = "col") %>%
tibble::as_tibble()

Result:

# A tibble: 11 x 5
       a     b     c d     Total
   <int> <int> <int> <chr> <dbl>
 1    10    55  2010 A      2065
 2    11    56  2011 B      2067
 3    12    57  2012 C      2069
 4    13    58  2013 D      2071
 5    14    59  2014 E      2073
 6    15    60  2015 F      2075
 7    16    61  2016 G      2077
 8    17    62  2017 H      2079
 9    18    63  2018 I      2081
10    19    64  2019 J      2083
11    20    65  2020 K      2085

- Light

Content provided by Stack Overflow.

- Greg · Accepted Answer

143

You can use rowSums. rowSums(data) should give you what you want.

- Greg

19

For the OP's problem: data$new <- rowSums(data[43:167]), which sums columns 43 through 167 of data row by row and stores the result in a new column new. - Marek

12

To save someone some time: avoid confusing this with the rowsum function, which does something else! - Augustin
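To make the difference concrete, here is a small sketch on a toy matrix (m, the group labels, and grouped are made up for illustration). rowSums adds across the columns of each row, while base R's rowsum adds rows together within each group:

```r
# Toy matrix with rows (1, 4), (2, 5), (3, 6)
m <- matrix(1:6, nrow = 3)

# rowSums: one total per row
rowSums(m)  # 5 7 9

# rowsum: column sums computed separately for each group of rows
grouped <- rowsum(m, group = c("a", "a", "b"))
grouped
# group "a" combines rows 1 and 2: 3 9
# group "b" is row 3 alone:        3 6
```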