Row-wise sum of products across multiple columns using the tidyverse

Question

r dataframe dplyr purrr

3

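I am looking for an elegant tidyverse solution that creates the sum of m products, each taken over n columns. It should not rely on positional matching, and it should generalize.

I tried purrr::pmap_dbl(select(., ends_with(i)), prod), but did not get very far.

Example with m = 3 and n = 2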

library(tidyverse)

df <- tibble(
  x_0 = c(5,6),
  x_1 = c(9,1),
  x_2 = c(2,1),
  y_0 = c(3,2),
  y_1 = c(3,2),
  y_2 = c(1,3)
)
df
# A tibble: 2 × 6
# x_0   x_1   x_2   y_0   y_1   y_2
#<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#   5     9     2     3     3     1
#   6     1     1     2     2     3

I want to compute, row by row, the sum of the element-wise products:
sum_of_products = x_0 * y_0 + x_1 * y_1 + x_2 * y_2

First row: 5*3 + 9*3 + 2*1 = 44; second row: 6*2 + 1*2 + 1*3 = 17
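As a quick sanity check of the arithmetic, the formula can be written out by hand in base R. This is only a non-general reference sketch (it hard-codes the column names, which is exactly what the question wants to avoid); the names `df_check` and `expected` are introduced here for illustration.

```r
# Example data from the question, rebuilt as a plain data frame
df_check <- data.frame(
  x_0 = c(5, 6), x_1 = c(9, 1), x_2 = c(2, 1),
  y_0 = c(3, 2), y_1 = c(3, 2), y_2 = c(1, 3)
)

# Hard-coded sum of products, evaluated row by row (vectorised over rows)
expected <- with(df_check, x_0 * y_0 + x_1 * y_1 + x_2 * y_2)
expected
# 44 17
```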

Expected output

df_with_sum_of_products
# x_0   x_1   x_2   y_0   y_1   y_2  sum_of_products
#<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>           <dbl>
#   5     9     2     3     3     1               44
#   6     1     1     2     2     3               17

- gosz

5 Answers

2

One possible solution:

library(dplyr)

df %>% 
  mutate(sum_prod = rowSums(across(1:3)*across(4:6)))

#> # A tibble: 2 × 7
#>     x_0   x_1   x_2   y_0   y_1   y_2 sum_prod
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1     5     9     2     3     3     1       44
#> 2     6     1     1     2     2     3       17

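Or, more generally (the relocate step is only needed when the columns are not already in matching order, for example when x_2 comes before x_1):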

library(dplyr)

df %>%
  relocate(order(names(.))) %>% 
  mutate(sum_prod = rowSums(across(starts_with("x"))*across(starts_with("y"))))

#> # A tibble: 2 × 7
#>     x_0   x_1   x_2   y_0   y_1   y_2 sum_prod
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1     5     9     2     3     3     1       44
#> 2     6     1     1     2     2     3       17

- PaulS

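This is neat, but it is not general and it uses positional matching (which is not what I want). - chamaoskurumi

Hold on, @gosz. - PaulS

@gosz, please see my updated solution, which is fully generalizable. - PaulS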

1

Yes, great. Please do :) - ThomasIsCoding

2

If you want to make sure you always get the exact column pairs, use: df %>% mutate(sum_prod = rowSums(across(starts_with("x"), ~ . * get(sub("x", "y", cur_column()))))) :) - tmfmnk
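To illustrate why name-based pairing matters, here is a small sketch that applies the idea from the comment above to a deliberately shuffled copy of the example data. The object names `df_shuffled` and `res` are made up for this illustration, and the sketch assumes every x_i column has a partner named y_i.

```r
library(dplyr)

# Same values as the example data, but with the columns in scrambled order
df_shuffled <- tibble(
  y_1 = c(3, 2), x_2 = c(2, 1), y_0 = c(3, 2),
  x_0 = c(5, 6), y_2 = c(1, 3), x_1 = c(9, 1)
)

# Each x_* column is multiplied by the y_* column with the same suffix,
# looked up by name via cur_column(), so column order does not matter
res <- df_shuffled %>%
  mutate(sum_prod = rowSums(
    across(starts_with("x"), ~ .x * get(sub("^x", "y", cur_column())))
  ))

res$sum_prod
```

With purely positional matching, the same shuffled input would pair unrelated columns, which is why the positional variants sort or relocate the columns first.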

1

We can use rowSums + Reduce (after using split.default to split the data frame into its two parts, x_ and y_).

df %>%
  select(order(names(.))) %>%
  mutate(sum_of_prod = rowSums(
    Reduce(
      `*`,
      split.default(., gsub("_.*", "", names(.)))
    )
  ))

Alternatively, we can use tcrossprod to compute the sum of products.

df %>%
  select(order(names(.))) %>%
  mutate(sum_of_prod = diag(
    do.call(
      tcrossprod,
      lapply(
        split.default(., gsub("_.*", "", names(.))),
        as.matrix
      )
    )
  ))

This gives

# A tibble: 2 × 7
    x_0   x_1   x_2   y_0   y_1   y_2 sum_of_prod
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
1     5     9     2     3     3     1          44
2     6     1     1     2     2     3          17
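The link between the two variants can be seen on plain matrices: the diagonal of X %*% t(Y) holds the dot product of row i of X with row i of Y, which is the same as the row sums of the element-wise product. A minimal base R sketch, with `X` and `Y` standing in for the x_ and y_ blocks of the example data:

```r
# x_0, x_1, x_2 and y_0, y_1, y_2 as 2 x 3 matrices (filled column by column)
X <- matrix(c(5, 6, 9, 1, 2, 1), nrow = 2)
Y <- matrix(c(3, 2, 3, 2, 1, 3), nrow = 2)

via_rowsums    <- rowSums(X * Y)          # element-wise product, then row sums
via_tcrossprod <- diag(tcrossprod(X, Y))  # diagonal of X %*% t(Y)

via_rowsums
# 44 17
via_tcrossprod
# 44 17
```

Note that tcrossprod builds the full rows-by-rows matrix before the diagonal is extracted, so the rowSums variant is likely the better choice for data with many rows.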

- ThomasIsCoding

0

Just for fun, we can also build a formula and use rowwise():

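library(tidyverse)

# Index suffixes ("0", "1", "2") and variable prefixes ("x", "y")
m <- unique(str_remove(names(df), ".*_"))
n <- unique(str_remove(names(df), "_.*"))

# Build the expression as a string, one product term per index
formula <- "0"
for (each_m in m) formula <- paste0(formula, " + ", paste0(paste0(n, "_", each_m), collapse = " * "))

df |>
  rowwise() |>
  mutate(sum_of_products = eval(rlang::parse_expr(formula))) |>
  ungroup()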

Output:

# A tibble: 2 × 7
    x_0   x_1   x_2   y_0   y_1   y_2 sum_of_products
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>           <dbl>
1     5     9     2     3     3     1              44
2     6     1     1     2     2     3              17
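To make the formula-building step easier to follow, the sketch below reproduces it in base R only and prints the resulting string. It uses `sub()` and `parse(text = ...)` as stand-ins for the stringr and rlang calls, and the names `col_names`, `formula_str` and `first_row` are hypothetical.

```r
col_names <- c("x_0", "x_1", "x_2", "y_0", "y_1", "y_2")

m <- unique(sub(".*_", "", col_names))  # index suffixes: "0" "1" "2"
n <- unique(sub("_.*", "", col_names))  # variable prefixes: "x" "y"

# One product term per index, joined with " + "
formula_str <- "0"
for (each_m in m) {
  term <- paste(paste0(n, "_", each_m), collapse = " * ")
  formula_str <- paste0(formula_str, " + ", term)
}
formula_str
# "0 + x_0 * y_0 + x_1 * y_1 + x_2 * y_2"

# Evaluate the string against the first row of the example data
first_row <- list(x_0 = 5, x_1 = 9, x_2 = 2, y_0 = 3, y_1 = 3, y_2 = 1)
eval(parse(text = formula_str), first_row)
# 44
```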

Update: a general (if not very elegant) solution.

- harre

0

Another option, just for fun:

library(magrittr) ## for the %>% pipe operator

df$sum_of_products <- 
df %>%
  apply(1, \(r){
    r %>%
      matrix(.,,2) %>%
      apply(., 1, prod) %>%
      sum
  })

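This solution uses base R's row-wise apply to reshape each data frame row into a two-column matrix, then (with a second apply) reduces it to the product across columns for each matrix row, and finally sums those products.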
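The reshaping step may be easier to see on a single row. The sketch below walks through the first row of the example data; it assumes, as the answer does, that all x columns come first, followed by all y columns in the same index order. The names `r`, `pairs`, `products` and `total` are only illustrative.

```r
# First row of the example data as a plain numeric vector
r <- c(x_0 = 5, x_1 = 9, x_2 = 2, y_0 = 3, y_1 = 3, y_2 = 1)

# Fill a two-column matrix column by column: column 1 = x values, column 2 = y values
pairs <- matrix(r, ncol = 2)

# Product within each matrix row (x_i * y_i), then the sum over all indices
products <- apply(pairs, 1, prod)
products
# 15 27  2
total <- sum(products)
total
# 44
```

For n variable groups instead of two, the same idea would use ncol = n, again relying on the columns being ordered group by group.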

- user18309711

- Limey · Accepted Answer

For a fully general and robust solution, I think it is best to reshape the data frame into a form that is better suited to the task.

df %>% 
  mutate(row=row_number()) %>% 
  pivot_longer(
    -row, 
    names_sep="_", 
    names_to=c("name", "index")
  ) %>%  
  group_by(row, index) %>% 
  pivot_wider(names_from=name, values_from=value)
# A tibble: 6 x 4
# Groups:   row, index [6]
    row index     x     y
  <int> <chr> <dbl> <dbl>
1     1 0         5     3
2     1 1         9     3
3     1 2         2     1
4     2 0         6     2
5     2 1         1     2
6     2 2         1     3

Then compute the product for each row and index, and sum the products within each row.

df %>% 
  mutate(row=row_number()) %>% 
  pivot_longer(
    -row, 
    names_sep="_", 
    names_to=c("name", "index")
  ) %>%  
  group_by(row, index) %>% 
  pivot_wider(names_from=name, values_from=value) %>% 
  mutate(product=x * y) %>% 
  group_by(row) %>% 
  summarise(sum_product=sum(product))
# A tibble: 2 x 2
    row sum_product
  <int>       <dbl>
1     1          44
2     2          17

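This approach is robust to the number of rows, the number of variable types (such as x, y and z), and the number of indices (such as 1, 2 and 3).

Edit

My earlier claim that the solution above is robust to the number of variable types was wrong (because of the mutate(product = x * y) step in the pipeline). Here is a solution that is, together with a modified input dataset to demonstrate its robustness.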

df1 <- tibble(
  x_0 = c(5,6,1,-1), x_1 = c(9,1,1,3), x_2 = c(2,1,3,4),
  y_0 = c(3,2,1, 2), y_1 = c(3,2,2,2), y_2 = c(1,3,2,2),
  z_0 = c(4,5,1, 3), z_1 = c(3,1,2,1), z_2 = c(2,2,1,3)

)

df1 %>% 
  mutate(row=row_number()) %>% 
  pivot_longer(
    -row, 
    names_sep="_", 
    names_to=c("name", "index")
  ) %>%  
  group_by(row, index) %>% 
  pivot_wider(names_from=name, values_from=value) %>% 
  group_map(
    function(.x, .y, .keep=TRUE) {
      .y %>% bind_cols(.x %>% mutate(product = unlist(apply(.x, 1, prod))))
    }
  ) %>% bind_rows() %>% 
  group_by(row) %>% 
  summarise(sum_product=sum(product))
# A tibble: 4 x 2
    row sum_product
  <int>       <dbl>
1     1         145
2     2          68
3     3          11
4     4          24
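As a possible simplification, offered here as a sketch rather than as part of the original answer, the pivot_wider and group_map steps can be skipped by taking prod() directly on the long data. This assumes every column other than the row id follows the name_index pattern; `df1_demo` and `result` are names chosen for this example.

```r
library(dplyr)
library(tidyr)

# Same values as the modified input dataset above
df1_demo <- tibble(
  x_0 = c(5, 6, 1, -1), x_1 = c(9, 1, 1, 3), x_2 = c(2, 1, 3, 4),
  y_0 = c(3, 2, 1, 2),  y_1 = c(3, 2, 2, 2), y_2 = c(1, 3, 2, 2),
  z_0 = c(4, 5, 1, 3),  z_1 = c(3, 1, 2, 1), z_2 = c(2, 2, 1, 3)
)

result <- df1_demo %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_sep = "_", names_to = c("name", "index")) %>%
  group_by(row, index) %>%
  summarise(product = prod(value), .groups = "drop") %>%  # x_i * y_i * z_i
  group_by(row) %>%
  summarise(sum_product = sum(product))                   # sum over indices

result$sum_product
# 145  68  11  24
```

Because the product is taken over whatever values share a row and an index, adding a fourth variable type or more indices should not require any change to the pipeline.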