Select the n rows of a matrix with the highest combined value in R

Question


3

This is part of a large matrix (dimensions roughly 1,000-1,000,000 rows x 100-1,000 columns):

     scen_1   scen_2  scen_3    scen_4 ...
...
9  3.262275 0.000000 0.00000 0.0000000 ...
10 2.843631 0.000000 1.22636 1.0559217 ...
11 0.000000 0.000000 0.00000 0.9836209 ...
12 2.572686 0.000000 0.00000 1.1000293 ...
13 0.000000 0.000000 0.00000 0.0000000 ...
14 0.611070 1.478159 0.00000 0.0000000 ...
15 0.000000 0.000000 0.00000 0.0000000 ...
16 0.000000 0.000000 0.00000 1.0146529 ...
...

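Now I want to select the n rows that, after taking the maximum of each column, have the highest sum, so the rows should complement each other well. For example, if I choose rows 9 and 10, I get the combined (column-wise maximum) vector 3.262275 0.00000 1.22636 1.0559217, which sums to 5.5445567. If I choose rows 14 and 16 instead, I get 0.611070 1.478159 0.00000 1.0146529, which sums to 3.1038819, so the first choice is better.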

For the example above, the solution for n = 3 would be rows 10, 14 and 9. I hope I have explained the problem clearly.
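
The arithmetic above can be reproduced directly. The snippet below is a minimal sketch that retypes only the eight rows and four columns visible in the excerpt (the elided rows and columns are ignored, so it says nothing about the full matrix). The names `excerpt` and `combined_value` are made up for illustration.

```r
# Rows 9-16 and the four visible columns of the excerpt above.
excerpt <- matrix(
  c(3.262275, 0.000000, 0.00000, 0.0000000,
    2.843631, 0.000000, 1.22636, 1.0559217,
    0.000000, 0.000000, 0.00000, 0.9836209,
    2.572686, 0.000000, 0.00000, 1.1000293,
    0.000000, 0.000000, 0.00000, 0.0000000,
    0.611070, 1.478159, 0.00000, 0.0000000,
    0.000000, 0.000000, 0.00000, 0.0000000,
    0.000000, 0.000000, 0.00000, 1.0146529),
  byrow = TRUE, ncol = 4,
  dimnames = list(9:16, paste0("scen_", 1:4))
)

# Column-wise maximum over the chosen rows, then summed.
combined_value <- function(m, rows) {
  sum(apply(m[rows, , drop = FALSE], 2, max))
}

combined_value(excerpt, c("9", "10"))   # 5.5445567, as stated above
combined_value(excerpt, c("14", "16"))  # 3.1038819, as stated above

# With only 8 rows, all choose(8, 3) = 56 triples can be checked exhaustively.
triples <- combn(rownames(excerpt), 3)
scores  <- apply(triples, 2, function(r) combined_value(excerpt, r))
triples[, which.max(scores)]            # "9" "10" "14"
```

An exhaustive check like this is only practical for a handful of rows; the number of combinations grows far too quickly for the matrix sizes mentioned above.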

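My approach is to first pick the row with the highest row sum, and then repeatedly pick the row that adds the most value to the current selection. But I strongly suspect that this does not always give the best solution. Because of the size of the matrix, computing all possible combinations is not feasible. Would a genetic algorithm be an option? Or is there a simpler way?

Thanks.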

Edit:

To make this easier to follow, here is an MWE:

# Create example matrix
mat <- matrix(c(1.562275, 0.000000, 0.00000, 0.0000000,2.843631, 0.000000, 1.22636, 1.0559217,0.000000, 0.000000, 0.00000, 0.9836209,1.572686, 0.000000, 0.00000, 1.8000293,0.000000, 0.000000, 0.00000, 0.0000000,1.611070, 1.478159, 0.00000, 0.0000000,0.000000, 0.000000, 0.00000, 0.0000000,0.000000, 0.000000, 0.00000, 1.0146529), byrow = TRUE,  ncol = 4, dimnames = list(c(9:16), c("scen_1",  "scen_2",  "scen_3", "scen_4")))

# Function to evaluate each combination of rows (this value should be maximized)
get_combined_max_value_sum <- function(choosen_rows){
  # Select rows (drop = FALSE keeps a matrix even when a single row is chosen)
  sel_mat <- mat[choosen_rows, , drop = FALSE]
  
  # calculate column-wise max
  max_mat <- apply(sel_mat, 2, max)
  
  # Sum the values
  return(sum(max_mat))
}

# I am looking for the function best_rows() which returns the rows, which gives the 
# maximum value (or at least a close guess) for the get_combined_max_value_sum() 
# function
best_rows <- function(n_rows){
  result <- vector()
  
  # do some magic
  
  return(result) # vector with length n_row for the "best" rows.
}

# ------------------------------------------------
# @ slamballais
# The rows with the highest row-wise sum (10 & 12)
get_combined_max_value_sum(c("10","12"))

# get a lower score than rows 10 and 14
get_combined_max_value_sum(c("10","14"))

- WitheShadow

1

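The question is not very clear. Can't you just use rowSums to get the sum of each row and then sort the rows from largest to smallest? Then you can pick the top N rows, where N is whatever you want. - slamballais

Yes, there is some information you have not given us. If you use row 9, are the other rows then not allowed to use row 9's non-zero columns, or what? - dash2

Shouldn't you compute the max row-wise? max_mat <- apply(sel_mat, 1, max). Also, to clarify, you are not looking for an answer that checks every combination of values? - Ronak Shah

What matters more to you, the values or the indices of the values? - Chris

No, I am looking for the column-wise maximum. Think of it like this: the columns (scen_1, scen_2, etc.) are scenarios that may occur in the future. The rows are the different options you can choose from. The values are scores of how well an option "performs" in a scenario. I can choose n options now (about 10), but in the future I only care about the best of my options for the scenario that actually occurs. In short: I choose n rows (options) now, and then a column is picked at random. The highest value in that column among my chosen rows is my result. Which rows should I choose up front to maximize my result? - WitheShadow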


2 Answers

3

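The following is not an optimal solution, but it may inspire others…

Assumptions

The answer has k rows, where k is specified by the user in advance.
k <= the number of columns of mat

Answer

Some rows will never be part of the answer. I suggest filtering these rows out before applying a brute-force approach. So far, the filter criteria are:

Remove rows whose sum is lower than the smallest of the column maxima
Remove rows whose values are all lower than those of any row that contains a column maximum

Code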

slam <- function(mat, k) {
  cm <- apply(mat, 2, max)
  rs <- apply(mat, 1, function(x) sum(x[x > 0], na.rm = TRUE))
  
  # remove rows whose sum is lower than the lowest column max
  matb <- subset(mat, rs > min(cm))
  
  # remove rows that have only values lower than all values of the rows containing a column max
  mrows <- matb[apply(matb, 2, which.max), ]
  any_bigger <- apply(mrows, 1, function(x) rowSums(sweep(matb, 2, x, `-`) >= 0) > 0)
  matc <- matb[apply(any_bigger, 1, all), ]
  
  # code copied + modified from @ThomasIsCoding's answer
  rs <- combn(nrow(matc), k)
  row.names(matc)[rs[, which.max(apply(rs, 2, function(z) sum(do.call(pmax, data.frame(t(matc[z, ]))))))]]
}

Examples and benchmark


# bigger dataset with 100 rows and negative values too
n <- 100
n2 <- 500
set.seed(2021)
mat2 <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(c(1:n), c("scen_1",  "scen_2",  "scen_3", "scen_4")))
mat3 <- matrix(rnorm(n2 * 4), ncol = 4, dimnames = list(c(1:n2), c("scen_1",  "scen_2",  "scen_3", "scen_4")))

# verification
slam(mat, 3)     # [1] "10" "12" "14"
thomas(mat)      # [1] "10" "12" "14"
slam(mat2, 3)    # [1] "25" "44" "99"
thomas(mat2)     # [1] "25" "44" "99"

# benchmark (without `thomas(mat3)`, it takes too long)
microbenchmark::microbenchmark(slam(mat2, 3), thomas(mat2),
                               slam(mat3, 3), times = 1L)

# Unit: milliseconds
#          expr        min         lq       mean     median         uq        max neval
# slam(mat2, 3)   249.4705   249.4705   249.4705   249.4705   249.4705   249.4705     1
#  thomas(mat2) 19557.8194 19557.8194 19557.8194 19557.8194 19557.8194 19557.8194     1
# slam(mat3, 3) 16159.9113 16159.9113 16159.9113 16159.9113 16159.9113 16159.9113     1

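Final thoughts

There is another way to approach this. Start from an initial combination made of the k rows that contain the k largest column maxima. For those rows, check whether any other row gives a better result on the remaining columns. If a better row exists, try swapping it into the initial combination. Repeat this until the best rows have been selected. I don't have time to write the code right now, but if nobody has done it by tomorrow, I'll give it a try.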
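
The swap idea above is left uncoded in this answer. What follows is a minimal sketch of one possible reading of it, not the author's implementation. The names `score_rows` and `swap_search` are hypothetical, and two details are assumptions: when one row holds several of the top column maxima, the selection is padded with the remaining rows that have the highest row sums, and a swap is accepted whenever it strictly raises the score. It is a local search, so it can stop at a local optimum.

```r
# MWE matrix from the question, retyped so the sketch is self-contained.
mat <- matrix(c(1.562275, 0.000000, 0.00000, 0.0000000,
                2.843631, 0.000000, 1.22636, 1.0559217,
                0.000000, 0.000000, 0.00000, 0.9836209,
                1.572686, 0.000000, 0.00000, 1.8000293,
                0.000000, 0.000000, 0.00000, 0.0000000,
                1.611070, 1.478159, 0.00000, 0.0000000,
                0.000000, 0.000000, 0.00000, 0.0000000,
                0.000000, 0.000000, 0.00000, 1.0146529),
              byrow = TRUE, ncol = 4,
              dimnames = list(9:16, paste0("scen_", 1:4)))

# Hypothetical helper: sum of column-wise maxima over the rows in idx.
score_rows <- function(mat, idx) sum(apply(mat[idx, , drop = FALSE], 2, max))

# Hypothetical swap-based local search (assumes k <= ncol(mat), as in the answer).
swap_search <- function(mat, k) {
  stopifnot(k >= 1, k <= ncol(mat), k <= nrow(mat))
  # Initial selection: rows holding the k largest column maxima.
  col_max  <- apply(mat, 2, max)
  top_cols <- order(col_max, decreasing = TRUE)[seq_len(k)]
  sel <- unique(unname(apply(mat[, top_cols, drop = FALSE], 2, which.max)))
  # Assumption: pad with the highest-sum remaining rows if fewer than k rows were found.
  if (length(sel) < k) {
    rest <- setdiff(seq_len(nrow(mat)), sel)
    rest <- rest[order(rowSums(mat[rest, , drop = FALSE]), decreasing = TRUE)]
    sel  <- c(sel, rest[seq_len(k - length(sel))])
  }
  best <- score_rows(mat, sel)
  repeat {
    improved <- FALSE
    for (i in seq_along(sel)) {
      for (cand in setdiff(seq_len(nrow(mat)), sel)) {
        trial <- sel
        trial[i] <- cand               # swap candidate into position i
        s <- score_rows(mat, trial)
        if (s > best) {                # keep only strict improvements
          sel <- trial
          best <- s
          improved <- TRUE
        }
      }
    }
    if (!improved) break               # no single swap helps any more
  }
  row.names(mat)[sort(sel)]
}

swap_search(mat, 3)  # "10" "12" "14"
swap_search(mat, 2)  # "10" "14"
```

On the MWE both calls agree with an exhaustive search, but that is a property of this small example, not a guarantee. Each pass costs roughly k * nrow(mat) score evaluations, which is far cheaper than enumerating all combinations.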

- slamballais


- ThomasIsCoding · Accepted Answer

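Update (recursive approach, suboptimal solution)

You can define a recursive function f (see the code in the function thomas2), which works for any number of rows k (1 <= k <= nrow(mat)).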

thomas2 <- function(mat, k) {
  f <- function(mat, k) {
    if (k == 1) {
      return(which.max(rowSums(mat)))
    }
    p <- f(mat, k - 1)
    q <- seq(nrow(mat))[-p]
    rmax <- apply(mat[p, , drop = FALSE], 2, max)
    c(p, q[which.max(sapply(q, function(k) sum(pmax(rmax, mat[k, ]))))])
  }
  row.names(mat)[sort(f(mat, k))]
}

For example

> thomas2(mat, 2)
[1] "10" "14"

> thomas2(mat, 3)
[1] "10" "12" "14"

> thomas2(mat, 4)
[1] "9"  "10" "12" "14"

> thomas2(mat, 5)
[1] "9"  "10" "11" "12" "14"

> thomas2(mat, 6)
[1] "9"  "10" "11" "12" "13" "14"

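Previous answer (brute force, inefficient)

Your algorithm is a greedy algorithm, which is not guaranteed to reach the global maximum. Brute force is therefore probably the simplest way to get there.

Perhaps you can try the following brute-force approach: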

rs <- combn(nrow(mat), 3)
row.names(mat)[rs[, which.max(apply(rs, 2, function(k) sum(do.call(pmax, data.frame(t(mat[k, ]))))))]]

which gives

[1] "10" "12" "14"
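
The claim that a greedy choice is not guaranteed to reach the global maximum can be checked on a small made-up matrix. The sketch below is illustrative only: `toy`, `score_rows`, `greedy_rows` and `brute_rows` are hypothetical names, and `greedy_rows` follows the greedy idea described in the question (highest row sum first, then the row with the largest gain) rather than reproducing any code from this thread.

```r
# Made-up 3 x 4 matrix: row "A" has the largest row sum,
# but rows "B" and "C" complement each other perfectly.
toy <- matrix(c(2, 2, 2, 2,
                3, 3, 0, 0,
                0, 0, 3, 3),
              byrow = TRUE, ncol = 4,
              dimnames = list(c("A", "B", "C"), paste0("scen_", 1:4)))

# Sum of column-wise maxima over the rows in idx.
score_rows <- function(m, idx) sum(apply(m[idx, , drop = FALSE], 2, max))

# Greedy: start with the highest row sum, then add the row with the largest gain.
greedy_rows <- function(m, k) {
  sel <- which.max(rowSums(m))
  while (length(sel) < k) {
    rest  <- setdiff(seq_len(nrow(m)), sel)
    gains <- sapply(rest, function(i) score_rows(m, c(sel, i)))
    sel   <- c(sel, rest[which.max(gains)])
  }
  unname(sel)
}

# Brute force: score every combination of k rows.
brute_rows <- function(m, k) {
  combos <- combn(nrow(m), k)
  combos[, which.max(apply(combos, 2, function(idx) score_rows(m, idx)))]
}

g <- greedy_rows(toy, 2)
b <- brute_rows(toy, 2)
rownames(toy)[g]; score_rows(toy, g)  # "A" "B", score 10
rownames(toy)[b]; score_rows(toy, b)  # "B" "C", score 12
```

Greedy commits to row "A" because of its row sum and can then reach at most 10, while the pair "B" and "C" reaches 12.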