Fast calculation of the Pareto front in R

Question

6

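I'm trying to calculate the Pareto front (http://en.wikipedia.org/wiki/Pareto_efficiency) in R. I can do it, but not efficiently: the computation slows down considerably as the number of points grows.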

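Overall, what I want to do is check every point for whether it is non-dominated (or dominated). The way I have been doing this is to find all points such that x_i > X and y_i > Y, where (x_i, y_i) is a single point and X and Y stand for the x and y values of all points. That part is fast and easy to implement. There is a second case, though: several points can share the same x value but have different y values, and in that case I want to identify the one with the lowest y value (and vice versa for points that share a y value but have different x values).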

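To illustrate, here is a picture from Wikipedia:

[image from the Wikipedia article on Pareto efficiency]

So basically I want to be able to identify all the points that fall on the red line.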

Here is my code. It works, but it is very inefficient for large data sets:

#Example Data that actually runs quickly
x = runif(10000)
y = runif(10000)

pareto = 1:length(x)

for(i in 1:length(x)){
    cond1 = y[i]!=min(y[which(x==x[i])])
    cond2 = x[i]!=min(x[which(y==y[i])])
    for(n in 1:length(x)){
        if((x[i]>x[n]  &  y[i]>y[n]) | (x[i]==x[n] & cond1) | (y[i]==y[n] & cond2)){
            pareto[i] = NA
            break
        }
    }
}
#All points not on the red line should be marked as NA in the pareto variable

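The slowdown clearly comes from evaluating (x[i]==x[n] & cond1) | (y[i]==y[n] & cond2), but I can't find a way around it or a better Boolean expression that captures everything I want. Any suggestions are greatly appreciated!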
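The three conditions in the loop can be folded into the usual definition of dominance: a point is dominated if some other point is no worse in both coordinates and strictly better in at least one. The sketch below is a brute-force reference under that reading, with smaller values taken as better as in the question. The function name `is_dominated` and the small data set are made up for illustration. It is still quadratic, so it is meant for checking faster versions on small inputs, not as a replacement for them.

```r
# Brute-force reference check (O(n^2)); smaller x and y are better.
# is_dominated() is a hypothetical helper, not part of any package.
is_dominated <- function(x, y) {
  stopifnot(length(x) == length(y))
  vapply(seq_along(x), function(i) {
    # Some point is <= in both coordinates and < in at least one.
    any(x <= x[i] & y <= y[i] & (x < x[i] | y < y[i]))
  }, logical(1))
}

# Small example with ties in both x and y.
x <- c(1, 1, 2, 3, 3)
y <- c(5, 4, 4, 1, 2)
which(!is_dominated(x, y))
# [1] 2 4
```

Here (1, 5) loses to (1, 4) on the x tie, (2, 4) loses to (1, 4) on the y tie, and (3, 2) loses to (3, 1), which leaves points 2 and 4 on the front.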

- user6291

2

I wrote an efficient C++ implementation of Pareto fronts (Skylines) in the rPref package. - Patrick Roocks

3 Answers

3

Edit: new version:

system.time( {
  pareto.2 <- logical(length(x))
  x.sort <- sort(x)
  y.sort <- y[order(x)]
  y.min <- max(y)
  for(i in 1:length(x.sort)) {
    if(pareto.2[i] <- y.sort[i] <= y.min) y.min <- y.sort[i]
  }    
} )
# user  system elapsed 
# 0.036   0.000   0.035

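Old version:

On my system this is about 6x faster. You can probably do better with a better algorithm and Rcpp, but this approach is straightforward. The key is to sort by x: a point is then on the frontier exactly when every point with a smaller x value has a larger (or equal) y value, so the check only has to look at the points that come before it.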

system.time( {
  pareto.2 <- logical(length(x))
  x.sort <- sort(x)
  y.sort <- y[order(x)]
  for(i in 1:length(x.sort)) {
    pareto.2[i] <- all(y.sort[1:i] >= y.sort[i])
  }    
} )
# user  system elapsed 
# 0.86    0.00    0.88

The original:

pareto = 1:length(x)
system.time(
  for(i in 1:length(x)){
    cond1 = y[i]!= min(y[which(x==x[i])])
    cond2 = x[i]!= min(x[which(y==y[i])])
    for(n in 1:length(x)){
      if((x[i]>x[n]  &  y[i]>y[n]) | (x[i]==x[n] & cond1) | (y[i]==y[n] & cond2)){
        pareto[i] = NA
        break
      }
    }
  }  
)
# user  system elapsed 
# 5.32    0.00    5.33

Showing that the two approaches produce the same result (a little tricky, because I need to put pareto.2 back into the original order of x):

all.equal(pareto.2[match(1:length(x), order(x))], !is.na(pareto))
# [1] TRUE
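The reordering step may be easier to follow as a scatter: compute the flags in sorted order, then write them back through the same `order()` permutation. The sketch below assumes the same "smaller is better" convention and uses `cummin()` as one way to express the old version's check (a point is kept when its y equals the running minimum so far). The data and the variable names `ord` and `on_front` are made up for illustration.

```r
# Hypothetical small data set, not taken from the question.
x <- c(0.4, 0.1, 0.9, 0.2)
y <- c(0.3, 0.8, 0.1, 0.9)

ord    <- order(x)                 # permutation that sorts x
y.sort <- y[ord]

# Same test as all(y.sort[1:i] >= y.sort[i]), done in one vectorized pass.
on_front_sorted <- y.sort == cummin(y.sort)

# Scatter the flags back to the original positions.
on_front <- logical(length(x))
on_front[ord] <- on_front_sorted
on_front
# [1]  TRUE  TRUE  TRUE FALSE
```

Indexing with `match(seq_along(x), ord)` or with `order(ord)` gives the same vector, since both are the inverse of the sorting permutation.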

- BrodieG

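I actually found a faster solution in someone else's post, although I like the readability of the for loop. To see the answer, go to http://stackoverflow.com/questions/21296795/counting-points-in-r, where I am now asking a different question. - user6291

@StatMan, have a look at the new version in the answer. It seems to be on par with the one you found. - BrodieG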

0

I wanted to share my solution as a function. It has been tested and works well for computing N Pareto fronts. Set fronts = Inf to compute all fronts.

pareto_front <- function(x, y, fronts = 1, sort = TRUE) {
  stopifnot(length(x) == length(y))
  d <- data.frame(x, y)
  Dtemp <- D <- d[order(d$x, d$y, decreasing = FALSE), ]
  df <- data.frame()
  i <- 1
  while (nrow(Dtemp) >= 1 && i <= max(fronts)) {
    these <- Dtemp[which(!duplicated(cummin(Dtemp$y))), ]
    these$pareto_front <- i
    df <- rbind(df, these)
    Dtemp <- Dtemp[!row.names(Dtemp) %in% row.names(these), ]
    i <- i + 1
  }
  ret <- merge(x = d, y = df, by = c("x", "y"), all.x = TRUE, sort = sort)
  return(ret)
}
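
The peeling idea (take the current front, remove it, repeat on what is left) can also be sketched with plain index vectors. This is an alternative written for illustration, not the function above: it skips the row-name matching and the final `merge()`, and `front_rank` is a hypothetical name. It assumes smaller x and y are better. Exact duplicate points end up on consecutive fronts, because `duplicated()` keeps only the first of them in each pass.

```r
# Assign each point the number of the front it belongs to.
front_rank <- function(x, y) {
  stopifnot(length(x) == length(y))
  rank <- rep(NA_integer_, length(x))
  remaining <- order(x, y)            # original indices, sorted by x then y
  k <- 1L
  while (length(remaining) > 0) {
    # A point is on the current front when it sets a new running minimum of y.
    keep <- !duplicated(cummin(y[remaining]))
    rank[remaining[keep]] <- k
    remaining <- remaining[!keep]     # still sorted, so no re-sort is needed
    k <- k + 1L
  }
  rank
}

x <- c(1, 2, 3, 2, 3, 4)
y <- c(3, 2, 1, 4, 3, 2)
front_rank(x, y)
# [1] 1 1 1 2 2 2
```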

- Bernardo


- user6291 · Accepted Answer

Following @BrodieG:

system.time( {
    d = data.frame(x,y)
    D = d[order(d$x,d$y,decreasing=FALSE),]
    front = D[which(!duplicated(cummin(D$y))),]
} )

   user  system elapsed 
   0.02    0.00    0.02

That is about 43 times faster than the 0.86 s of @BrodieG's old version (0.86/0.02).
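
To see why `!duplicated(cummin(D$y))` picks out the front, it helps to trace it on a few points. After sorting by x, the running minimum of y changes only at a point whose y is lower than every y before it, and those are exactly the points that no earlier point dominates. The data below is made up for illustration.

```r
# Hypothetical five-point example; smaller x and y are better.
d <- data.frame(x = c(1, 2, 3, 4, 5), y = c(5, 6, 3, 3, 1))
D <- d[order(d$x, d$y, decreasing = FALSE), ]

running_min <- cummin(D$y)               # 5 5 3 3 1
is_new_min  <- !duplicated(running_min)  # TRUE FALSE TRUE FALSE TRUE

front <- D[is_new_min, ]
front
#   x y
# 1 1 5
# 3 3 3
# 5 5 1
```

The point (4, 3) ties the running minimum without lowering it, so `duplicated()` drops it, which is the desired outcome because (3, 3) dominates it.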