Cumulative minimum and maximum by group

Question

I'm trying to calculate a running minimum and maximum for a data frame in R. The data frame looks like this:

+-----+--------------+-----------+------+------+
| Key | DaysToEvent  | PriceEUR  | Pmin | Pmax |
+-----+--------------+-----------+------+------+
| AAA | 120          |        50 |   50 |   50 |
| AAA | 110          |        40 |   40 |   50 |
| AAA | 100          |        60 |   40 |   60 |
| BBB | ...          |           |      |      |
+-----+--------------+-----------+------+------+

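So the running minimum price (Pmin) holds the lowest price seen for that key up to that point in time (DaysToEvent), and Pmax likewise holds the highest.

This is my implementation: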

currentKey <- ""  # no key seen yet, so the first row triggers a reset
for (i in 1:nrow(data)){
  currentRecord <- data[i,]

  if(currentRecord$Key != currentKey) {
    # New key detected - reset pmin and pmax
    pmin <- 100000
    pmax <- 0
    currentKey <- currentRecord$Key
  }

  if(currentRecord$PriceEUR < pmin) {
    pmin <- currentRecord$PriceEUR
  }
  if(currentRecord$PriceEUR > pmax) {
    pmax <- currentRecord$PriceEUR
  }

  currentRecord$Pmin <- pmin
  currentRecord$Pmax <- pmax

  # This line seems to be killing my performance
  # but otherwise the data variable is not updated in
  # global space
  data[i,] <- currentRecord
}

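This works, but it is really slow: it only processes a few rows per second. It works because I sorted the data frame like this: data = data[order(data$Key, -data$DaysToEvent), ]. I did that because I expected the sort to cost n log(n) and the for loop to cost n. So I thought I would fly through this data, but that is not the case at all; it takes hours.

Is there any way to make this faster?

The previous approach came from my colleague; here it is in pseudocode: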

for (i in 1:nrow(data)) {
    ...
    currentRecord$Pmin <- data[subset on the key[find the min value of the price 
                      where DaysToEvent > currentRecord$DaysToEvent]]
    ...
}

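That also works, but I think the order of those functions is much higher: n^2 log(n) if I'm not mistaken, and it takes days. So I figured my version would be a big improvement.

So I've tried to get my head around the various *apply and by functions, which of course are what you really want to use here.

If I use by() and split on the key, that gets me close. However, I can't figure out how to get the running min/max from there.

I'm trying to think in a functional paradigm, but I'm stuck. Any help is appreciated.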

- Jochen van Wylick

1 Answer

- Marat Talipov · Accepted Answer

[Original answer: dplyr]

You can solve this with the dplyr package:

library(dplyr)
d %>% 
  group_by(Key) %>% 
  mutate(Pmin=cummin(PriceEUR),Pmax=cummax(PriceEUR))

#   Key DaysToEvent PriceEUR Pmin Pmax
# 1 AAA         120       50   50   50
# 2 AAA         110       40   40   50
# 3 AAA         100       60   40   60
# 4 BBB         100       50   50   50

where d is supposed to be your data set:

d <- data.frame(Key=c('AAA','AAA','AAA','BBB'),DaysToEvent = c(120,110,100,100),PriceEUR = c(50,40,60,50), Pmin = c(50,40,40,30), Pmax = c(50,50,60,70))
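
The pipeline above relies on the rows already being ordered by decreasing DaysToEvent within each key, as in the question. If that is not guaranteed, the sort can be made part of the pipeline. This is only a sketch: d_unsorted is a made-up input with the same columns as the question, not the asker's real data.

```r
library(dplyr)

# Hypothetical input: same columns as the question, rows deliberately shuffled
d_unsorted <- data.frame(
  Key         = c("AAA", "BBB", "AAA", "AAA"),
  DaysToEvent = c(100, 100, 120, 110),
  PriceEUR    = c(60, 50, 50, 40)
)

result <- d_unsorted %>%
  arrange(Key, desc(DaysToEvent)) %>%  # within each key: furthest from the event first
  group_by(Key) %>%
  mutate(Pmin = cummin(PriceEUR),      # lowest price seen so far for this key
         Pmax = cummax(PriceEUR)) %>%  # highest price seen so far for this key
  ungroup()

print(result)
```

Calling ungroup() at the end is optional; it just returns a plain tibble so later steps are not silently grouped by Key.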

[Update: data.table]

Another approach is to use data.table, which has quite impressive performance:

library(data.table)
DT <- setDT(d)
DT[,c("Pmin","Pmax") := list(cummin(PriceEUR),cummax(PriceEUR)),by=Key]

DT
#    Key DaysToEvent PriceEUR Pmin Pmax
# 1: AAA         120       50   50   50
# 2: AAA         110       40   40   50
# 3: AAA         100       60   40   60
# 4: BBB         100       50   50   50

[Update 2: base R]

If for some reason you want to use only base R, here is another approach:

d$Pmin <- unlist(lapply(split(d$PriceEUR,d$Key),cummin))
d$Pmax <- unlist(lapply(split(d$PriceEUR,d$Key),cummax))
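
Note that split() followed by unlist() returns the values in key order, so the two lines above line up with the rows only when the data frame is already sorted by Key (which it is in the question). As a possible alternative, base R's ave() applies a function within each group and returns the results in the original row order. A minimal sketch, where d_mixed is a made-up data frame with interleaved keys whose rows are still in decreasing DaysToEvent order within each key:

```r
# Hypothetical input: keys interleaved, but time-ordered within each key
d_mixed <- data.frame(
  Key         = c("BBB", "AAA", "AAA", "BBB", "AAA"),
  DaysToEvent = c(100, 120, 110, 90, 100),
  PriceEUR    = c(50, 50, 40, 70, 60)
)

# ave() runs FUN on each Key group and puts each result back on its own row
d_mixed$Pmin <- ave(d_mixed$PriceEUR, d_mixed$Key, FUN = cummin)
d_mixed$Pmax <- ave(d_mixed$PriceEUR, d_mixed$Key, FUN = cummax)

print(d_mixed)
```

The rows still need to be in the right time order within each key; ave() only removes the requirement that the keys themselves be contiguous and sorted.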