Total amount received in the last 6 months prior to the transaction date


This is my transaction data. It shows transactions made from accounts in the from column to accounts in the to column, with date and amount information.

data 

id          from    to          date        amount  
<int>       <fctr>  <fctr>      <date>      <dbl>
19521       6644    6934        2005-01-01  700.0
19524       6753    8456        2005-01-01  600.0
19523       9242    9333        2005-01-01  1000.0
…           …       …           …           …
1056317     7819    7454        2010-12-31  60.2
1056318     6164    7497        2010-12-31  107.5
1056319     7533    7492        2010-12-31  164.1

I want to calculate, for each transaction, how much transaction amount the account in the from column received in the last 6 months prior to that transaction's date, and save this information as a new column.

The following code works very well for small datasets (e.g. 1000 rows):

library(dplyr)
library(purrr)

data %>% 
  mutate(total_trx_amount_received_in_last_sixmonth = map2_dbl(
    from, date,
    ~ sum(amount[to == .x & between(date, .y - 180, .y)])
  ))

However, since my data has more than a million rows, this code takes more than a few hours to complete. I searched the internet to see whether I could reduce the runtime and tried the suggestions on SO for making purrr map functions run faster. So instead of dplyr's mutate, I tried the following code using data.table to speed things up:

library(future)
library(furrr)
library(data.table)

setDT(data)
data[, total_trx_amount_received_in_last_sixmonth := furrr::future_pmap_dbl(
  list(from, date),
  ~ sum(amount[to == .x & between(date, .y - 180, .y)])
)]

However, the speed did not improve.

Any suggestions to make this code run faster?

dput() output of the data:

data <- data.frame(
  id = c(
    18529L, 13742L, 9913L, 956L, 2557L, 1602L, 18669L, 35900L,
    48667L, 51341L, 53713L, 60126L, 60545L, 65113L, 66783L, 83324L,
    87614L, 88898L, 89874L, 94765L, 100277L, 101587L, 103444L, 108414L,
    113319L, 121516L, 126607L, 130170L, 131771L, 135002L, 149431L,
    157403L, 157645L, 158831L, 162597L, 162680L, 163901L, 165044L,
    167082L, 168562L, 168940L, 172578L, 173031L, 173267L, 177507L,
    179167L, 182612L, 183499L, 188171L, 189625L, 193940L, 198764L,
    199342L, 200134L, 203328L, 203763L, 204733L, 205651L, 209672L,
    210242L, 210979L, 214532L, 214741L, 215738L, 216709L, 220828L,
    222140L, 222905L, 226133L, 226527L, 227160L, 228193L, 231782L,
    232454L, 233774L, 237836L, 237837L, 238860L, 240223L, 245032L,
    246673L, 247561L, 251611L, 251696L, 252663L, 254410L, 255126L,
    255230L, 258484L, 258485L, 259309L, 259910L, 260542L, 262091L,
    264462L, 264887L, 264888L, 266125L, 268574L, 272959L
  ),
  from = c(
    "5370", "5370", "5370", "8605", "5370", "6390", "5370", "5370", "8934",
    "5370", "5635", "6046", "5680", "8026", "9037", "5370", "7816", "8046",
    "5492", "8756", "5370", "9254", "5370", "5370", "7078", "6615", "5370",
    "9817", "8228", "8822", "5735", "7058", "5370", "8667", "9315", "6053",
    "7990", "8247", "8165", "5656", "9261", "5929", "8251", "5370", "6725",
    "5370", "6004", "7022", "7442", "5370", "8679", "6491", "7078", "5370",
    "5370", "5370", "5658", "5370", "9296", "8386", "5370", "5370", "5370",
    "9535", "5370", "7541", "5370", "9621", "5370", "7158", "8240", "5370",
    "5370", "8025", "5370", "5370", "5370", "6989", "5370", "7059", "5370",
    "5370", "5370", "9121", "5608", "5370", "5370", "7551", "5370", "5370",
    "5370", "5370", "9163", "9362", "6072", "5370", "5370", "5370", "5370",
    "5370"
  ),
  to = c(
    "9356", "5605", "8567", "5370", "5636", "5370", "8933", "8483", "5370",
    "7626", "5370", "5370", "5370", "5370", "5370", "9676", "5370", "5370",
    "5370", "5370", "9105", "5370", "9772", "6979", "5370", "5370", "7564",
    "5370", "5370", "5370", "5370", "5370", "8744", "5370", "5370", "5370",
    "5370", "5370", "5370", "5370", "5370", "5370", "5370", "7318", "5370",
    "8433", "5370", "5370", "5370", "7122", "5370", "5370", "5370", "8566",
    "6728", "9689", "5370", "8342", "5370", "5370", "5614", "5596", "5953",
    "5370", "7336", "5370", "7247", "5370", "7291", "5370", "5370", "6282",
    "7236", "5370", "8866", "8613", "9247", "5370", "6767", "5370", "9273",
    "7320", "9533", "5370", "5370", "8930", "9343", "5370", "9499", "7693",
    "7830", "5392", "5370", "5370", "5370", "7497", "8516", "9023", "7310",
    "8939"
  ),
  date = as.Date(c(
    "2005-05-31", "2005-08-05", "2005-09-12", "2005-10-05", "2005-11-12",
    "2005-11-26", "2005-11-30", "2006-01-31", "2006-03-31", "2006-04-11",
    "2006-04-30", "2006-05-28", "2006-05-31", "2006-06-10", "2006-06-15",
    "2006-08-31", "2006-09-09", "2006-09-13", "2006-09-18", "2006-10-07",
    "2006-10-31", "2006-10-31", "2006-11-08", "2006-11-30", "2006-12-11",
    "2007-01-05", "2007-01-13", "2007-01-24", "2007-01-29", "2007-01-31",
    "2007-03-24", "2007-04-13", "2007-04-14", "2007-04-23", "2007-04-30",
    "2007-04-30", "2007-05-06", "2007-05-09", "2007-05-13", "2007-05-23",
    "2007-05-27", "2007-05-31", "2007-06-03", "2007-06-05", "2007-06-13",
    "2007-06-22", "2007-06-30", "2007-06-30", "2007-07-13", "2007-07-22",
    "2007-07-31", "2007-08-13", "2007-08-14", "2007-08-21", "2007-08-31",
    "2007-08-31", "2007-08-31", "2007-09-05", "2007-09-13", "2007-09-14",
    "2007-09-20", "2007-09-30", "2007-09-30", "2007-09-30", "2007-10-05",
    "2007-10-13", "2007-10-20", "2007-10-27", "2007-10-31", "2007-10-31",
    "2007-10-31", "2007-11-05", "2007-11-12", "2007-11-13", "2007-11-19",
    "2007-11-30", "2007-11-30", "2007-11-30", "2007-12-05", "2007-12-13",
    "2007-12-19", "2007-12-24", "2007-12-31", "2007-12-31", "2007-12-31",
    "2008-01-04", "2008-01-05", "2008-01-05", "2008-01-09", "2008-01-09",
    "2008-01-10", "2008-01-11", "2008-01-12", "2008-01-13", "2008-01-17",
    "2008-01-18", "2008-01-18", "2008-01-21", "2008-01-27", "2008-01-31"
  )),
  amount = c(
    24.4, 7618, 21971, 5245, 2921, 8000, 169.2, 71.5, 14.6, 4214, 14.6, 13920,
    14.6, 24640, 1600, 261.1, 16400, 3500, 2700, 19882, 182, 14.6, 16927, 25653,
    3059, 2880, 9658, 4500, 12480, 14.6, 1000, 3679, 34430, 12600, 14.6, 19.2,
    4900, 826, 3679, 2100, 38000, 79, 11400, 21495, 3679, 200, 14.6, 100.6, 3679,
    5300, 108.9, 3679, 2696, 7500, 171.6, 14.6, 99.2, 2452, 3679, 3218, 700, 69.7,
    14.6, 91.5, 2452, 3679, 2900, 17572, 14.6, 14.6, 90.5, 2452, 49752, 3679,
    1900, 14.6, 870, 85.2, 2452, 3679, 1600, 540, 14.6, 14.6, 79, 210, 2452,
    28400, 720, 180, 420, 44289, 489, 3679, 840, 2900, 150, 870, 420, 14.6
  )
)

Data subsetting operations are usually slow. On every pass of the loop (map is an optimized C loop) you have to subset the data completely, which takes a long time. If you reshape the data so that you can do a group_by and mutate, it will be much faster. The tsibble or zoo packages are great for time-based operations. - Adam Sampson
Please create a reproducible example from your data... you can use dput(). - s_baldur
Done, @sindri_baldur - rlock
Good question. I have changed the title to better reflect your problem; feel free to edit it to make it better. - jangorecki
4 Answers

Answer 1 (11 votes)

This is just a non-equi join in data.table. You can create a date - 180 variable and limit the join to between the current date and that variable. It should be fairly fast.

library(data.table)
setDT(dt)[, date_minus_180 := date - 180]
dt[, amnt_6_m := .SD[dt, sum(amount, na.rm = TRUE), 
     on = .(to = from, date <= date, date >= date_minus_180), by = .EACHI]$V1]
head(dt, 10)
#        id from   to       date  amount date_minus_180 amnt_6_m
#  1: 18529 5370 9356 2005-05-31    24.4     2004-12-02      0.0
#  2: 13742 5370 5605 2005-08-05  7618.0     2005-02-06      0.0
#  3:  9913 5370 8567 2005-09-12 21971.0     2005-03-16      0.0
#  4:   956 8605 5370 2005-10-05  5245.0     2005-04-08      0.0
#  5:  2557 5370 5636 2005-11-12  2921.0     2005-05-16   5245.0
#  6:  1602 6390 5370 2005-11-26  8000.0     2005-05-30      0.0
#  7: 18669 5370 8933 2005-11-30   169.2     2005-06-03  13245.0
#  8: 35900 5370 8483 2006-01-31    71.5     2005-08-04  13245.0
#  9: 48667 8934 5370 2006-03-31    14.6     2005-10-02      0.0
# 10: 51341 5370 7626 2006-04-11  4214.0     2005-10-13   8014.6
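
A quick sanity check (a sketch): on this small sample, the naive map2_dbl() computation from the question should agree with the non-equi join, since both sum incoming amounts over the inclusive window [date - 180, date]:

library(purrr)
# Recompute the rolling sum row by row, the slow way, and compare
check <- map2_dbl(dt$from, dt$date,
                  ~ sum(dt$amount[dt$to == .x &
                                  dt$date >= .y - 180 & dt$date <= .y]))
all.equal(dt$amnt_6_m, check)
# expected: TRUE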

This is a better solution than my answer - same results with less fuss. - bcarlsen
Thanks, this worked like magic and finished in just 1 second! - rlock
Brilliant solution - jangorecki
Hi David! Could you please take a look at the new question I posted? @DavidArenburg, I am running into a similar problem there. - rlock

Answer 2 (3 votes)

Here is one option using data.table:

library(data.table)
setDT(df)
setkey(df, to, date)

# Unique combination of from and date
af <- df[, unique(.SD), .SDcols = c("from", "date")]

# For each combination check sum of incoming in the last 6 months
for (i in 1:nrow(af)) {
  set(
    af, i = i, j = "am6m", 
    value = df[(date) %between% (af$date[[i]] - c(180, 0)) & to == af$from[[i]], sum(amount)]
  )
}
# Join the results into the main data.frame
df[, am6m := af[.SD, on = .(from, date), am6m]]



> tail(df)
#        id from   to       date  amount    am6m
# 1:  18529 5370 9356 2005-05-31    24.4     0.0
# 2: 258484 5370 9499 2008-01-09   720.0 74543.5
# 3: 251611 5370 9533 2007-12-31    14.6 46143.5
# 4:  83324 5370 9676 2006-08-31   261.1 40203.8
# 5: 203763 5370 9689 2007-08-31    14.6 92353.1
# 6: 103444 5370 9772 2006-11-08 16927.0 82671.2

It ran for almost 45 minutes and did not look like it would finish any time soon. - rlock
I cancelled the code after it had run for an hour. Given that I need to run 30 similar chunks, that is a huge amount of time. - rlock

Answer 3 (3 votes)

Here is an option using window functions.

However, they need complete daily data to work, so the amount of memory required may be large (every key must have a row for every day).

Note too that this approach only pays off for large datasets, or when the calculation has to be done directly on a database: a lot of time goes into preparing the raw data to remove the gaps, and joining the results back at the end also takes time.

The slide function, however, stays relatively fast as the data grows, unlike subsetting, where the time required grows with the size of the data being subset.

library(tidyverse)
library(tsibble)

# Calculate the 6 month window
six_mo_rollup <- data %>% 
  ## NOTE: You have to deal with duplicates somehow...either remove
  ## false duplicates or make them not duplicates...
  # We can get a unique from/date combo by summing since we need
  # to sum anyway.
  group_by(from,date) %>%
  summarise(amount = sum(amount),
            .groups = "keep") %>%
  ungroup() %>%
  # Now that each from/date is unique
  # convert data to a tsibble object
  as_tsibble(key = c(from),index = date) %>%
  # window functions can't have any missing time periods...so fill gaps
  # window functions grab 180 rows...not 180 days from the date
  group_by_key() %>%
  fill_gaps(.full = TRUE) %>%
  ungroup() %>%
  # arrange data from lowest to highest so slide can work right.
  arrange(date) %>%
  group_by(from) %>%
  mutate(
    six_mo_sum = slide_dbl(
      amount,
      sum,
      na.rm = TRUE, 
      .size = 180, 
      .align = "right"
    )
  ) %>%
  ungroup() %>%
  # any row without amount was created by fill_gaps in the example
  # so we can drop those rows to save space
  filter(!is.na(amount))

six_mo_rollup %>% filter(from == "5370")
# # A tsibble: 41 x 4 [1D]
# # Key:       from [1]
# from  date        amount six_mo_sum
#  <chr>  <date>      <dbl>      <dbl>
# 1 5370  2005-05-31    24.4        NA 
# 2 5370  2005-08-05  7618          NA 
# 3 5370  2005-09-12 21971          NA 
# 4 5370  2005-11-12  2921          NA 
# 5 5370  2005-11-30   169.      32679.
# 6 5370  2006-01-31    71.5     32751.
# 7 5370  2006-04-11  4214        7376.
# 8 5370  2006-08-31   261.       4475.
# 9 5370  2006-10-31   182         443.
# 10 5370  2006-11-08 16927       17370.
# # ... with 31 more rows

# Join the windowed data to the original dataset
data <- data %>%
  left_join(
    six_mo_rollup %>% select(from,date,six_mo_sum),
    by = c("from","date")
  )

Update:

In the comments it became clear that you want the sum received by each account, i.e. keyed on to (I had misunderstood this at first). The update changes all the aggregations to use to instead of from.

Also, you want values even where a full 6 months of data does not yet exist, so you need to add .partial = TRUE.

# Calculate the 6 month window
six_mo_rollup <- data %>% 
  ## NOTE: You have to deal with duplicates somehow...either remove
  ## false duplicates or make them not duplicates...
  # We can get a unique to/date combo by summing since we need
  # to sum anyway.
  group_by(to,date) %>%
  summarise(amount = sum(amount),
            .groups = "keep") %>%
  ungroup() %>%
  # Now that each to/date is unique
  # convert data to a tsibble object
  as_tsibble(key = c(to),index = date) %>%
  # window functions can't have any missing time periods...so fill gaps
  # window functions grab 180 rows...not 180 days from the date
  group_by_key() %>%
  fill_gaps(.full = TRUE) %>%
  ungroup() %>%
  # arrange data from lowest to highest so slide can work right.
  arrange(date) %>%
  group_by(to) %>%
  mutate(
    six_mo_sum = slide_dbl(
      amount,
      sum,
      na.rm = TRUE, 
      .size = 180, 
      .align = "right",
      .partial = TRUE
    )
  ) %>%
  ungroup() %>%
  # any row without amount was created by fill_gaps in the example
  # so we can drop those rows to save space
  filter(!is.na(amount))

six_mo_rollup %>% filter(to == "5370")
# # A tsibble: 50 x 4 [1D]
# # Key:       to [1]
# to    date        amount six_mo_sum
# <chr> <date>       <dbl>      <dbl>
# 1 5370  2005-10-05  5245        5245 
# 2 5370  2005-11-26  8000       13245 
# 3 5370  2006-03-31    14.6     13260.
# 4 5370  2006-04-30    14.6      8029.
# 5 5370  2006-05-28 13920       13949.
# 6 5370  2006-05-31    14.6     13964.
# 7 5370  2006-06-10 24640       38604.
# 8 5370  2006-06-15  1600       40204.
# 9 5370  2006-09-09 16400       56604.
# 10 5370  2006-09-13  3500       60104.
# # ... with 40 more rows

# Join the windowed data to the original dataset
data <- data %>%
  left_join(
    six_mo_rollup %>% select(to,date,six_mo_sum),
    by = c("from" = "to","date" = "date")
  )
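
As a quick spot check (a sketch) against the sample data: the six_mo_sum for to == "5370" on 2006-06-10 should equal a direct subset sum over the trailing 180 days, with the current day included (the window is 180 daily rows, right-aligned):

# Direct subset over the trailing 180 days, same-day included
d <- as.Date("2006-06-10")
sum(data$amount[data$to == "5370" & data$date > d - 180 & data$date <= d])
# 38603.8, matching the six_mo_sum printed above (38604.)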

Is there an error in the calculation? six_mo_sum should be 0, 0, 0, 5245.0, 13245.0, 13245.0, 8014.6, 40203.8, 82671.2, 82671.2 in the given date order. - rlock
The slide function returns NA by default when 6 months of data are not available. For from == "5370" the first date is 2005-05-31, so any date before 2005-11-27 receives an NA value for that from. - Adam Sampson
Also, I did not know how you intended to handle the "to" values or duplicate from/date combinations, so I aggregated and summed each from/date combination. If you need them handled differently, it will need to be adjusted accordingly. - Adam Sampson
Ah! You are summing how much the account received, not how much it transferred out. - Adam Sampson
Post updated. - Adam Sampson

Answer 4 (2 votes)
A 1M record dataset is small enough not to require parallelization. There are many approaches that look "right" but aren't... be careful!

First, you may wonder why your original approach is slow. R is an interpreted array language: to do anything with acceptable performance, you must pass vectors to fast functions that are precompiled in a lower-level language. If you instead "map" a function over a dataset element by element, you lose most of the advantage of vectorization - purrr::map, base::lapply and the like have performance roughly comparable to a preallocated for loop, i.e. not great. You are making over a million individual function calls (one per record); parallelizing this can only improve performance by a constant factor, minus the overhead.
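To see the difference, here is a minimal sketch comparing one vectorized call with the equivalent element-by-element calls (exact timings vary by machine, but expect a gap of roughly two orders of magnitude):

x <- runif(1e6)
system.time(y1 <- x * 2)                                     # one call into precompiled code
system.time(y2 <- vapply(x, function(v) v * 2, numeric(1)))  # ~1e6 R-level function calls
identical(y1, y2)  # TRUE - same result, very different cost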
To your clarifying questions:

- Is each account limited to one transaction per day, or can there be multiple transactions on any given day? I assume the latter: multiple transactions on the same day are possible.
- "the amount that the account in the from column received in the last 6 months prior to the date the transaction took place" - I assume this means "ignore transactions made on the same date as the transaction attached to this field", since there is no way to determine when during that day those transactions were executed.

My approach: first sum by account and date, then compute a rolling sum over dates, then join that onto the subsequent date.
install.packages("RcppRoll") # for roll_sum()
install.packages(tidyr)      # for complete()

library(dplyr)

start_date <- as.Date("2018-01-01")
end_date <- as.Date("2020-01-01")
window_size <- 180L

# your example dataset is way too small to assess performance.
# Here is a 100k record dataset.

big_data <- tibble(
  from = as.factor(sapply(1:1000L, function(x) sample(1:100L,100, replace = F))),
  to = as.factor(sapply(1:1000L, function(x) sample(1:100L,100, replace = F))),
  amount = sample(1:10000, 100000, replace = TRUE),
  date = sample(seq.Date(from = start_date, to = end_date, by = "days"), 100000, replace = TRUE)
) %>%
  arrange(date) %>%
  mutate(id = row_number()) %>% 
  ungroup()

# calculate daily sum of values from PRECEDING day for join
daily_summary <- big_data %>%
  group_by(to, date) %>%
  summarize(daily_sum = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  # backfill empty records for data going back 6 months from start
  # this is needed because roll_sum() has no partial mode implemented.
  # and populate missing account - date combinations
  complete(date = seq.Date(from = start_date - window_size, to = end_date, by = "days"), to, fill = list(daily_sum = 0)) %>%
  group_by(to) %>%
  arrange(date) %>%
  mutate(
    total_trx_amount_received_in_last_sixmonth = RcppRoll::roll_sum(daily_sum, align = "right", n = window_size, fill = NA),
    date = date + 1
  ) %>%
  filter(date >= start_date) %>%
  select(date = date, from = to, total_trx_amount_received_in_last_sixmonth)

results <- left_join(big_data, daily_summary, by = c("from", "date"))
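
A quick spot check (a sketch): for an arbitrary row, the joined value should equal a direct subset sum over the 180 days preceding that transaction's date (same-day transactions are excluded because of the date + 1 shift above):

i <- 12345L  # any row index
acct <- as.character(results$from[[i]])
d <- results$date[[i]]
# Sum incoming amounts over [d - 180, d - 1], same-day excluded
manual <- sum(big_data$amount[as.character(big_data$to) == acct &
                              big_data$date >= d - window_size &
                              big_data$date < d])
manual == results$total_trx_amount_received_in_last_sixmonth[[i]]
# expected: TRUE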

Now, what about performance? Much better than what you reported, at least. For a 100k record dataset (100 accounts, 2 years of information) it takes 0.6 seconds on my laptop. For a 1 million record dataset (1,000 accounts, 2 years of information) it takes 7-8 seconds, measured with microbenchmark. Possibly not the most efficient approach, but quite acceptable considering I did no optimization and did not use data.table, which is usually the tool for high-performance two-dimensional operations in R.
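For reference, a minimal timing harness with microbenchmark; this assumes you wrap the daily_summary/left_join pipeline above in a helper function (here a hypothetical rollup_6m()):

library(microbenchmark)
# rollup_6m() is a hypothetical wrapper around the pipeline above
microbenchmark(rollup_6m(big_data), times = 5L)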
Grouping with dplyr still means we call the fast precompiled function RcppRoll::roll_sum() once per account, which is not ideal from a performance standpoint, but at least we call it once per account rather than once per individual record. You could also look at the single-pass rolling window functions implemented in the RollingWindow package, as they may be even faster.
