根据 R 中的条件计算日期之间的平均差异

4

我有这个数据集:

Date New_Renew
 2019-01-10 22:11:16  Renewing
 2019-02-23 00:21:48  Renewing
 2019-03-05 05:26:17  Renewing
 2019-04-18 15:05:10       NEW
 2019-04-18 15:07:52       NEW
 2019-04-26 11:32:25  Renewing
 2019-05-03 14:15:25  Renewing
 2019-05-08 21:10:08       NEW
 2019-05-16 13:35:57  Renewing
 2019-05-24 13:18:23  Renewing
 2019-06-01 12:42:21  Renewing
 2019-06-17 18:08:09  Renewing
 2019-06-26 13:40:29  Renewing
 2019-12-13 17:57:43  Renewing
 2020-01-03 11:49:14  Renewing
 2020-01-11 11:46:51  Renewing
 2020-01-14 21:08:08       NEW
 2020-01-18 21:14:30       NEW
 2020-01-21 16:08:37       NEW
 2020-01-28 11:41:44  Renewing
 2020-01-30 13:34:21  Renewing
 2020-02-03 13:29:37  Renewing
 2020-02-18 17:15:52  Renewing
 2020-02-20 13:37:52  Renewing
 2020-02-24 12:55:25  Renewing
 2020-02-26 21:13:38       NEW
 2020-03-04 13:23:41  Renewing
 2020-03-09 16:48:36  Renewing

我希望的是,当New_Renew变量等于NEW时,计算与NEW相关联的日期之间差异的平均值。简单来说,用户进行NEW交易的频率有多高。

你说的“NEW相关日期之间的平均差”是什么意思?每个NEW实例只有一个关联日期,所以对于第一个NEW实例,您想要的差异是关联日期和什么之间的差异? - Greg
例如,计算第一和第二个新观测值之间的差异,然后计算第二和第三个新观测值之间的差异,以此类推。最后,我想要这些差异的平均值。 - Alex
对于你的“简单来说”的问题:“用户有多频繁进行新交易?”答案也可以是:“每4次交易中有1次”,因为用户在28次交易中进行了7次新交易。或者,如果考虑整个时间段,用户在424天内进行了7次新交易。这大约是每60天进行1次新交易。这比平均值52天要长,因为第一次和最后一次新交易之前的时间段没有计算在内。只是随便想想。 - Edward
4个回答

2
使用aggregatediff60 * 24将产生的秒数转换为天数。
aggregate(Date ~ New_Renew, dat, function(x) mean(diff(x))/(60*24))
#  New_Renew         Date
# 1       NEW 52.38292438 
# 2  Renewing  0.01471444 

或者使用 tapply 函数。
with(dat, tapply(Date, New_Renew, function(x) mean(diff(x))/(60*24)))
#         NEW    Renewing 
# 52.38292438  0.01471444 

数据

dat <- structure(list(Date = structure(c(1547154676, 1550877708, 1551759977, 
1555592710, 1555592872, 1556271145, 1556885725, 1557342608, 1558006557, 
1558696703, 1559385741, 1560787689, 1561549229, 1576256263, 1578048554, 
1578739611, 1579032488, 1579378470, 1579619317, 1580208104, 1580387661, 
1580732977, 1582042552, 1582202272, 1582545325, 1582748018, 1583324621, 
1583768916), class = c("POSIXct", "POSIXt"), tzone = ""), New_Renew = c("Renewing", 
"Renewing", "Renewing", "NEW", "NEW", "Renewing", "Renewing", 
"NEW", "Renewing", "Renewing", "Renewing", "Renewing", "Renewing", 
"Renewing", "Renewing", "Renewing", "NEW", "NEW", "NEW", "Renewing", 
"Renewing", "Renewing", "Renewing", "Renewing", "Renewing", "NEW", 
"Renewing", "Renewing")), row.names = c(NA, -28L), class = "data.frame")

2

使用 dplyr

library(dplyr)
df %>%
  filter(New_Renew=="NEW") %>%
  mutate(Date_next=lead(Date), duration=Date_next-Date) %>%
  summarise(mean=mean(duration, na.rm=TRUE))

# A tibble: 1 x 1
  mean         
  <drtn>       
1 75421.41 mins # ~ 52 days

总结前:

# A tibble: 7 x 4
  Date                New_Renew Date_next           duration       
  <dttm>              <chr>     <dttm>              <drtn>         
1 2019-04-18 15:05:10 NEW       2019-04-18 15:07:52      2.700 mins
2 2019-04-18 15:07:52 NEW       2019-05-08 21:10:08  29162.267 mins
3 2019-05-08 21:10:08 NEW       2020-01-14 21:08:08 361438.000 mins
4 2020-01-14 21:08:08 NEW       2020-01-18 21:14:30   5766.367 mins
5 2020-01-18 21:14:30 NEW       2020-01-21 16:08:37   4014.117 mins
6 2020-01-21 16:08:37 NEW       2020-02-26 21:13:38  52145.017 mins
7 2020-02-26 21:13:38 NEW       NA                          NA mins

1
先创建最小可复现示例。
DT <- read.table(text = 'Date, New_Renew
 2019-01-10 22:11:16,Renewing
 2019-02-23 00:21:48,Renewing
 2019-03-05 05:26:17,Renewing
 2019-04-18 15:05:10,NEW
 2019-04-18 15:07:52,NEW
 2019-04-26 11:32:25,Renewing
 2019-05-03 14:15:25,Renewing
 2019-05-08 21:10:08,NEW
 2019-05-16 13:35:57,Renewing
 2019-05-24 13:18:23,Renewing
 2019-06-01 12:42:21,Renewing
 2019-06-17 18:08:09,Renewing
 2019-06-26 13:40:29,Renewing
 2019-12-13 17:57:43,Renewing
 2020-01-03 11:49:14,Renewing
 2020-01-11 11:46:51,Renewing
 2020-01-14 21:08:08,NEW
 2020-01-18 21:14:30,NEW
 2020-01-21 16:08:37,NEW
 2020-01-28 11:41:44,Renewing
 2020-01-30 13:34:21,Renewing
 2020-02-03 13:29:37,Renewing
 2020-02-18 17:15:52,Renewing
 2020-02-20 13:37:52,Renewing
 2020-02-24 12:55:25,Renewing
 2020-02-26 21:13:38,NEW
 2020-03-04 13:23:41,Renewing
 2020-03-09 16:48:36,Renewing', 
                 sep = ',', 
                 header = T)

然后,我们将使用xts对日期进行方便的计算。for循环很简单,存储第一个续订并从第一个新续订中减去。
library(data.table)
library(xts)
library(lubridate)
library(tbl2xts)

df <- xts(DT, order.by = ymd_hms(DT$Date))

new_items <- which(DT$New_Renew=="NEW")

dif <- DT

dif$difference <- NA

renewal <- 0

for (i in 1:nrow(df)){

  if (df[i,2]=='Renewing' & renewal == 0){
    renewal <- i
  } else if (df[i,2]=='Renewing' & renewal != 0){
    next
  } else if (df[i, 2]=='NEW' & renewal != 0) {
    dif[i, 'difference'] <- index(df[i, 2]) - index(df[renewal, 2])
    renewal <- i
  } else {
    dif[i, 'difference'] <- index(df[i, 2]) - index(df[renewal, 2])
    renewal <- 0
  }

}

mean_diff <- mean(dif$difference, na.rm = T)

1
一种使用 data.table 的选项:
DT[, Date := as.POSIXct(Date, format="%Y-%m-%d %T")][
    New_Renew=="NEW", mean(diff(Date))]

数据:

library(data.table)
DT <- fread("Date,New_Renew
2019-01-10 22:11:16,Renewing
2019-02-23 00:21:48,Renewing
2019-03-05 05:26:17,Renewing
2019-04-18 15:05:10,     NEW
2019-04-18 15:07:52,     NEW
2019-04-26 11:32:25,Renewing
2019-05-03 14:15:25,Renewing
2019-05-08 21:10:08,     NEW
2019-05-16 13:35:57,Renewing
2019-05-24 13:18:23,Renewing
2019-06-01 12:42:21,Renewing
2019-06-17 18:08:09,Renewing
2019-06-26 13:40:29,Renewing
2019-12-13 17:57:43,Renewing
2020-01-03 11:49:14,Renewing
2020-01-11 11:46:51,Renewing
2020-01-14 21:08:08,     NEW
2020-01-18 21:14:30,     NEW
2020-01-21 16:08:37,     NEW
2020-01-28 11:41:44,Renewing
2020-01-30 13:34:21,Renewing
2020-02-03 13:29:37,Renewing
2020-02-18 17:15:52,Renewing
2020-02-20 13:37:52,Renewing
2020-02-24 12:55:25,Renewing
2020-02-26 21:13:38,     NEW
2020-03-04 13:23:41,Renewing
2020-03-09 16:48:36,Renewing")

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接