Summarising consecutive failures with dplyr and rle

Question

3

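I'm trying to build a churn model that includes, for each customer, the maximum number of consecutive UX failures, and I'm stuck. Here is my simplified data and the desired output: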

library(dplyr)
df <- data.frame(customerId = c(1,2,2,3,3,3), date = c('2015-01-01','2015-02-01','2015-02-02', '2015-03-01','2015-03-02','2015-03-03'),isFailure = c(0,0,1,0,1,1))
> df
  customerId       date isFailure
1          1 2015-01-01         0
2          2 2015-02-01         0
3          2 2015-02-02         1
4          3 2015-03-01         0
5          3 2015-03-02         1
6          3 2015-03-03         1

Desired result:

> desired.df
  customerId maxConsecutiveFailures
1          1                      0
2          2                      1
3          3                      2

I'm flailing a bit, and looking through other rle questions hasn't helped me. This is roughly what I "expected" the solution to look like:

df %>% 
  group_by(customerId) %>%
  summarise(maxConsecutiveFailures = 
    max(rle(isFailure[isFailure == 1])$lengths))

- Jack Case

A base R option is

sapply(split(df$isFailure, df$customerId), function(x) {tmp <- with(rle(x==1), lengths[values]); if(length(tmp)==0) 0 else max(tmp)})

- akrun

Another option, using data.table, is

setDT(df)[, {tmp <- rleid(isFailure)*isFailure; tmp2 <- table(tmp[.N==1|tmp!=0]); max((names(tmp2)!=0)*tmp2)}, customerId][]

- akrun

2 Answers

0

Here is my attempt, using only standard dplyr functions:

df %>% 
  # grouping key(s):
  group_by(customerId) %>%
  # check if there is any value change
  # if yes, a new sequence id is generated through cumsum
  mutate(last_one = lag(isFailure, 1, default = 100), 
         not_eq = last_one != isFailure, 
         seq = cumsum(not_eq)) %>% 
  # the following is just to find the largest sequence
  count(customerId, isFailure, seq) %>% 
  group_by(customerId, isFailure) %>% 
  summarise(max_consecutive_event = max(n))

Output:

# A tibble: 5 x 3
# Groups:   customerId [3]
  customerId isFailure max_consecutive_event
       <dbl>     <dbl>                 <int>
1          1         0                     1
2          2         0                     1
3          2         1                     1
4          3         0                     1
5          3         1                     2

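Finally, filter on the isFailure value to get the desired result (customers with zero failures then need to be added back in).

The script accepts any values in the isFailure column and computes, for each value, the maximum number of consecutive days on which it occurs.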
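
The final filtering step described above is not shown in the answer. One way it might look is sketched below. It assumes dplyr 1.0 or later (for `.groups`); the names `runs`, `run_id` and `result` are made up for this illustration, and customers without any failure are restored with a left join and `coalesce`.

```r
library(dplyr)

df <- data.frame(
  customerId = c(1, 2, 2, 3, 3, 3),
  isFailure  = c(0, 0, 1, 0, 1, 1)
)

# Same idea as the pipeline above: start a new run id whenever the value changes.
runs <- df %>%
  group_by(customerId) %>%
  mutate(run_id = cumsum(isFailure != lag(isFailure, default = 100))) %>%
  ungroup() %>%
  count(customerId, isFailure, run_id) %>%
  group_by(customerId, isFailure) %>%
  summarise(max_consecutive_event = max(n), .groups = "drop")

# Keep only the failure runs, then join back onto the full customer list so
# that customers with no failures appear with a count of 0 (assumed behaviour).
result <- df %>%
  distinct(customerId) %>%
  left_join(filter(runs, isFailure == 1), by = "customerId") %>%
  transmute(customerId,
            maxConsecutiveFailures = coalesce(max_consecutive_event, 0L))

print(result)
```

With the sample data this should give one row per customer, matching the desired output in the question.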

- Шура Ву

- akrun · Accepted Answer

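We group by "customerId" and use do to run rle on the "isFailure" column. We extract the lengths for which values is TRUE (lengths[values]) and create the "Max" column, using an if/else condition to return 0 for customers that have no 1 values.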

 df %>%
    group_by(customerId) %>%
    do({tmp <- with(rle(.$isFailure==1), lengths[values])
     data.frame(customerId= .$customerId, Max=if(length(tmp)==0) 0 
                    else max(tmp)) }) %>% 
     slice(1L)
#   customerId Max
#1          1   0
#2          2   1
#3          3   2
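
In dplyr 1.0 and later, do() is marked as superseded, so the same rle idea can also be written with summarise. The following is only a sketch of that variant, not part of the accepted answer: the helper `max_run_of_ones` is a hypothetical name, and it assumes isFailure has no missing values and that rows should be ordered by date within each customer.

```r
library(dplyr)

# Hypothetical helper: length of the longest run of 1s in x, or 0 if none.
max_run_of_ones <- function(x) {
  runs <- rle(x == 1)                # runs of TRUE (failure) and FALSE
  lens <- runs$lengths[runs$values]  # keep only the lengths of the TRUE runs
  if (length(lens) == 0) 0L else max(lens)
}

df <- data.frame(
  customerId = c(1, 2, 2, 3, 3, 3),
  date = as.Date(c("2015-01-01", "2015-02-01", "2015-02-02",
                   "2015-03-01", "2015-03-02", "2015-03-03")),
  isFailure = c(0, 0, 1, 0, 1, 1)
)

res <- df %>%
  arrange(customerId, date) %>%      # runs only make sense in date order
  group_by(customerId) %>%
  summarise(maxConsecutiveFailures = max_run_of_ones(isFailure))

print(res)
```

Keeping the rle logic in a small function makes the empty case (no failures at all) explicit, which is the case that trips up a bare max() over an empty vector.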