Replace column values in a data frame based on other columns

Question

3

I have a data frame sorted by name and time.

set.seed(100)
df <- data.frame('name' = c(rep('x', 6), rep('y', 4)), 
                 'time' = c(rep(1, 2), rep(2, 3), 3, 1, 2, 3, 4),
                 'score' = c(0, sample(1:10, 3), 0, sample(1:10, 2), 0, sample(1:10, 2))
                 )
> df
   name time score
1     x    1     0
2     x    1     4
3     x    2     3
4     x    2     5
5     x    2     0
6     x    3     1
7     y    1     5
8     y    2     0
9     y    3     5
10    y    4     8

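df$score contains zeros, each followed by an unknown number of actual values, e.g. df[1:4,]. Sometimes the rows between two df$score == 0 entries span more than one df$name, e.g. df[6:7,].

I want to change df$time wherever df$score != 0. Specifically, I want to assign the time value of the closest preceding row with df$score == 0, but only if df$name matches.

The following code produces the correct output, but my data has millions of rows, so this solution is very inefficient.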

score_0 <- append(which(df$score == 0), dim(df)[1] + 1)

for(i in 1:(length(score_0) - 1)) {
  df$time[score_0[i]:(score_0[i + 1] - 1)] <-
    ifelse(df$name[score_0[i]:(score_0[i + 1] - 1)] == df$name[score_0[i]], 
           df$time[score_0[i]], 
           df$time[score_0[i]:(score_0[i + 1] - 1)])
 }

> df
   name time score
1     x    1     0
2     x    1     4
3     x    1     3
4     x    1     5
5     x    2     0
6     x    2     1
7     y    1     5
8     y    2     0
9     y    2     5
10    y    2     8

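score_0 holds the index positions where df$score == 0. We can see that df$time[2:4] are now all equal to 1, and that in df$time[6:7] only the first value changed, because the second has df$name == 'y' while the closest preceding row with df$score == 0 has df$name == 'x'. The last two rows were also changed correctly.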

- JPh

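If df[7, "time"] were **2**, would it be changed to 1 because it is the first entry with name == y, or would you leave it unchanged? - M--

@Masoud, you can leave it unchanged, because df$name does not match that of the closest preceding row with df$score == 0. - JPh

Just a suggestion: when using sample or other random functions, call set.seed so that everyone gets the same output. Cheers, and welcome to the community. - M--

@Masoud, thanks for the set.seed() tip and the neat answer! - JPh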
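
The rule described in the question (take the time of the closest preceding zero-score row, but only when the name matches) can also be written without a loop. The following is a minimal base R sketch, not the method of either answer. The names df_vec, zero_idx, prev, src and hit are made up for illustration, and the scores are copied from the data frame printed in the question so that no random numbers are involved.

```r
# Hypothetical copy of the question's data with the printed scores hard-coded
df_vec <- data.frame(
  name  = c(rep("x", 6), rep("y", 4)),
  time  = c(1, 1, 2, 2, 2, 3, 1, 2, 3, 4),
  score = c(0, 4, 3, 5, 0, 1, 5, 0, 5, 8)
)

# Row numbers where score is zero
zero_idx <- which(df_vec$score == 0)

# For each row, count how many zero rows sit at or above it
prev <- findInterval(seq_len(nrow(df_vec)), zero_idx)
prev[prev == 0] <- NA          # rows with no zero above them have no source row

# Row number of the closest zero row at or above each row
src <- zero_idx[prev]

# Only overwrite time where a source row exists and the names agree
hit <- !is.na(src) & df_vec$name[src] == df_vec$name
df_vec$time[hit] <- df_vec$time[src[hit]]

df_vec$time
```

Row 7 is left alone here because its closest zero row (row 5) belongs to name x, which mirrors the behaviour of the loop in the question.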

2 Answers

1

A solution using dplyr and data.table:

library(data.table)
library(dplyr)

df %>%
  mutate(
    chck = score == 0,
    chck_rl = ifelse(score == 0, lead(rleid(chck)), rleid(chck))) %>% 
  group_by(name, chck_rl) %>% mutate(time = first(time)) %>% 
  ungroup() %>% 
  select(-chck_rl, -chck)

Output:

# A tibble: 10 x 3
   name   time score
   <chr> <dbl> <int>
 1 x         1     0
 2 x         1     2
 3 x         1     9
 4 x         1     7
 5 x         2     0
 6 x         2     1
 7 y         1     8
 8 y         2     0
 9 y         2     2
10 y         2     3

A solution using only data.table:

library(data.table)

setDT(df)[, chck_rl := ifelse(score == 0, shift(rleid(score == 0), type = "lead"), 
    rleid(score == 0))][, time := first(time), by = .(name, chck_rl)][, chck_rl := NULL]

Output:

   name time score
 1:    x    1     0
 2:    x    1     2
 3:    x    1     9
 4:    x    1     7
 5:    x    2     0
 6:    x    2     1
 7:    y    1     8
 8:    y    2     0
 9:    y    2     2
10:    y    2     3

- arg0naut91

1

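rleid is a powerful tool in many scenarios, but for this particular problem it is not the most efficient choice; consider cumsum instead. Your data.table solution could be improved that way, but it is already a good implementation. +1 - M--

Thanks a lot for the suggestion, @Masoud. I have always ranked rleid above cumsum, but after some benchmarking it does look like it may not always be the best alternative. - arg0naut91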

- M-- · Accepted Answer

You can do it like this:

library(dplyr)
df %>% group_by(name) %>% mutate(ID=cumsum(score==0)) %>% 
       group_by(name,ID) %>% mutate(time = head(time,1)) %>% 
       ungroup() %>%  select(name,time,score) %>% as.data.frame()

#    name time score
# 1     x    1     0
# 2     x    1     8
# 3     x    1    10
# 4     x    1     6
# 5     x    2     0
# 6     x    2     5
# 7     y    1     4
# 8     y    2     0
# 9     y    2     5
# 10    y    2     9
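
To make the grouping idea behind the cumsum approach easier to see, here is a rough base R sketch of the same two steps using ave. It is an illustration only, not part of the answer above: df_ave is a hypothetical copy of the question's data with the printed scores hard-coded, and the ID column is kept so the blocks can be inspected.

```r
# Hypothetical copy of the question's data with the printed scores hard-coded
df_ave <- data.frame(
  name  = c(rep("x", 6), rep("y", 4)),
  time  = c(1, 1, 2, 2, 2, 3, 1, 2, 3, 4),
  score = c(0, 4, 3, 5, 0, 1, 5, 0, 5, 8)
)

# Step 1: running count of zeros within each name.
# Every zero opens a new block, so rows following it share its ID.
df_ave$ID <- ave(as.integer(df_ave$score == 0), df_ave$name, FUN = cumsum)

# Step 2: within each (name, ID) block, copy the first time to all rows.
df_ave$time <- ave(df_ave$time, df_ave$name, df_ave$ID,
                   FUN = function(t) t[1])

df_ave[, c("name", "time", "score", "ID")]
```

Rows of a name that come before that name's first zero fall into block 0 and also take the first time of that block. With a single such row, as with row 7 here, nothing changes.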