Lag in R when a condition is met

Question

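I have a data frame containing only the checkup date and the infection status (yes/no), and I want to add a third column giving the date of the last infection. If the patient has no previous infection on record, the new "last_infection" column should be NA. If they have been infected before, it should show the date of the most recent test that came back "yes".

I would like the output to look like this: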

date      infection   last_infection
01-01-18  no          NA
06-01-18  no          NA
07-01-18  yes         NA
09-01-18  no          07-01-18
01-01-19  no          07-01-18
02-01-19  yes         07-01-18
03-01-19  yes         02-01-19
04-01-19  no          03-01-19
05-01-19  no          03-01-19

How can I do this in R? Can a function like lag() check a condition, or do I need a completely different approach?

- bob

2 Answers

We can build a grouping variable from the logical vector infection == "yes" and use it to lag the column. Only dplyr is loaded here, no other packages.

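library(dplyr)
df1 %>%
   # grp increases by 1 at every "yes" row; rows before the first "yes" are group 0
   group_by(grp = cumsum(infection == "yes")) %>%
   # for groups 1 and up, first(date) is the date of the "yes" that opened the group
   mutate(new = first(date)) %>%
   ungroup %>%
   # shift down one row, then set rows up to and including the first "yes" to NA
   mutate(new = replace(lag(new), seq_len(match(1, grp)), NA)) %>%
   select(-grp)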
# A tibble: 9 x 4
#  date     infection last_infection new     
#  <chr>    <chr>     <chr>          <chr>   
#1 01-01-18 no        <NA>           <NA>    
#2 06-01-18 no        <NA>           <NA>    
#3 07-01-18 yes       <NA>           <NA>    
#4 09-01-18 no        07-01-18       07-01-18
#5 01-01-19 no        07-01-18       07-01-18
#6 02-01-19 yes       07-01-18       07-01-18
#7 03-01-19 yes       02-01-19       02-01-19
#8 04-01-19 no        03-01-19       03-01-19
#9 05-01-19 no        03-01-19       03-01-19

Data

df1 <- structure(list(date = c("01-01-18", "06-01-18", "07-01-18", "09-01-18", 
"01-01-19", "02-01-19", "03-01-19", "04-01-19", "05-01-19"), 
    infection = c("no", "no", "yes", "no", "no", "yes", "yes", 
    "no", "no"), last_infection = c(NA, NA, NA, "07-01-18", "07-01-18", 
    "07-01-18", "02-01-19", "03-01-19", "03-01-19")),
    class = "data.frame", row.names = c(NA, 
-9L))

- akrun

Great, thanks @akrun! Could you explain how the first(date) part works? - bob

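@kss What happens is that the value of 'grp' increases by 1 every time a "yes" occurs in the 'infection' column. So when we group_by, the first observation of 'date' for 'grp' 1 is at row 3, and the rows above it are grp 0 (because 'infection' is all "no" there). That is why I used first. Afterwards, we replace the leading elements, up to and including the first "yes" (rows 1 to 3 here), with NA. - akrun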
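To make the explanation above concrete, here is a small sketch that keeps the intermediate columns instead of overwriting them. The column names group_first and lagged are made up for illustration and do not appear in the answer; the data is the sample from the question without the last_infection column.

```r
library(dplyr)

# Sample data from the question (date and infection only)
df1 <- data.frame(
  date = c("01-01-18", "06-01-18", "07-01-18", "09-01-18", "01-01-19",
           "02-01-19", "03-01-19", "04-01-19", "05-01-19"),
  infection = c("no", "no", "yes", "no", "no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

steps <- df1 %>%
  # running count of "yes" rows: 0 0 1 1 1 2 3 3 3
  mutate(grp = cumsum(infection == "yes")) %>%
  group_by(grp) %>%
  # first date within each group (hypothetical column name)
  mutate(group_first = first(date)) %>%
  ungroup() %>%
  # shift down one row (hypothetical column name)
  mutate(lagged = lag(group_first))

steps
```

In the lagged column, rows 2 and 3 still carry "01-01-18", which comes from group 0 and is not an infection date. That leftover is what the replace() call in the answer sets to NA.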

Got it, that is very helpful. Thank you. - bob

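Not sure why there is a downvote here. I showed an approach that needs no additional external packages. I remember another downvote on a different question here because other answers showed some edge cases. If the same person is doing the downvoting, I would report it, because the downvote here is unnecessary. Suppose someone comes up with a one-liner tomorrow: would we downvote the other solutions? - akrun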

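If a patient has never been infected, this code throws an error because there is no grp = 1. Is there a way to fix this, @akrun? I would want all NAs in the "new" column. - bob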

@kss In that case, I would do

df1 %>% group_by(grp = cumsum(infection == "yes")) %>% mutate(new = if(any(grp > 0)) first(date) else NA) %>% ungroup

- akrun
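The one-liner in the comment above stops before the lag step. A minimal sketch of how the guard could be combined with lag() follows; the function name add_last_infection and the data frame never are hypothetical, and it assumes 'date' is a character column as in the posted data (a Date column would need a different NA).

```r
library(dplyr)

# Hypothetical helper, not part of the original answer
add_last_infection <- function(df) {
  df %>%
    group_by(grp = cumsum(infection == "yes")) %>%
    # group 0 holds the rows before any "yes", so it gets NA instead of a date
    mutate(new = if (any(grp > 0)) first(date) else NA_character_) %>%
    ungroup() %>%
    # shift down one row so a "yes" row shows the previous infection, not itself
    mutate(new = lag(new)) %>%
    select(-grp)
}

# Sample data from the question (date and infection only)
df1 <- data.frame(
  date = c("01-01-18", "06-01-18", "07-01-18", "09-01-18", "01-01-19",
           "02-01-19", "03-01-19", "04-01-19", "05-01-19"),
  infection = c("no", "no", "yes", "no", "no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# A patient who never tested positive (made-up data)
never <- data.frame(
  date = c("01-01-18", "06-01-18", "07-01-18"),
  infection = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

add_last_infection(df1)
add_last_infection(never)
```

Because group 0 is already NA before the shift, no replace() with match(1, grp) is needed, which is the call that fails when there is no "yes" at all.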

- Peter H. · Accepted Answer

I would suggest something like this. If you use the fill function from the tidyr package, there is no need for cumsum or grouping.

library(tidyverse)

df %>% 
  mutate(
    last_infection = if_else(lag(infection) == "yes", lag(date), NA_character_)
  ) %>% 
  fill(last_infection)
#> # A tibble: 9 x 3
#>   date     infection last_infection
#>   <chr>    <chr>     <chr>         
#> 1 01-01-18 no        <NA>          
#> 2 06-01-18 no        <NA>          
#> 3 07-01-18 yes       <NA>          
#> 4 09-01-18 no        07-01-18      
#> 5 01-01-19 no        07-01-18      
#> 6 02-01-19 yes       07-01-18      
#> 7 03-01-19 yes       02-01-19      
#> 8 04-01-19 no        03-01-19      
#> 9 05-01-19 no        03-01-19

Created on 2020-01-25 by the reprex package (v0.3.0)
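As a quick check of the edge case raised in the comments on the other answer, the sketch below wraps the same lag-and-fill pipeline in a function and runs it on a patient with no positive test. The wrapper name last_infection_fill and the data frame never are made up for illustration; dplyr and tidyr are loaded directly instead of the whole tidyverse.

```r
library(dplyr)
library(tidyr)

# Hypothetical wrapper around the pipeline shown in this answer
last_infection_fill <- function(df) {
  df %>%
    mutate(
      # keep the previous row's date only when the previous row was a "yes"
      last_infection = if_else(lag(infection) == "yes", lag(date), NA_character_)
    ) %>%
    # carry the last non-NA value downward
    fill(last_infection)
}

# A patient who never tested positive (made-up data)
never <- data.frame(
  date = c("01-01-18", "06-01-18", "07-01-18"),
  infection = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

last_infection_fill(never)
```

With no "yes" rows, if_else() never selects a date and fill() has nothing to carry forward, so last_infection stays NA in every row without any special handling.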