tidyverse: matching specific dates to event periods

4

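I have some dates that I want to match to events for which I only have a start date. As a simplified example, suppose I want to find out who was president when certain events happened, but I only have the inauguration dates.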

library(dplyr)

pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush", 
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"), 
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 
                                           17186), class = "Date"))

events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"), 
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))

Obviously, a simple left_join won't work, because none of these events happened on an inauguration day.
events %>%
      left_join(pres, by = c("date" = "inaugdate"))

In Excel, the vlookup function lets you choose between TRUE (match the closest value) and FALSE (match exactly). Is there something similar in the tidyverse?
3 Answers

4
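Here is one way to get the desired result, though it could use some polishing. You can create intervals, a class provided by lubridate for time spans with a specific start and end. It comes with the %within% operator, which checks whether a date falls inside an interval. So we first create the interval and convert the pres column to character so that we can index it properly. Then we use map_chr to iterate over the event dates with a function that says "check whether this date is within each interval, get the index of the interval it actually falls in (using which), and return the corresponding president". Obviously, this requires each date to be found in exactly one interval; otherwise it will fail.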
library(tidyverse)
library(lubridate)

pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush", 
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"), 
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 
                                           17186), class = "Date"))

events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"), 
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))

pres2 <- pres %>%
  mutate(
    presidency = interval(inaugdate, lead(inaugdate, default = today())),
    pres = as.character(pres)
  )
events %>%
  mutate(pres = map_chr(date, ~ pres2$pres[which(. %within% pres2$presidency)]))
#>                  event       date           pres
#> 1 Challenger explosion 1986-01-28  Ronald Reagan
#> 2  Chernobyl explosion 1986-04-26  Ronald Reagan
#> 3    Hurricane Katrina 2005-08-29 George W. Bush
#> 4                 9-11 2001-09-11 George W. Bush

Created on 2019-02-04 by the reprex package (v0.2.1)
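The one-interval requirement can bite in practice: a date that falls exactly on an inauguration day lies inside two closed intervals, and a date before the first inauguration lies inside none. The following is a minimal sketch of one way around that, not part of the original answer. It swaps the lubridate intervals for plain half-open comparisons (start inclusive, end exclusive) and returns NA when nothing matches. The helper name lookup_pres, the end column, and the two extra test events are hypothetical, added only for illustration.

```r
library(dplyr)
library(purrr)

pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush",
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"),
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264,
                                           17186), class = "Date"),
                   stringsAsFactors = FALSE)

# The original four events plus two hypothetical edge cases:
# one exactly on an inauguration day (14264 is pres$inaugdate[5])
# and one before the first inauguration in the table.
events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11",
                               "On an inauguration day", "Before the table starts"),
                     date = structure(c(5871, 5959, 13024, 11576, 14264, 3000),
                                      class = "Date"),
                     stringsAsFactors = FALSE)

# Each term runs from its own inauguration up to, but not including, the next one.
# The open-ended last term is capped at tomorrow so that today still matches.
pres2 <- pres %>%
  mutate(end = lead(inaugdate, default = Sys.Date() + 1))

# Hypothetical helper: returns the matching president, or NA if there is none.
lookup_pres <- function(d) {
  hit <- which(d >= pres2$inaugdate & d < pres2$end)
  if (length(hit) == 1) pres2$pres[hit] else NA_character_
}

result <- events %>%
  mutate(pres = map_chr(date, lookup_pres))
result
```

Because the intervals are half-open, consecutive terms never overlap, so a date can match at most one row.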


1

Probably not the most efficient approach, but we can use an inequality join with sqldf:

library(sqldf)

sqldf('select a.event, a.date, b.pres
      from events a 
      left join pres b
      on a.date >= b.inaugdate
      group by a.event 
      having min(a.date - b.inaugdate)
      order by date, event')

Output:

                 event       date           pres
1 Challenger explosion 1986-01-28  Ronald Reagan
2  Chernobyl explosion 1986-04-26  Ronald Reagan
3                 9-11 2001-09-11 George W. Bush
4    Hurricane Katrina 2005-08-29 George W. Bush
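The same inequality join can also be written without SQL. This is a sketch that assumes dplyr 1.1.0 or later, where join_by() and its closest() helper are available; older versions of dplyr do not have them. For each event, closest(date >= inaugdate) keeps only the latest inauguration on or before the event date, which is the approximate-match behaviour the question asks about.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0 for join_by() and closest()

pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush",
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"),
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264,
                                           17186), class = "Date"),
                   stringsAsFactors = FALSE)

events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"),
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"),
                     stringsAsFactors = FALSE)

# For every event, join the single pres row whose inaugdate is the
# closest one that is still <= the event date.
result <- events %>%
  left_join(pres, by = join_by(closest(date >= inaugdate)))
result
```

Events dated before the first inauguration in pres would come back with NA in the pres column, since left_join keeps unmatched rows.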

0

Maybe not very efficient (depending on the number of rows and columns), but here is another way to solve the problem.

library(dplyr) 

pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush", 
                            "Bill Clinton", "George W. Bush", "Barack Obama", "Donald Trump"), 
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 
                                           17186), class = "Date")) %>% 
                  #lead date to get interval
                  mutate(enddt = lead(inaugdate, default = Sys.Date())-1)

events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion", "Hurricane Katrina", "9-11"), 
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))          
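# Cross join: merge() with no shared column names returns every combination of rows
newdf <- merge(pres, events, all = TRUE) %>% 
  # enddt is the last day of each term, so the upper bound must be inclusive
  filter(date >= inaugdate, date <= enddt)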
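The cross join builds every pairing of president and event before filtering, so it grows with the product of the two row counts. As a hedged alternative, here is a sketch using base R's findInterval(), which behaves much like an approximate vlookup. It assumes inaugdate is sorted in ascending order; the idx variable is introduced here only for illustration.

```r
pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush",
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"),
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264,
                                           17186), class = "Date"),
                   stringsAsFactors = FALSE)

events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"),
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"),
                     stringsAsFactors = FALSE)

# findInterval() returns, for each event date, the position of the last
# inauguration date that is <= the event date (0 if there is none).
# It requires the second argument to be sorted in ascending order.
idx <- findInterval(as.numeric(events$date), as.numeric(pres$inaugdate))

# A 0 means the event predates the table; turn it into NA so that
# indexing yields NA instead of silently dropping the element.
idx[idx == 0] <- NA

events$pres <- pres$pres[idx]
events
```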
