Select the first positive match by ID and date in R

Question

4

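I have a data frame with observations taken at different times. Once an ID has a positive value in the "Match" column, that ID's rows on later dates must be removed. Here is a sample data frame: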

      Date  ID  Match
2018-06-06  5    1
2018-06-06  6    0
2018-06-07  5    1
2018-06-07  6    0
2018-06-07  7    1
2018-06-08  5    0
2018-06-08  6    1
2018-06-08  7    1
2018-06-08  8    1

Expected output:

      Date  ID  Match
2018-06-06  5    1
2018-06-06  6    0
2018-06-07  6    0
2018-06-07  7    1
2018-06-08  6    1
2018-06-08  8    1

In other words, because ID=5 has a positive match on 2018-06-06, the later rows for ID=5 are removed, but the first positive-match row for that ID is kept.

Reproducible example:

Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- data.frame(Date,ID,Match)

Thanks in advance for your help.

- olive

2

You don't need cbind; you can use data.frame(Date, ID, Match) directly. cbind produces a matrix, so all three columns end up as factors or strings. - Frank

5 Answers

4

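Here is an alternative approach: for each ID, find the smallest row number where Match equals 1 (that is, the first row with a positive match), then filter on it: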

Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match))

library(dplyr)

df %>%
  group_by(ID) %>%                                     # for each ID
  mutate(min_row = min(row_number()[Match == 1])) %>%  # get the first row where you have 1
  filter(row_number() <= min_row) %>%                  # keep previous rows and that row
  ungroup() %>%                                        # forget the grouping
  select(-min_row)                                     # remove unnecessary column

# # A tibble: 6 x 3
#   Date       ID    Match
#   <fct>      <fct> <fct>
# 1 2018-06-06 5     1    
# 2 2018-06-06 6     0    
# 3 2018-06-07 6     0    
# 4 2018-06-07 7     1    
# 5 2018-06-08 6     1    
# 6 2018-06-08 8     1

You can run the code step by step to see how it works. I created the min_row column to help you follow along. You can rewrite the above as:

df %>%
  group_by(ID) %>%                                    
  filter(row_number() <= min(row_number()[Match == 1])) %>%                
  ungroup()

- AntoniosK

2

Inspired by @Frank's answer

 library(dplyr)
 df %>% group_by(ID) %>% mutate(Flag = cumsum(as.numeric(Match))) %>%
        filter(Match==0 & Flag==0 | Match==1 & Flag==1)

 # A tibble: 6 x 4
 # Groups:   ID [4]
  Date       ID    Match  Flag
  <chr>      <chr> <chr> <dbl>
1 2018-06-06 5     1         1
2 2018-06-06 6     0         0
3 2018-06-07 6     0         0
4 2018-06-07 7     1         1
5 2018-06-08 6     1         1
6 2018-06-08 8     1         1

Data

Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match),stringsAsFactors = F)

- A. Suliman

1

If the first value for a given ID is 1 and it is immediately followed by a 0 for that ID, the code will break. - AntoniosK

@AntoniosK You are right, please check it now. - A. Suliman

1

Yes, it works now (that is, it fixes the issue I mentioned). - AntoniosK

2

I have another way to do it with dplyr

library(dplyr)
df %>% 
  group_by(ID) %>% 
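  # You can use order(Date) if you don't want to coerce Date into date object.
  # Note: this relies on rows already being sorted by Date within each ID,
  # since order(Date) matches the row position only in that case.
  mutate(ord = order(Date), first_match = min(ord[Match > 0]), ind = seq_along(Date)) %>% 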
  filter(ind <= first_match) %>%
  select(Date:Match)
# A tibble: 6 x 3
# Groups:   ID [4]
  Date          ID Match
  <chr>      <dbl> <dbl>
1 2018-06-06     5     1
2 2018-06-06     6     0
3 2018-06-07     6     0
4 2018-06-07     7     1
5 2018-06-08     6     1
6 2018-06-08     8     1

- Lambda Moses

1

Here is another dplyr option:

library(dplyr)  
df %>%
  mutate(Date = as.Date(Date)) %>% 
  group_by(ID) %>%
  mutate(first_match = min(Date[Match == 1])) %>% 
  filter((Match == 1 & Date == first_match) | (Match == 0 & Date < first_match)) %>% 
  ungroup() %>% 
  select(-first_match)

# A tibble: 6 x 3
  Date       ID    Match
  <date>     <fct> <fct>
1 2018-06-06 5     1    
2 2018-06-06 6     0    
3 2018-06-07 6     0    
4 2018-06-07 7     1    
5 2018-06-08 6     1    
6 2018-06-08 8     1

- sbha


- Frank · Accepted Answer

One way:

library(data.table)
setDT(df)
df[, Match := as.integer(as.character(Match))] # fix bad format

df[, .SD[shift(cumsum(Match), fill=0) == 0], by=ID]

   ID       Date Match
1:  5 2018-06-06     1
2:  6 2018-06-06     0
3:  6 2018-06-07     0
4:  6 2018-06-08     1
5:  7 2018-06-07     1
6:  8 2018-06-08     1

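We want to drop the rows after the first Match == 1. cumsum takes the cumulative sum of Match, which stays at zero until the first Match == 1. We also want to keep that first matching row itself, so we use shift to check the cumulative sum on the previous row instead.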
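
To make the lagged cumulative sum idea concrete, here is a minimal base R sketch of the same logic. It is not Frank's code: it swaps data.table's shift for a hand-written lag, c(0, head(cumsum(m), -1)), applied per ID with ave. The names prev_matches and result are made up for illustration, and the sketch assumes the rows are already sorted by Date within each ID, as in the question's data.

```r
# Sample data from the question
Date  <- c("2018-06-06", "2018-06-06", "2018-06-07", "2018-06-07", "2018-06-07",
           "2018-06-08", "2018-06-08", "2018-06-08", "2018-06-08")
ID    <- c(5, 6, 5, 6, 7, 5, 6, 7, 8)
Match <- c(1, 0, 1, 0, 1, 0, 1, 1, 1)
df <- data.frame(Date, ID, Match)

# Lagged cumulative sum of Match within each ID: the number of positive
# matches seen on EARLIER rows of the same ID (hypothetical helper column).
# Assumes rows are sorted by Date within each ID.
prev_matches <- ave(df$Match, df$ID,
                    FUN = function(m) c(0, head(cumsum(m), -1)))

# Keep a row only if no positive match occurred before it
result <- df[prev_matches == 0, ]
result
```

For ID 5 the Match values are 1, 1, 0, so the cumulative sum is 1, 2, 2 and the lagged version is 0, 1, 2; only the first row passes the == 0 test. For ID 6 the values are 0, 0, 1, the lagged sum is 0, 0, 0, and all three rows are kept, including the first match.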