How to filter rows in each group after a condition has occurred

8

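I'm a beginner trying out the R programming language. Once a filter condition has been met, I want to delete specific rows within a group.

Scenario: for each group, once there are two consecutive rows of type "B", delete all of the group's rows that follow them. The "Include in DataSet?" column shows the expected result.

Here is my sample input: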

GROUP   TYPE    Include in DataSet?
--------------------------------------------
1       A       yes
1       A       yes
1       B       yes
1       B       yes
1       B       no
2       A       yes
2       B       yes
2       B       yes
2       A       no
2       B       no
2       B       no

DF = structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A", 
"B", "B"), inc = c("yes", "yes", "yes", "yes", "no", "yes", "yes", 
"yes", "no", "no", "no")), .Names = c("GROUP", "TYPE", "inc"), row.names = c(NA, 
-11L), class = "data.frame")

Desired output:

GROUP   TYPE    Include in DataSet?
--------------------------------------------
1       A       yes
1       A       yes
1       B       yes
1       B       yes
2       A       yes
2       B       yes
2       B       yes

I tried to write some code, but it did not work because of the grouping.

i=1
j=2
x <- allrows
for (i in x){
  for(j in x){
    if(i==j){
      a$REMOVE=1
    }
    else{
      a$REMOVE=2
    }
  }
}
4 Answers

9

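You can do this by creating a new variable that flags "double B" rows, then filtering out the rows that come after the first "double B" row in each group: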

library(dplyr)
df %>%
    group_by(GROUP) %>%
    # Create new variable that tests whether this row and the one above it both have TYPE == 'B'
    mutate(double_B = (TYPE == 'B' & lag(TYPE) == 'B')) %>%
    # Find the first row with `double_B` in each group, filter out rows after it
    filter(row_number() <= min(which(double_B == TRUE))) %>%
    # Optionally, remove `double_B` column when done with it
    select(-double_B)

# A tibble: 7 x 3
# Groups:   GROUP [2]
  GROUP TYPE  IncludeinDataSet
  <int> <chr> <chr>           
1     1 A     yes             
2     1 A     yes             
3     1 B     yes             
4     1 B     yes             
5     2 A     yes             
6     2 B     yes             
7     2 B     yes       

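As @Frank pointed out in the comments, you don't need to create the double_B variable: you can test the "double B" condition directly inside the which() call within filter: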

df %>%
    group_by(GROUP) %>%
    # Find the first row with `double_B` in each group, filter out rows after it
    filter(row_number() <= min(which(TYPE == 'B' & lag(TYPE) == 'B')))

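Also, if a group contains no "double B", which() returns an empty vector and min() returns Inf with a warning, so every row of that group is kept; the filter still gives the correct result.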
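One possible way to avoid that warning, shown here as a minimal sketch rather than as part of the original answer, is to replace min(which(...)) with match(TRUE, ..., nomatch = n()), which falls back to the group size when no "double B" exists. The data frame toy and the name res are made up for illustration; group 1 deliberately has no consecutive "B" rows.

```r
library(dplyr)

# Illustrative data (not from the question): group 1 has no consecutive "B"s
toy <- data.frame(
  GROUP = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  TYPE  = c("A", "B", "A", "A", "B", "B", "A")
)

res <- toy %>%
  group_by(GROUP) %>%
  # match() gives the position of the first TRUE, or n() when there is none,
  # so groups without a "double B" keep all their rows and no warning is raised
  filter(row_number() <= match(TRUE, TYPE == "B" & lag(TYPE) == "B", nomatch = n())) %>%
  ungroup()

res
```

Group 1 is returned whole, while group 2 is cut after its second consecutive "B".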


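Regarding "optionally": another way is to use the condition without assigning it a name: df %>% group_by(GROUP) %>% filter(row_number() <= min(which(TYPE == 'B' & lag(TYPE) == 'B'))). By the way, there will be a warning if two Bs never occur (for example, try min(which(FALSE))), though I'm not sure whether there is a way around it. - Frank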

4

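You can find the numeric index by checking whether the current value of TYPE and the next value of TYPE are both "B", take the first such index, and use seq_len to build the sequence from 1 to that index plus one, which then subsets the rows (in slice).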

library(dplyr)
df1 %>% 
  group_by(GROUP) %>% 
  slice(seq_len(which((TYPE == "B") & lead(TYPE) == "B")[1] + 1))
# A tibble: 7 x 3
# Groups:   GROUP [2]
#  GROUP TYPE  IncludeInDataSet
#  <int> <chr> <chr>           
#1     1 A     yes             
#2     1 A     yes             
#3     1 B     yes             
#4     1 B     yes             
#5     2 A     yes             
#6     2 B     yes             
#7     2 B     yes          

data

df1 <- structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
 2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A", 
 "B", "B"), IncludeInDataSet = c("yes", "yes", "yes", "yes", "no", 
  "yes", "yes", "yes", "no", "no", "no")), class = "data.frame", 
 row.names = c(NA, -11L))
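To make the indexing in the slice call easier to follow, here is a step-by-step sketch on a single vector in base R. The vector type and the helper names nxt, hits and keep are made up for illustration, and c(type[-1], NA) is used as a stand-in for dplyr's lead().

```r
# Illustrative vector for one group (not from the question)
type <- c("A", "B", "B", "A", "B", "B")

# Base R stand-in for lead(type): shift values up by one, pad with NA
nxt <- c(type[-1], NA)

# Positions where a "B" is immediately followed by another "B"
hits <- which(type == "B" & nxt == "B")

# The first hit marks the first B of the pair, so add 1 to include the second B
keep <- seq_len(hits[1] + 1)

type[keep]
```

Here hits is 2 and 5, so keep is 1:3 and the result is "A" "B" "B". Note that hits[1] is NA when a group has no consecutive "B" rows, so that case would need separate handling.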

Great answer! Could you modify this code so that, in each group, everything after the first "B" is deleted? - stats_noob
For example, is this correct? - stats_noob
1
library(dplyr) df1 %>% group_by(GROUP) %>% slice(seq_len(which((TYPE == "B") & lead(TYPE) == "B")[1] + 0)) - stats_noob
@stats_noob You can do that. - akrun
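A note on the follow-up above: the + 0 variant still looks for the first pair of consecutive "B" rows and stops at the first row of that pair, which matches "the first B in the group" only when the first "B" happens to start such a pair. If the goal is to keep each group up to and including its first "B", whatever follows it, one possible sketch uses match(). The data frame toy1 and the name first_b are made up for illustration.

```r
library(dplyr)

# Illustrative data (not from the question): in group 1 the first "B" is isolated
toy1 <- data.frame(
  GROUP = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  TYPE  = c("A", "B", "A", "B", "A", "A", "B", "B")
)

first_b <- toy1 %>%
  group_by(GROUP) %>%
  # match() returns the position of the first "B"; nomatch = n() keeps the
  # whole group when it contains no "B" at all
  slice(seq_len(match("B", TYPE, nomatch = n()))) %>%
  ungroup()

first_b
```

Group 1 keeps rows 1 to 2 and group 2 keeps rows 1 to 3.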

2
Another approach could be:
library(dplyr)
library(data.table)

df %>%
  group_by(GROUP, rleid(TYPE)) %>%
  mutate(temp = seq_along(TYPE)) %>%
  ungroup() %>%
  group_by(GROUP) %>%
  filter(row_number() <= min(which(TYPE == "B" & temp == 2))) %>%
  select(GROUP, TYPE, IncludeInDataSet)
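The idea behind this pipeline can be sketched on a single vector: rleid() numbers each run of identical values, and counting within each run gives the position of a row inside its run, so the second row of a "B" run is the end of a "double B". The vector type and the names run_id, pos_in_run and cutoff are made up for illustration, and base R's ave() stands in for the grouped mutate.

```r
library(data.table)

# Illustrative vector for one group (not from the question)
type <- c("A", "B", "B", "A", "B", "B")

# Run id: increases by one every time the value changes
run_id <- rleid(type)

# Position of each element within its run
pos_in_run <- ave(seq_along(type), run_id, FUN = seq_along)

# First row that is the second element of a "B" run
cutoff <- min(which(type == "B" & pos_in_run == 2))

type[seq_len(cutoff)]
```

Here run_id is 1 2 2 3 4 4, pos_in_run is 1 1 2 1 1 2, and cutoff is 3.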

Great answer! If I wanted to modify your answer so that, in each group, all rows after the first "B" are deleted, would this be correct? - stats_noob
df %>% group_by(GROUP, rleid(TYPE)) %>% mutate(temp = seq_along(TYPE)) %>% ungroup() %>% group_by(GROUP) %>% filter(row_number() <= min(which(TYPE == "B" & temp == 1))) %>% select(GROUP, TYPE, IncludeInDataSet) - stats_noob

1

Here's a base R solution:

subset(DF, as.logical(ave(DF$TYPE, DF$GROUP, FUN = function(x) 
  seq_along(x) <= which((sequence(rle(x == "B")$lengths) * (x == "B")) %in% 2)[1])))
#   GROUP TYPE inc
# 1     1    A yes
# 2     1    A yes
# 3     1    B yes
# 4     1    B yes
# 6     2    A yes
# 7     2    B yes
# 8     2    B yes
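The one-liner is dense, so here is a sketch that unpacks the function passed to ave() on a single group's TYPE vector. The vector x and the intermediate names is_b, pos, b_pos, cutoff and keep are made up for illustration.

```r
# Illustrative TYPE values for one group (not from the question)
x <- c("A", "B", "B", "A", "B", "B")

is_b <- x == "B"

# Position of each element within its run of equal is_b values
pos <- sequence(rle(is_b)$lengths)

# Zero out the non-"B" rows, leaving the position within each "B" run
b_pos <- pos * is_b

# First row that is the second "B" of a run
cutoff <- which(b_pos %in% 2)[1]

# Logical mask of the rows to keep
keep <- seq_along(x) <= cutoff

x[keep]
```

Here pos is 1 1 2 1 1 2, b_pos is 0 1 2 0 1 2, and cutoff is 3, so the first three rows are kept.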

Great answer! If I wanted to modify your answer so that, in each group, all rows after the first "B" are deleted, would this be correct? - stats_noob
subset(DF, as.logical(ave(DF$TYPE, DF$GROUP, FUN = function(x) seq_along(x) <= which((sequence(rle(x=="B")$length) * (x=="B")) %in% 1)[1]))) - stats_noob

Content provided by Stack Overflow.