How to find the first element of a group that satisfies a condition?

7
df <- structure(list(group = c(17L, 17L, 17L, 18L, 18L, 18L, 18L, 19L, 
19L, 19L, 20L, 20L, 20L, 21L, 21L, 22L, 23L, 24L, 25L, 25L, 25L, 
26L, 27L, 27L, 27L, 28L), var = c(74L, 49L, 1L, 74L, 1L, 49L, 
61L, 49L, 1L, 5L, 5L, 1L, 44L, 44L, 12L, 13L, 5L, 5L, 1L, 1L, 
4L, 4L, 1L, 1L, 1L, 49L), first = c(0, 0, 1, 0, 1, 0, 0, 0, 1, 
0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)), .Names = c("group", 
"var", "first"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-26L))

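Based on the data in the first two columns, I want to create a third column (called first) where first == 1 only for the first occurrence of var == 1 within a group. In other words, I want to flag the first element within each group that satisfies var == 1. How can I do this in dplyr? Presumably group_by should be used, but what comes next?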


1
What is your expected output? Can you share it? - Saurabh Chauhan
It is shown in the column named "first". - jakes
3 Answers

4
library(dplyr)

df$first = NULL

df %>%
  group_by(group) %>%
  mutate(first = as.numeric(row_number() == min(row_number()[var == 1]))) %>%
  ungroup()

# # A tibble: 26 x 3
#   group   var first
#   <int> <int> <dbl>
# 1    17    74     0
# 2    17    49     0
# 3    17     1     1
# 4    18    74     0
# 5    18     1     1
# 6    18    49     0
# 7    18    61     0
# 8    19    49     0
# 9    19     1     1
# 10   19     5     0
# # ... with 16 more rows

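The idea is to flag the minimum row number where var == 1 within each group.

Because some groups contain no row with var == 1, this returns some warnings: in those groups min() is called on an empty vector and returns Inf, so no row is flagged.

Another option is: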
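If the warnings are unwanted, one possible variation (a sketch, not part of the original answer) is to look up the position of the first var == 1 with match(), which returns the nomatch value instead of warning when nothing matches. The toy data frame below is hypothetical and only illustrates a group (21) that has no var == 1.

```r
library(dplyr)

# Hypothetical toy data: group 21 contains no var == 1
toy <- tibble(
  group = c(17L, 17L, 17L, 21L, 21L, 25L, 25L),
  var   = c(74L, 49L, 1L, 44L, 12L, 1L, 1L)
)

flagged <- toy %>%
  group_by(group) %>%
  # match() gives the position of the first 1 within the group;
  # nomatch = 0L never equals a row number, so such groups get all zeros
  mutate(first = as.numeric(row_number() == match(1L, var, nomatch = 0L))) %>%
  ungroup()

flagged$first
```

Using nomatch = 0L is a design choice for this sketch: row numbers start at 1, so a group without a match is simply never flagged and no warning is raised.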

library(dplyr)

df$first = NULL

# create row id
df$id = seq_along(df$group)

df %>%
  filter(var == 1) %>%                         # keep cases where var = 1
  distinct(group, .keep_all = TRUE) %>%        # keep distinct cases based on group
  mutate(first = 1) %>%                        # create first column
  right_join(df, by=c("id","group","var")) %>% # join back original dataset
  mutate(first = coalesce(first, 0)) %>%       # replace NAs with 0
  select(-id)                                  # remove row id

# # A tibble: 26 x 3
#   group   var first
#   <int> <int> <dbl>
# 1    17    74     0
# 2    17    49     0
# 3    17     1     1
# 4    18    74     0
# 5    18     1     1
# 6    18    49     0
# 7    18    61     0
# 8    19    49     0
# 9    19     1     1
#10    19     5     0
# # ... with 16 more rows

3
For ungrouped data, one solution is:
first_equal_to = function(x, value)
    (x == value) & (cumsum(x == value) == 1)

so
tbl %>% group_by(group) %>% mutate(first = first_equal_to(var, 1))

It seems appropriate to keep this column as a logical vector, since that is what the column represents.

Another implementation is

first_equal_to2 = function(x, value) {
    result = logical(length(x))
    result[match(value, x)] = TRUE
    result
}
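As a quick check, the two helpers can be compared on small vectors in plain R. This is only an illustrative sketch; the input vectors are made up, and the function definitions are repeated so the snippet runs on its own.

```r
# Definitions repeated from above so the snippet is self-contained
first_equal_to = function(x, value)
    (x == value) & (cumsum(x == value) == 1)

first_equal_to2 = function(x, value) {
    result = logical(length(x))
    result[match(value, x)] = TRUE   # match() returns NA when value is absent
    result
}

x_hit  <- c(74L, 1L, 49L, 1L)   # made-up vector with two 1s
x_miss <- c(5L, 44L)            # made-up vector without any 1

r1 <- first_equal_to(x_hit, 1)
r2 <- first_equal_to2(x_hit, 1)
m1 <- first_equal_to(x_miss, 1)
m2 <- first_equal_to2(x_miss, 1)
```

Both versions flag only the second element of x_hit and nothing in x_miss. In first_equal_to2 the no-match case works because assigning a single value through an NA index does not replace any element.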

2
We can use the following expression for first:
DF %>% 
  group_by(group) %>% 
  mutate(first = { var == 1 } %>% { . * !duplicated(.) } ) %>%
  ungroup

giving:

# A tibble: 26 x 3
   group   var first
   <int> <int> <int>
 1    17    74     0
 2    17    49     0
 3    17     1     1
 4    18    74     0
 5    18     1     1
 6    18    49     0
 7    18    61     0
 8    19    49     0
 9    19     1     1
10    19     5     0
# ... with 16 more rows
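To see why the expression works, it may help to evaluate its two steps on a single group in plain R. The vector below is made up for illustration: duplicated() marks every repeat of a value, so !duplicated() is TRUE only at the first FALSE and the first TRUE, and multiplying by the comparison itself keeps just the first TRUE.

```r
var_in_group <- c(74L, 49L, 1L, 5L, 1L)   # made-up values for one group

is_one   <- var_in_group == 1             # FALSE FALSE TRUE FALSE TRUE
is_first <- !duplicated(is_one)           # TRUE at the first FALSE and the first TRUE
first    <- is_one * (is_first)           # logical * logical yields integer 0/1

first
```

The multiplication also explains why the resulting column has type integer rather than logical.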
