How can I copy information to other data points when there are multiple data points?

3

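I have a data cleaning problem. Data were collected three times, and some entries were recorded incorrectly. So if a student's data was collected more than once, the second data point needs to be copied to that student's other rows.

This is what my dataset looks like: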

df <- data.frame(id = c(1,1,1, 2,2,2, 3,3,  4,4, 5),
                 text = c("female","male","male", "female","female","female", "male","female","male", "female", "female"),
                 time = c("first","second","third", "first","second","third", "first","second","second", "third", "first"))
            
> df
   id   text   time
1   1 female  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3   male  first
8   3 female second
9   4   male second
10  4 female  third
11  5 female  first

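So the gender information for IDs 1, 3 and 4 is inconsistent. When there are multiple, differing entries for the "gender" variable, I need to copy the "second" data point. If there is only one data point, it should stay in the dataset as it is.

The desired output is: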

> df1
   id   text   time
1   1   male  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3 female  first
8   3 female second
9   4   male second
10  4   male  third
11  5 female  first

Any suggestions? Thanks!

3 Answers

2

Just for fun, here is another approach:

library(dplyr)

df %>% 
  filter(time =="second") %>% 
  select(-time) %>% 
  full_join(df, ., by ="id", suffix = c("_old", "")) %>% 
  mutate(text = coalesce(text, text_old)) %>% 
  select(names(df))

#>       id text   time  
#>  1     1 male   first 
#>  2     1 male   second
#>  3     1 male   third 
#>  4     2 female first 
#>  5     2 female second
#>  6     2 female third 
#>  7     3 female first 
#>  8     3 female second
#>  9     4 male   second
#> 10     4 male   third 
#> 11     5 female first
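
The join above boils down to a lookup followed by a fallback. As a rough sketch of the same idea in base R (not part of the original answer), the block below builds a lookup of "second" values and falls back to the row's own value when an id has none. It assumes at most one "second" row per id; the names `second`, `from_second` and `df1` are only illustrative.

```r
# Sample data from the question
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5),
  text = c("female", "male", "male", "female", "female", "female",
           "male", "female", "male", "female", "female"),
  time = c("first", "second", "third", "first", "second", "third",
           "first", "second", "second", "third", "first"),
  stringsAsFactors = FALSE
)

# Lookup table: the "second" value for each id that has one
# (assumption: at most one "second" row per id)
second <- df[df$time == "second", c("id", "text")]

# For every row, fetch the "second" value of its id; NA when the id has none
from_second <- second$text[match(df$id, second$id)]

# Fall back to the row's own value when no "second" entry exists
df1 <- df
df1$text <- ifelse(is.na(from_second), df$text, from_second)
df1
```

For the sample data this should reproduce the desired output from the question, with id 5 left untouched because it has no "second" row.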

1
We can use match:
library(dplyr)
df %>% 
  group_by(id) %>%
  mutate(text = text[match("second", time, nomatch = 1)]) %>%
  ungroup

-output

# A tibble: 11 × 3
      id text   time  
   <dbl> <chr>  <chr> 
 1     1 male   first 
 2     1 male   second
 3     1 male   third 
 4     2 female first 
 5     2 female second
 6     2 female third 
 7     3 female first 
 8     3 female second
 9     4 male   second
10     4 male   third 
11     5 female first 

Or use coalesce:

df %>% 
  group_by(id) %>%
  mutate(text = coalesce(text[match("second", time)], text)) %>%
  ungroup

-output

# A tibble: 11 × 3
      id text   time  
   <dbl> <chr>  <chr> 
 1     1 male   first 
 2     1 male   second
 3     1 male   third 
 4     2 female first 
 5     2 female second
 6     2 female third 
 7     3 female first 
 8     3 female second
 9     4 male   second
10     4 male   third 
11     5 female first 
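
The two variants give the same result on the sample data, but their fallbacks differ when a group has no "second" row. The sketch below illustrates this with plain vectors for one hypothetical group that is not in the question. The fallback to each row's own value is written out with if/else instead of coalesce(), since how coalesce() recycles a length-1 first argument may depend on the dplyr version.

```r
# Hypothetical group (not in the question): no "second" row, entries disagree
text <- c("male", "female")
time <- c("first", "third")

match("second", time)               # NA: "second" is absent
match("second", time, nomatch = 1)  # 1: falls back to the first position

# nomatch = 1 variant: the first entry is copied to every row of the group
by_first <- rep(text[match("second", time, nomatch = 1)], length(text))

# Fallback to each row's own value, which is what the coalesce variant aims for
picked <- text[match("second", time)]
by_own <- if (is.na(picked)) text else rep(picked, length(text))

by_first
by_own
```

Which fallback is appropriate depends on whether groups without a "second" entry should be left alone or harmonised to their first entry.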

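Hi @akrun, I have an open question. Do you have time to take a look at it? https://stackoverflow.com/questions/73790880/data-cleaning-question-read-from-excel-files-in-r - amisos55
It is a bit challenging. Yes. - amisos55
I just fixed that part; the birthday information needs to be excluded. - amisos55
Do you have suggestions for any part of this question? Thanks for your time! - amisos55
I have now split it into three parts. Thanks for the suggestion. https://stackoverflow.com/questions/73802817/data-cleaning-question-read-from-excel-files-in-r-1 - amisos55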

1
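Using {dplyr}, we can take the following approach:
  1. We group_by(id).
  2. Inside ifelse we check whether there is an element of text where time == "second"; we use length for this.
  3. If so, we use text[time == "second"]; otherwise we use text.
I just wonder: if you have three data entries where first and second are the same and third is different, the approach above will not work.
Also, if first is "male", second is "female", and third is "male" again, which one should be chosen?
The approach below uses second only when it is available and ignores the rest.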
library(dplyr)


df %>% 
  group_by(id) %>% 
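  # ifelse() gets a length-1 test here, so it returns a single value
  # that mutate() recycles across every row of the group
  mutate(text = ifelse(length(text[time == "second"]) > 0,
                       text[time == "second"],
                       text))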

#> # A tibble: 11 × 3
#> # Groups:   id [5]
#>       id text   time  
#>    <dbl> <chr>  <chr> 
#>  1     1 male   first 
#>  2     1 male   second
#>  3     1 male   third 
#>  4     2 female first 
#>  5     2 female second
#>  6     2 female third 
#>  7     3 female first 
#>  8     3 female second
#>  9     4 male   second
#> 10     4 male   third 
#> 11     5 female first

Created on 2022-09-15 by the reprex package (v0.3.0)
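
Because the questions raised in this answer depend on which ids actually have conflicting entries, it may help to list them before overwriting anything. The following is a small base R sketch (not from the original answer); the names `n_values`, `conflicting`, `has_second` and `unresolved` are made up for illustration.

```r
# Sample data from the question
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5),
  text = c("female", "male", "male", "female", "female", "female",
           "male", "female", "male", "female", "female"),
  time = c("first", "second", "third", "first", "second", "third",
           "first", "second", "second", "third", "first"),
  stringsAsFactors = FALSE
)

# Number of distinct text values recorded for each id
n_values <- tapply(df$text, df$id, function(x) length(unique(x)))

# ids whose entries disagree with each other
conflicting <- names(n_values)[n_values > 1]

# ids that have at least one "second" row to copy from
has_second <- unique(df$id[df$time == "second"])

# Conflicting ids without a "second" row cannot be resolved by this rule
unresolved <- setdiff(conflicting, as.character(has_second))

conflicting
unresolved
```

On the sample data, `conflicting` should contain ids 1, 3 and 4, matching the question, and `unresolved` should be empty, so copying the "second" entry covers every conflict here.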

