How to fill in missing values in R when there is enough information to work out what they should be

Question

4

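I have a dataset with some NA values, but I can work out by hand what those values should be: the data frame has a name column, the remaining columns are numeric, and the last column is the total. Each row has at most one NA, so I can recover the missing value from the total column and the sum of the other columns. I'm wondering what the best way to do this is without hard-coding each case one by one, because the data frame I'm working with is very large.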

Example data frame:

df = structure(list(city = c("sydney", "new york", "london", "beijing", "paris", "madrid"), 
                    year = c(2005:2010), 
                    A = c(1, 4, 5 , NA, 2, 1), 
                    B = c(3, NA, 4 , 9, 0, 6),
                    C = c(3, 4 , 6, 1, 8, NA),
                    total = c(NA, 10, 15, 14, NA, 15)), 
               class = "data.frame", row.names = c(NA, -6L))

df
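
The "at most one NA per row" condition is what makes the arithmetic possible, so it may be worth checking before filling anything. A minimal sketch, assuming the numeric columns are A, B, C and total as in the example (the names value_cols and na_per_row are introduced here purely for illustration):

```r
# Example data from the question, rebuilt so the snippet runs on its own
df <- data.frame(
  city  = c("sydney", "new york", "london", "beijing", "paris", "madrid"),
  year  = 2005:2010,
  A     = c(1, 4, 5, NA, 2, 1),
  B     = c(3, NA, 4, 9, 0, 6),
  C     = c(3, 4, 6, 1, 8, NA),
  total = c(NA, 10, 15, 14, NA, 15)
)

# Hypothetical name for the columns that take part in the sum
value_cols <- c("A", "B", "C", "total")

# Count the NAs in each row across those columns
na_per_row <- rowSums(is.na(df[value_cols]))

# A row can only be recovered if it has at most one NA
stopifnot(all(na_per_row <= 1))
```

If the check fails on real data, the offending rows can be listed with which(na_per_row > 1) and handled separately.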

- P_S_13

Take a look at the fill and replace_na functions in the tidyr package. - Maël

1

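In addition, the na.approx function from the zoo package may help. You can do something like df = df %>% mutate(A = na.approx(A)) to interpolate the values in column A, and likewise for the other columns. - thehand0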

4 Answers

2

This solution may also help you.

library(purrr)
library(dplyr)

df %>%
  rowwise() %>%
  mutate(total = ifelse(is.na(total), sum(c_across(A:C)), total), 
         pmap_df(select(cur_data(), A:total), ~ {x <- c(...)[1:3]
         replace(x, is.na(x), c(...)[4] - sum(x, na.rm = TRUE))}))

# A tibble: 6 x 6
# Rowwise: 
  city      year     A     B     C total
  <chr>    <int> <dbl> <dbl> <dbl> <dbl>
1 sydney    2005     1     3     3     7
2 new york  2006     4     2     4    10
3 london    2007     5     4     6    15
4 beijing   2008     4     9     1    14
5 paris     2009     2     0     8    10
6 madrid    2010     1     6     8    15

It is a bit too hard-coded (the positions 1:3 and 4 are fixed), but it can be adapted from here.
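
One possible way to loosen the hard-coding is to name the part columns once and compute the per-row shortfall in a helper column. This is only a sketch, not the answer author's method: it assumes dplyr 1.1.0 or later (for pick()), and the names cols, gap, and filled are made up for illustration.

```r
library(dplyr)

# Example data from the question, rebuilt so the snippet runs on its own
df <- data.frame(
  city  = c("sydney", "new york", "london", "beijing", "paris", "madrid"),
  year  = 2005:2010,
  A     = c(1, 4, 5, NA, 2, 1),
  B     = c(3, NA, 4, 9, 0, 6),
  C     = c(3, 4, 6, 1, 8, NA),
  total = c(NA, 10, 15, 14, NA, 15)
)

cols <- c("A", "B", "C")  # hypothetical: the columns that add up to total

filled <- df %>%
  mutate(
    # Fill missing totals first, from the row sum of the parts
    total = coalesce(total, rowSums(pick(all_of(cols)))),
    # Shortfall per row: total minus the parts that are present
    gap = total - rowSums(pick(all_of(cols)), na.rm = TRUE),
    # Any part that is still NA must equal the shortfall
    across(all_of(cols), ~ coalesce(.x, gap))
  ) %>%
  select(-gap)

filled
```

Because each row has at most one NA, the shortfall is exactly the missing value; no rowwise() step is needed, since rowSums() is already vectorised over rows.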

- Anoushiravan R

2

A data.table solution

library(data.table)
setDT(df)

cols <- c("A", "B", "C")

df[, (cols) := lapply(.SD, function(x) {
  ifelse(is.na(x), total - rowSums(.SD, na.rm = TRUE), x)
}), .SDcols = cols][is.na(total), total := rowSums(.SD), .SDcols = cols]

df
#        city year A B C total
# 1:   sydney 2005 1 3 3     7
# 2: new york 2006 4 2 4    10
# 3:   london 2007 5 4 6    15
# 4:  beijing 2008 4 9 1    14
# 5:    paris 2009 2 0 8    10
# 6:   madrid 2010 1 6 8    15

Data

df = structure(list(
  city = c("sydney", "new york", "london", "beijing", "paris", "madrid"), 
  year = c(2005:2010), 
  A = c(1, 4, 5 , NA, 2, 1), 
  B = c(3, NA, 4 , 9, 0, 6),
  C = c(3, 4 , 6, 1, 8, NA),
  total = c(NA, 10, 15, 14, NA, 15)), 
  class = "data.frame", row.names = c(NA, -6L)
)

- Merijn van Tilborg

1

Here is a base R solution using apply.

df = structure(list(city = c("sydney", "new york", "london", "beijing", "paris", "madrid"), 
                    year = c(2005:2010), 
                    A = c(1, 4, 5 , NA, 2, 1), 
                    B = c(3, NA, 4 , 9, 0, 6),
                    C = c(3, 4 , 6, 1, 8, NA),
                    total = c(NA, 10, 15, 14, NA, 15)), 
               class = "data.frame", row.names = c(NA, -6L))

df[-(1:2)] <- t(apply(df[-(1:2)], 1, \(x) {
  if(is.na(x[4])) {
    x[4] <- sum(x[-4])
  } else if(anyNA(x[-4])) {
      x[-4][is.na(x[-4])] <- x[4] - sum(x[-4][!is.na(x[-4])])
  }
  x
}))
df
#>       city year A B C total
#> 1   sydney 2005 1 3 3     7
#> 2 new york 2006 4 2 4    10
#> 3   london 2007 5 4 6    15
#> 4  beijing 2008 4 9 1    14
#> 5    paris 2009 2 0 8    10
#> 6   madrid 2010 1 6 8    15

Created on 2022-02-09 by the reprex package (v2.0.1)

- Rui Barradas

- Claudiu Papasteri · Accepted Answer

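First replace the NAs in the total column; after that, the rest can be computed directly. You could also write a function for columns A, B, and C so that you don't have to repeat the code, but with only 3 columns that shouldn't be a problem.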

df = structure(list(city = c("sydney", "new york", "london", "beijing", "paris", "madrid"), 
                    year = c(2005:2010), 
                    A = c(1, 4, 5 , NA, 2, 1), 
                    B = c(3, NA, 4 , 9, 0, 6),
                    C = c(3, 4 , 6, 1, 8, NA),
                    total = c(NA, 10, 15, 14, NA, 15)), 
               class = "data.frame", row.names = c(NA, -6L))

df
#>       city year  A  B  C total
#> 1   sydney 2005  1  3  3    NA
#> 2 new york 2006  4 NA  4    10
#> 3   london 2007  5  4  6    15
#> 4  beijing 2008 NA  9  1    14
#> 5    paris 2009  2  0  8    NA
#> 6   madrid 2010  1  6 NA    15

df$total <- ifelse(is.na(df$total), rowSums(df[, c("A", "B", "C")]), df$total)
df$A <- ifelse(is.na(df$A), df$total - rowSums(df[, c("A", "B", "C")], na.rm = TRUE), df$A)
df$B <- ifelse(is.na(df$B), df$total - rowSums(df[, c("A", "B", "C")], na.rm = TRUE), df$B)
df$C <- ifelse(is.na(df$C), df$total - rowSums(df[, c("A", "B", "C")], na.rm = TRUE), df$C)

df
#>       city year A B C total
#> 1   sydney 2005 1 3 3     7
#> 2 new york 2006 4 2 4    10
#> 3   london 2007 5 4 6    15
#> 4  beijing 2008 4 9 1    14
#> 5    paris 2009 2 0 8    10
#> 6   madrid 2010 1 6 8    15

Created on 2022-02-09 by the reprex package (v2.0.1)
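
For a wider data frame, the function mentioned above could look roughly like the following. It is a sketch in base R under the question's assumption of at most one NA per row; fill_from_total and its arguments are hypothetical names, not part of any package.

```r
# Example data from the question, rebuilt so the snippet runs on its own
df <- data.frame(
  city  = c("sydney", "new york", "london", "beijing", "paris", "madrid"),
  year  = 2005:2010,
  A     = c(1, 4, 5, NA, 2, 1),
  B     = c(3, NA, 4, 9, 0, 6),
  C     = c(3, 4, 6, 1, 8, NA),
  total = c(NA, 10, 15, 14, NA, 15)
)

# Hypothetical helper: fills a missing total from its parts,
# or a single missing part from the total
fill_from_total <- function(data, cols, total_col = "total") {
  parts <- as.matrix(data[cols])
  total <- data[[total_col]]

  # Missing totals are the plain row sum of the parts
  total <- ifelse(is.na(total), rowSums(parts), total)

  # Shortfall per row: total minus the parts that are present
  gap <- total - rowSums(parts, na.rm = TRUE)

  # Positions (row, col) of the missing parts; each gets its row's shortfall
  idx <- which(is.na(parts), arr.ind = TRUE)
  parts[idx] <- gap[idx[, "row"]]

  data[cols] <- parts
  data[[total_col]] <- total
  data
}

filled <- fill_from_total(df, cols = c("A", "B", "C"))
filled
```

Rows that break the assumption (for example, both the total and a part missing) are left as NA rather than guessed.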

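Update: after replacing the NAs in the total column, you can also use the na.approx function from the zoo package to interpolate the remaining values. Note that this interpolates down each column rather than using the row totals, so the filled values are estimates and need not add up to total (for example, B for new york becomes 3.5 instead of 2).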

library(zoo)

df$total <- ifelse(is.na(df$total), rowSums(df[, c("A", "B", "C")]), df$total)   # first totals
df[, c("A", "B", "C")] <- na.approx(df[, c("A", "B", "C")], rule = 2)            # then rest (same three columns on both sides)
df
      city year   A   B C total
1   sydney 2005 1.0 3.0 3     7
2 new york 2006 4.0 3.5 4    10
3   london 2007 5.0 4.0 6    15
4  beijing 2008 3.5 9.0 1    14
5    paris 2009 2.0 0.0 8    10
6   madrid 2010 1.0 6.0 8    15