Why does summary(tibble()) not report NAs in chr columns?

Question

4

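I am cleaning data stored in a tibble, and I keep getting tripped up: I convert some empty-string observations to NA, and they then seem to vanish when I call summary(df) to check my work. It appears that, with tibble(), only non-character columns report their NAs. Why is this? Is it intentional, and if so, why?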

Minimal example:

tdf <- tibble::tibble(a = c("apple", "pear", NA), 
                      b = 1:3, c = factor(letters[1:3]))
# We see that the NA in the 'chr' column is not displayed
summary(tdf) 
#>       a                   b       c    
#>  Length:3           Min.   :1.0   a:1  
#>  Class :character   1st Qu.:1.5   b:1  
#>  Mode  :character   Median :2.0   c:1  
#>                     Mean   :2.0        
#>                     3rd Qu.:2.5        
#>                     Max.   :3.0
# But NA in other column types will be
tdf[3, 2:3] <- NA
summary(tdf)
#>       a                   b           c    
#>  Length:3           Min.   :1.00   a   :1  
#>  Class :character   1st Qu.:1.25   b   :1  
#>  Mode  :character   Median :1.50   c   :0  
#>                     Mean   :1.50   NA's:1  
#>                     3rd Qu.:1.75           
#>                     Max.   :2.00           
#>                     NA's   :1

# This behavior is not the same with data.frame
ddf <- data.frame(a = c("apple", "pear", NA), 
                  b = 1:3, c = factor(letters[1:3]))
summary(ddf)
#>      a           b       c    
#>  apple:1   Min.   :1.0   a:1  
#>  pear :1   1st Qu.:1.5   b:1  
#>  NA's :1   Median :2.0   c:1  
#>            Mean   :2.0        
#>            3rd Qu.:2.5        
#>            Max.   :3.0
ddf[3, 2:3] <- NA
summary(ddf)
#>      a           b           c    
#>  apple:1   Min.   :1.00   a   :1  
#>  pear :1   1st Qu.:1.25   b   :1  
#>  NA's :1   Median :1.50   c   :0  
#>            Mean   :1.50   NA's:1  
#>            3rd Qu.:1.75           
#>            Max.   :2.00           
#>            NA's   :1

Created on 2018-03-01 by the reprex package (v0.2.0).

- gfgm

tdf %>% group_by(a) %>% tally will give you the count of NAs. - loki

2 Answers

1

Why?
Probably a design choice.

How to work around it:
You can use lapply() with table(), passing useNA = "always" or useNA = "ifany":

tdf <- tibble::tibble(a = c("apple", "pear", NA, NA), 
                      b = 1:4, c = factor(letters[1:4]), 
                      d = c("apple", "pear", "peach", NA))
lapply(tdf, function(x){table(x, useNA = "always")})
# $a
# x
# apple  pear  <NA> 
#     1     1     2 
# $b
# x
#   1    2    3    4 <NA> 
#   1    1    1    1    0 
# $c
# x
#   a    b    c    d <NA> 
#   1    1    1    1    0 
# $d
# x
# apple peach  pear  <NA> 
#     1     1     1     1
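
As a minimal sketch of how the two useNA values differ, assuming two made-up vectors (with_na and without_na are illustrative names, not part of the data above): "always" adds an NA cell even when nothing is missing, while "ifany" adds it only when at least one NA is present.

```r
# Made-up example vectors, one with a missing value and one without.
with_na    <- c("apple", "pear", NA)
without_na <- c("apple", "pear")

# "always" reserves an <NA> cell even when its count is zero.
always_tab <- table(without_na, useNA = "always")
# "ifany" adds the <NA> cell only when the input contains an NA.
ifany_tab  <- table(without_na, useNA = "ifany")
ifany_na   <- table(with_na, useNA = "ifany")

length(always_tab)  # 3 cells: apple, pear, <NA>
length(ifany_tab)   # 2 cells: apple, pear
length(ifany_na)    # 3 cells: apple, pear, <NA>
```

"ifany" keeps the output compact for complete columns; "always" gives every column the same shape, which can be easier to scan.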

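After grouping, you can also use dplyr::tally() to check an individual column.

library(dplyr)  # provides %>%, group_by() and tally()
tdf %>% group_by(a) %>% tally()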
# # A tibble: 3 x 2
#       a     n
#   <chr> <int>
# 1 apple     1
# 2  pear     1
# 3  <NA>     2

- loki

I appreciate you taking the time to answer, but my question is about why, not how. - gfgm

I know. Still, I wanted to include it for future visitors. - loki

- Marc P · Accepted Answer

1

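This is because when you create column 'a' in the data frame, it is defined as a factor (see stringsAsFactors). When you create the column in a tibble, it is a character column.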

class(ddf$a)
#> [1] "factor"

class(tdf$a)
#> [1] "character"

If you create the data frame with stringsAsFactors = FALSE, you will see that it behaves like the tibble.

ddf <- data.frame(a = c("apple", "pear", NA), 
                  b = 1:3, c = factor(letters[1:3]),
                  stringsAsFactors = FALSE)

class(ddf$a)
#> [1] "character"

- Marc P
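
The factor/character distinction described in this answer also suggests a workaround, sketched here under some assumptions: the data frame df and the helper names is_chr and df_fct are made up for illustration, and only base R is used. Note that from R 4.0.0 onward data.frame() no longer converts strings to factors by default, so on current R versions a plain data.frame() behaves like the tibble in the question.

```r
# Made-up example data; stringsAsFactors = FALSE keeps 'a' as character
# on every R version (it is the default from R 4.0.0 onward).
df <- data.frame(a = c("apple", "pear", NA), b = 1:3,
                 stringsAsFactors = FALSE)

# Convert only the character columns to factors, working on a copy.
df_fct <- df
is_chr <- vapply(df_fct, is.character, logical(1))
df_fct[is_chr] <- lapply(df_fct[is_chr], factor)

summary(df$a)      # character: Length / Class / Mode only
summary(df_fct$a)  # factor: per-level counts plus an "NA's" entry
summary(df_fct)    # the data-frame summary now lists NA's for column a
```

Converting on a copy leaves the original character columns untouched, so the factor version is used only for inspection.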

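It explains the discrepancy, and it is good that it shows summary never displays the number of NAs in character columns regardless of the data structure, so I'll upvote. But I still wonder why summary() does not report this: if I call sum(is.na(ddf$a)), for example, I can get the number of NAs in any character vector. - gfgm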

When you call is.na(), it returns TRUE or FALSE. As numbers, TRUE is 1 and FALSE is 0, so the sum is the number of TRUEs in the vector. Try as.numeric(TRUE) and as.numeric(FALSE). - Marc P
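
A minimal sketch of the coercion described in this comment, assuming a made-up vector x:

```r
x <- c("apple", NA, "pear", NA)  # made-up example vector

flags <- is.na(x)  # logical vector: FALSE TRUE FALSE TRUE
as.numeric(flags)  # 0 1 0 1
sum(flags)         # 2, the number of missing values
mean(flags)        # 0.5, the share of missing values
```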

1

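Yes, that is precisely my point. For example, sapply(df, function(x){sum(is.na(x))}) tells me how many NAs there are in each column of an entire data frame or tibble. The point is not "how to count NAs" (that is easy), but why summary does not report the NAs in a character vector. - gfgm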
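
For reference, the per-column count mentioned in this comment can be sketched as follows, assuming a made-up data frame df; colSums(is.na(df)) is shown as an equivalent base R alternative.

```r
# Made-up example data with NAs in a character and a numeric column.
df <- data.frame(a = c("apple", "pear", NA, NA),
                 b = c(1, NA, 3, 4),
                 stringsAsFactors = FALSE)

# Column-by-column count, as written in the comment above.
na_counts <- sapply(df, function(x) sum(is.na(x)))

# Same result via the logical matrix returned by is.na() on a data frame.
na_counts2 <- colSums(is.na(df))

na_counts   # a = 2, b = 1
na_counts2  # a = 2, b = 1
```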

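summary reports the number of NAs in a factor vector because NA is another level of the vector. You can see that it also reports the number of pears and apples. You can also see that the NAs are reported as "NA's"; I guess that is because they are coded independently of the levels in use. For character vectors this "feature" has not been implemented. Why? I don't know. - Marc P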

I don't think NA is a level of the factor. For example, c <- factor(c("a", "b", NA)); levels(c) returns only "a" and "b". - gfgm
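
The check in this comment can be run as below. As a sketch, it also shows addNA() from base R, which makes NA an explicit level when that is wanted; the names f and f_na are illustrative.

```r
f <- factor(c("a", "b", NA))

levels(f)  # "a" "b": NA is a missing value here, not a level
is.na(f)   # FALSE FALSE TRUE

# addNA() returns a factor in which NA is an explicit extra level.
f_na <- addNA(f)
levels(f_na)  # "a" "b" NA

table(f)     # counts the two real levels only (useNA defaults to "no")
table(f_na)  # NA is now a level, so it gets its own cell
```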

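You are totally right. I wrote "NA" instead of NA, my mistake. I guess it may have something to do with the mode: the mode of tdf$a is character, while the mode of ddf$a is numeric. - Marc P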
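
The modes mentioned in this comment can be checked directly. A minimal sketch, assuming made-up vectors chr and fct that mirror column a: a factor is stored as integer codes, which is why its mode is numeric. Since summary() is an S3 generic that picks its method from the class of its argument, the mode difference is probably a symptom of the factor/character distinction rather than the cause of the differing output.

```r
chr <- c("apple", "pear", NA)  # like tdf$a
fct <- factor(chr)             # like ddf$a under stringsAsFactors = TRUE

mode(chr)    # "character"
mode(fct)    # "numeric": factors are integer codes plus a levels attribute
typeof(fct)  # "integer"
class(fct)   # "factor": the class is what summary() dispatches on
```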