如何在箱线图上显示异常值的 ID

3
您如何在箱线图中查看离群值的ID?
structure(list(pot = c(1L, 2L, 3L, 4L, 21L, 22L, 23L, 24L, 5L, 
6L, 7L, 8L, 25L, 26L, 27L, 28L, 9L, 10L, 11L, 12L, 29L, 30L, 
31L, 32L, 13L, 14L, 15L, 16L, 33L, 34L, 35L, 36L, 17L, 18L, 19L, 
20L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 61L, 62L, 63L, 64L, 
45L, 46L, 47L, 48L, 65L, 66L, 67L, 68L, 49L, 50L, 51L, 52L, 69L, 
70L, 71L, 72L, 53L, 54L, 55L, 56L, 73L, 74L, 75L, 76L, 57L, 58L, 
59L, 60L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L, 101L, 102L, 
103L, 104L, 85L, 86L, 87L, 88L, 105L, 106L, 107L, 108L, 89L, 
90L, 91L, 92L, 109L, 110L, 111L, 112L, 93L, 94L, 95L, 96L, 113L, 
114L, 115L, 116L, 97L, 98L, 99L, 100L, 117L, 118L, 119L, 120L, 
121L, 122L, 123L, 124L, 141L, 142L, 143L, 144L, 125L, 126L, 127L, 
128L, 145L, 146L, 147L, 148L, 129L, 130L, 131L, 132L, 149L, 150L, 
151L, 152L, 133L, 134L, 135L, 136L, 153L, 154L, 155L, 156L, 137L, 
138L, 139L, 140L, 157L, 158L, 159L, 160L), rep = c(1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), cultivar = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Dinninup", 
"Riverina", "Seaton Park", "Yarloop"), class = "factor"), Waterlogging = structure(c(2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Non-waterlogged", 
"Waterlogged"), class = "factor"), P = c(12.1, 12.1, 12.1, 12.1, 
12.1, 12.1, 12.1, 12.1, 15.17, 15.17, 15.17, 15.17, 15.17, 15.17, 
15.17, 15.17, 18.24, 18.24, 18.24, 18.24, 18.24, 18.24, 18.24, 
18.24, 24.39, 24.39, 24.39, 24.39, 24.39, 24.39, 24.39, 24.39, 
48.35, 48.35, 48.35, 48.35, 48.35, 48.35, 48.35, 48.35, 12.1, 
12.1, 12.1, 12.1, 12.1, 12.1, 12.1, 12.1, 15.17, 15.17, 15.17, 
15.17, 15.17, 15.17, 15.17, 15.17, 18.24, 18.24, 18.24, 18.24, 
18.24, 18.24, 18.24, 18.24, 24.39, 24.39, 24.39, 24.39, 24.39, 
24.39, 24.39, 24.39, 48.35, 48.35, 48.35, 48.35, 48.35, 48.35, 
48.35, 48.35, 12.1, 12.1, 12.1, 12.1, 12.1, 12.1, 12.1, 12.1, 
15.17, 15.17, 15.17, 15.17, 15.17, 15.17, 15.17, 15.17, 18.24, 
18.24, 18.24, 18.24, 18.24, 18.24, 18.24, 18.24, 24.39, 24.39, 
24.39, 24.39, 24.39, 24.39, 24.39, 24.39, 48.35, 48.35, 48.35, 
48.35, 48.35, 48.35, 48.35, 48.35, 12.1, 12.1, 12.1, 12.1, 12.1, 
12.1, 12.1, 12.1, 15.17, 15.17, 15.17, 15.17, 15.17, 15.17, 15.17, 
15.17, 18.24, 18.24, 18.24, 18.24, 18.24, 18.24, 18.24, 18.24, 
24.39, 24.39, 24.39, 24.39, 24.39, 24.39, 24.39, 24.39, 48.35, 
48.35, 48.35, 48.35, 48.35, 48.35, 48.35, 48.35), total = c(3.66, 
2.02, 1.59, 1.67, 2.12, 2.46, 1.79, 2.09, 2.03, 2.13, 1.83, 2.34, 
2.66, 2.2, 1.79, 1.97, 2.17, 2.44, 1.49, 2.19, 2.92, 2.43, 1.58, 
2.07, 2.48, 2.49, 1.69, 2.1, 2.38, 2.52, 2.41, 2.46, 2.22, 2.07, 
1.97, 2.3, 2.48, 3.16, 1.76, 2.38, 2.81, 2.64, 2.59, 3.28, 3.18, 
2.57, 2.9, 3, 2.38, 2.72, 2.58, 2.73, 3.06, 3.01, 3.01, 2.77, 
2.95, 2.36, 2.91, 2.38, 3.33, 3.19, 3.17, 3.16, 3.16, 3.2, 2.58, 
3.71, 3.11, 2.7, 2.92, 1.93, 2.95, 2.57, 2.68, 2.48, 3.34, 2.75, 
2.52, 1.88, 1.19, 0.57, 0.64, 0.66, 1.13, 1.28, 0.85, 0.96, 1.34, 
2.14, 0.63, 1.27, 1.13, 0.64, 1.21, 1.95, 1.11, 0.91, 0.75, 0.63, 
1.06, 1.07, 1.05, 0.8, 1.41, 1.13, 0.75, 0.89, 1.98, 1.27, 1.01, 
1, 1.16, 0.64, 0.64, 1.02, 1.03, 1.13, 0.79, 0.6, 3.88, 2.79, 
2.73, 2.77, 3.54, 2.05, 1.51, 1.88, 3.86, 3.13, 1.97, 3.46, 3.98, 
3.6, 2.12, 2.86, 2.95, 1.65, 1.94, 2.53, 2.21, 1.94, 2.05, 2.22, 
3, 3.28, 1.55, 3.85, 2.4, 2.1, 1.98, 1.81, 2.48, 1.66, 2.06, 
1.23, 3.75, 1.99, 1.67, 1.93)), class = "data.frame", row.names = c(NA, 
-160L))

boxplot(total~cultivar*as.factor(P),data=x)

带离群值的箱线图

我想要的是这个...

期望结果

我尝试了下面的例子但不起作用...

boxplot(total~cultivar*as.factor(P),data=x,id=list(n=Inf))

识别散点图上的异常值可以更轻松地将其从分析中移除。但是,这并不像我想的那样简单。这篇文章要求我添加更多细节,但我认为已经足够了。


你想要绘制的这些变量是什么?lfp?wc?k5? - G5W
我正在尝试弄清楚如何标记异常值,就像第二个图中一样。 - Eliott Reed
@G5W 第一个图是他们从代码中得到的,第二个是期望的输出。 - M--
如果我在OP的数据上运行OP的代码,我无法得到OP的第一个图。 - G5W
@G5W 我也是这样,不过我把它归咎于采样问题。 耸肩 - r2evans
显示剩余3条评论
2个回答

4
您可以使用car包:
library(car)

Boxplot(total ~ cultivar*as.factor(P), id.method="y", data = x)

更新:

car::Boxplot中是否可能翻转坐标轴?

出于挑战的目的,我尝试了一些巧妙的方法。 最终,我能够旋转图表,但这不像ggplot2::coord_flip那样传统。 在这里,我只是旋转了图表,所以标签仍然按照以前的对齐方式排列。 我们可以更进一步,删除标签并重写它们,但这将违背此解决方案的整体简单性。

library(car)
library(gridGraphics)

p <- Boxplot(total ~ cultivar*as.factor(P), id.method="y", data = x)

grab_grob <- function(){
  grid.echo()
  grid.grab()
}

g <- grab_grob()
grid.newpage()
pushViewport(viewport(width=0.5,angle=90))
grid.draw(g)


1
我喜欢当我的答案有30多行代码,然后有人用一行代码就解决了问题。 - r2evans
2
@r2evans 抱歉,我不是要贬低你的努力 ;) - M--
你如何翻转这个图表,以便像下面的答案一样轻松阅读标签? - Eliott Reed
我认为你可以添加 horizontal = TRUE,因为我认为它在内部使用基本的R boxplot(它接受该参数)。 - r2evans
很遗憾,水平=TRUE在car包中的“Boxplot”无法使用。 - Eliott Reed
@EliottReed 很抱歉,car::Boxplot 不支持旋转。如果你加载 gridGraphics 并将箱线图存到一个变量中(例如 p),然后运行 str(p),它会返回异常值以及无法剪切图形旋转的一些警告等信息。这个答案解释了部分情况:https://dev59.com/Bm865IYBdhLWcg3wfemU#3793658。这是 package carer 应该添加到函数中的一些内容:https://github.com/cran/car。通过定义更好的大小或将图形拆分成块来解决可读性问题(一些想法)。祝好。 - M--

3

不幸的是,boxplot 返回一个列表结构,提供异常值的值(例如,boxplot(..., plot=FALSE)$out),但由于其他组中存在相等的值,因此这并没有帮助到这里的问题。(事实上,我发现在使用 $out 时总是有点冒险,除非只有一组。)

但是您可以使用 $stats 来获取盒须图参数并自行查找所有内容。不幸的是,这不是一个一行代码的解决方案。

首先,我不知道您所说的“id”是什么意思,因此我会向数据中添加一些内容:

x$id <- seq_len(nrow(x))

base R

bp <- boxplot(total ~ cultivar * as.factor(P), data = x)
lims <- data.frame(nm = bp$names, t(bp$stats[c(1,5),]))
tmpx <- merge(transform(x, nm = paste(cultivar, as.factor(P), sep = ".")), lims, by = "nm", all.x = TRUE)
tmpx <- subset(tmpx, total < X1 | total > X2)
tmpx$xval <- match(tmpx$nm, bp$names)
text(total ~ xval, id, data = tmpx, adj = c(-0.5, 0.5))

带异常值id的基本R箱线图

在箱线图上叠加文本可能会成为一个问题;您可以尝试移动和/或翻转坐标来控制这个问题。剪切(没有在这里显示,但是当文本标签消失在绘图区域外时)也可能成为一个问题,因此您可能需要手动控制绘图区域的限制。

dplyr

如果您喜欢以tidyverse方式查看数据整理,这里有一个替代方案,可以产生相同的图形。

library(dplyr)
bp <- boxplot(total ~ cultivar * as.factor(P), data = x)
x %>%
  mutate( nm = paste(cultivar, as.factor(P), sep = ".") ) %>%
  left_join(data.frame(nm = bp$names, t(bp$stats[c(1,5),]), stringsAsFactors = FALSE),
            by = "nm") %>%
  filter(total < X1 | total > X2) %>%
  mutate(xval = match(nm, bp$names)) %>%
  text(data = ., total ~ xval, as.character(id), adj = c(-0.5, 0.5))

(同一个情节。)

dplyrggplot2

library(dplyr)
library(ggplot2)
bp <- boxplot(total ~ cultivar * as.factor(P), data = x, plot = FALSE)
x %>%
  mutate( nm = paste(cultivar, as.factor(P), sep = ".") ) %>%
  left_join(data.frame(nm = bp$names, t(bp$stats[c(1,5),]), stringsAsFactors = FALSE),
            by = "nm") %>%
  mutate(outlier = total < X1 | total > X2) %>%
  ggplot(aes(interaction(cultivar, P), total)) +
  geom_boxplot() +
  geom_text(aes(label = id), hjust = -0.5, data = ~ filter(., outlier)) +
  coord_flip()

ggplot2箱线图,带异常值标识

我选择翻转坐标轴,这样标签就可以全部包含和显示,但这不是必需的方法。我使用了一个技巧,就是ggplot2函数的data=参数可以采用表达式(我认为它是波浪符函数),这允许在原地对主数据集进行子集化。在这里,我使用dplyr::filter,但在这种情况下,如果您没有使用dplyr,使用subsetbase R)同样容易。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接