Filtering a ggplot2 density plot by number of observations

Question

3

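Is it possible, within a ggplot2 call, to filter out the subsets of the data that have only a small number of observations?

For example, take the following plot: qplot(price,data=diamonds,geom="density",colour=cut) The plot is rather busy, and I would like to exclude the cut values with a low number of observations, i.e.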

> xtabs(~cut,diamonds)
cut
     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551

the Fair and Good qualities of the cut factor.

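I would like a solution that can be adapted to an arbitrary dataset and, if possible, one that can select not only by a threshold number of observations but also, for example, the top 3 categories by count.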

- James

4 Answers

3

Here is my suggestion. First, create a function that returns the categories with the most observations.

firstx <- function (category, data, x = 1:3) {
  tab <- xtabs(~category, data)

  dimnames(tab)$category[order(tab, decreasing = TRUE)[x]]
}

#Then use subset to subset the data and droplevels to drop unused levels
#so they don't clutter the legend.
ggplot(droplevels(subset(diamonds, cut %in% firstx(cut, diamonds))), 
       aes(price, color = cut)) + geom_density()
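
The question also asks for selection by a threshold number of observations, which firstx() does not cover. A minimal sketch of a threshold-based companion follows; the name atleast() and the toy factor grp are hypothetical and not part of the original answer.

```r
# Hypothetical companion to firstx(): names of the categories that have
# at least `n` observations.
atleast <- function(category, n) {
  tab <- table(category)   # count observations per category
  names(tab)[tab >= n]     # keep the categories that reach the threshold
}

# Toy factor standing in for a real grouping column
grp <- factor(c("a", "a", "a", "b", "b", "c"))
atleast(grp, 2)
```

With the diamonds data it could presumably be used the same way as firstx(), for example subset(diamonds, cut %in% atleast(cut, 5000)), where 5000 is an arbitrary cut-off chosen for illustration.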

I hope this helps.

- Luciano Selzer

2

It seems like this calls for writing your own subsetting function, perhaps something like this:

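mySubset <- function(dat, largestK = 3, thresh = NULL) {
   tbl <- table(dat)
   if (is.null(thresh)) {
      # no threshold given: keep the largestK most frequent values
      keep <- tail(names(sort(tbl)), largestK)
   } else {
      # threshold given: keep values observed at least thresh times
      # (compare the counts, not the data values, against thresh)
      keep <- names(tbl)[tbl >= thresh]
   }
   dat %in% keep
}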

It can then be used inside the ggplot call like this:

ggplot(diamonds[mySubset(diamonds$cut),],...)

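This code does not handle dropping unused factor levels, so beware. For that reason I usually keep categorical variables as characters unless I absolutely need them to be ordered.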

- joran

Thanks, this works as expected. You can drop the unused levels by re-factoring cut inside the colour call. - James
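
A minimal sketch of the re-factoring mentioned in the comment above, assuming ggplot2 is installed. The object names counts, top3 and p are made up for the example. Whether unused levels actually show up in the legend may depend on the ggplot2 version and scale settings; the sketch only shows that factor() rebuilds the levels from the rows that remain.

```r
library(ggplot2)

# Rows belonging to the three most frequent cuts
counts <- table(diamonds$cut)
top3 <- diamonds[diamonds$cut %in% names(sort(counts, decreasing = TRUE))[1:3], ]

nlevels(top3$cut)          # subsetting alone keeps all of the original levels
nlevels(factor(top3$cut))  # factor() keeps only the levels still present

# Re-factor inside aes() so the colour scale only sees the remaining levels
p <- ggplot(top3, aes(price, colour = factor(cut))) + geom_density()
```

Note that the legend title then becomes factor(cut) unless it is renamed, for instance with labs(colour = "cut").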

1

## Top 3 cuts
library(ggplot2)
tmp <- names(sort(summary(diamonds$cut), decreasing = TRUE))[1:3]
## use %in% for membership; == would recycle tmp element-wise and drop rows
tmp <- droplevels(subset(diamonds, cut %in% tmp))
ggplot(tmp, aes(price, color=cut)) + geom_density()

But have you considered faceting?

ggplot(diamonds, aes(price, color=cut)) + geom_density() + facet_grid(~cut)

- Brandon Bertelsen

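Thanks Brandon, but the data I am working with has quite a lot of factor levels, so I really want a way to select only the most populous ones; otherwise space and clarity become a problem. - James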

In your question you wrote top 3, but you named Fair and Good, which are actually the bottom 2. If you want the latter, remove the decreasing argument from my solution and change [1:3] to [1:2]. - Brandon Bertelsen
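
The change described in this comment can be sketched as follows, assuming ggplot2 is installed. The names bottom2, only_bottom2 and rest are made up for the example, and %in% is used for the membership test. The negated form is an addition that is not in the comment; it covers the case of excluding the two smallest groups instead of selecting them.

```r
library(ggplot2)

## Bottom 2 cuts: sort ascending (the default) and take the first two names
bottom2 <- names(sort(summary(diamonds$cut)))[1:2]

## Keep only those two cuts ...
only_bottom2 <- droplevels(subset(diamonds, cut %in% bottom2))

## ... or exclude them and keep everything else
rest <- droplevels(subset(diamonds, !(cut %in% bottom2)))

p <- ggplot(rest, aes(price, color = cut)) + geom_density()
```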

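It was not very clear because I put the xtabs output in the middle of the sentence, but I do want to exclude Fair and Good. Your new solution works as expected, though. Thanks! - James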

- kohske · Accepted Answer

library(ggplot2)
library(plyr)  # provides count(), arrange(), desc() and .()
ggplot(subset(diamonds, cut %in% arrange(count(diamonds, .(cut)), desc(freq))[1:3,]$cut),
  aes(price, colour=cut)) + 
  geom_density() + facet_grid(~cut)

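count (from the plyr package) tallies how often each value of the given variable, here cut, occurs in the data frame.
arrange sorts a data frame by the specified column.
desc reverses the sort order.
Finally, %in% is used to keep only the rows whose cut is among the top three.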
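
The steps above can be unpacked into intermediate objects, which may make the one-liner easier to adapt. This is only a sketch, assuming both ggplot2 and plyr are installed; the names cut_counts, sorted, top3, top3_rows and big_cuts are made up, and the threshold of 5000 is an arbitrary value chosen for illustration.

```r
library(ggplot2)  # for the diamonds data
library(plyr)     # count(), arrange(), desc(), .()

cut_counts <- count(diamonds, .(cut))        # one row per cut, plus a freq column
sorted     <- arrange(cut_counts, desc(freq)) # most frequent cut first
top3       <- sorted[1:3, ]$cut               # the three most frequent cuts
top3_rows  <- subset(diamonds, cut %in% top3) # rows belonging to those cuts

# Threshold variant: cuts with at least 5000 observations (arbitrary cut-off)
big_cuts <- cut_counts$cut[cut_counts$freq >= 5000]
```

Either top3 or big_cuts can then be passed to the %in% test inside subset().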