Make a ggplot2 histogram show class-wise percentages on the y-axis

Question


library(ggplot2)
data = diamonds[, c('carat', 'color')]
data = data[data$color %in% c('D', 'E'), ]

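I want to compare the carat histograms for colors D and E, with the class-wise percentage on the y-axis. The solutions I have tried are as follows:

Solution 1: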

ggplot(data=data, aes(carat, fill=color)) +  geom_bar(aes(y=..density..), position='dodge', binwidth = 0.5) + ylab("Percentage") +xlab("Carat")

This is not quite right, because the y-axis shows the height of the estimated density.

Solution 2:

 ggplot(data=data, aes(carat, fill=color)) +  geom_histogram(aes(y=(..count..)/sum(..count..)), position='dodge', binwidth = 0.5) + ylab("Percentage") +xlab("Carat")

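This is not what I want either, because the denominator used to compute the ratio on the y-axis is the total count of D + E.

Is there a way to display class-wise percentages with ggplot2's stacked histograms? That is, instead of showing (number of obs in bin) / count(D + E) on the y-axis, I would like it to show (number of obs in bin) / count(D) for color D and (number of obs in bin) / count(E) for color E. Thanks.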

- Feng Mai

Have you considered summarizing your data outside of ggplot? - Roman Luštrik

4 Answers

It looks like binning the data outside of ggplot2 is the way to go. But I would still like to know whether there is a way to do it within ggplot2.

library(dplyr)
breaks = seq(0,4,0.5)

# bin carat into right-closed intervals (0,0.5], (0.5,1], ...
data$carat_cut = cut(data$carat, breaks = breaks)

data_cut = data %>%
  group_by(color, carat_cut) %>%
  summarise(n = n()) %>%
  # summarise() drops the last grouping level, so sum(n) is taken within each color
  mutate(freq = n / sum(n))

ggplot(data=data_cut, aes(x = carat_cut, y=freq*100, fill=color)) + geom_bar(stat="identity",position="dodge") + scale_x_discrete(labels = breaks) +  ylab("Percentage") +xlab("Carat")
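The same per-color percentages can also be checked without dplyr. The following is a minimal sketch in base R, assuming the same D/E subset and the same 0.5-wide bins as above; the names d and pct are only illustrative.

```r
library(ggplot2)  # used here only for the diamonds data set

# Same subset as in the question: carat and color for colors D and E.
d <- diamonds[diamonds$color %in% c("D", "E"), c("carat", "color")]
d$color <- droplevels(d$color)  # drop the unused color levels

# Bin carat into the same 0.5-wide, right-closed intervals.
d$carat_cut <- cut(d$carat, breaks = seq(0, 4, 0.5))

# Rows are colors, columns are bins; margin = 1 divides each row by its own total.
pct <- prop.table(table(d$color, d$carat_cut), margin = 1) * 100

round(pct, 2)
rowSums(pct)  # each color's percentages add up to 100
```

Because each row is divided by its own total, the denominator is count(D) for the D row and count(E) for the E row, which is the scaling the question asks for.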


- Feng Mai

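When I tried Rorschach's answer it did not work for me, for reasons that were not obvious. Still, I wanted to point out that if you are willing to add a density line to the histogram, the y-axis is automatically changed to percentages.

For example, I have counts of "doses" split by a binary outcome (0, 1).

This code produces the following plot: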

ggplot(data, aes(x=siadoses, fill=recallbin, color=recallbin)) +
  geom_histogram(binwidth=1, alpha=.5, position='identity')

But when I add a density plot to my ggplot code and set y = ..density.., I get this plot with percentages on the y-axis:

ggplot(data, aes(x=siadoses, fill=recallbin, color=recallbin)) +
  geom_histogram(aes(y=..density..), binwidth=1, alpha=.5, position='identity') +
  geom_density(alpha=.2)

This is more of a workaround for your original question, but I wanted to share it.

- Megan Halbrook

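In ggplot2 3.4.3, the extra call to geom_density() does not seem to be needed. - undefined

Fortunately, in my case Rorschach's answer solved my problem perfectly. I came here looking for a way to avoid the solution proposed by Megan Halbrook, because I had realized it is not a correct one.

Adding a density line to the histogram, with y = ..density.., changes the y-axis to frequency density, not to percentage. Frequency density is count / (total count × binwidth), so it coincides with the proportion (percentage / 100) only when binwidth = 1.

From a Google search: To draw a histogram, first find the class width of each category. The area of the bar represents the frequency, so to find the height of the bar, divide the frequency by the class width. This is called the frequency density. https://www.bbc.co.uk/bitesize/guides/zc7sb82/revision/9

Below is an example that shows percentages on the y-axis in the left plot and density in the right plot.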

library(ggplot2)
library(gridExtra)

TABLE <- data.frame(vari = c(0,1,1,2,3,3,3,4,4,4,5,5,6,7,7,8))

## selected binwidth
bw <- 2

## plot using count
plot_count <- ggplot(TABLE, aes(x = vari)) + 
   geom_histogram(aes(y = ..count../sum(..count..)*100), binwidth = bw, col =1) 
## plot using density
plot_density <- ggplot(TABLE, aes(x = vari)) + 
   geom_histogram(aes(y = ..density..), binwidth = bw, col = 1)

## visualize together
grid.arrange(ncol = 2, grobs = list(plot_count,plot_density))

## visualize the values
data_count <- ggplot_build(plot_count)
data_density <- ggplot_build(plot_density)

## using ..count../sum(..count..)*100 the values of the y axis are the same as
## density * binwidth * 100. This is because density shows the "frequency density".
## all.equal() compares with a numeric tolerance, which is safer than == for doubles.
all.equal(data_count$data[[1]]$y, data_count$data[[1]]$density * bw * 100)
## using ..density.. the values of the y axis are the "frequency densities".
all.equal(data_density$data[[1]]$y, data_density$data[[1]]$density)


## manually calculated percentage for each range of the histogram. Note 
## geom_histogram use right-closed intervals.
min_range_of_intervals <- data_count$data[[1]]$xmin

for(i in min_range_of_intervals)
  cat(paste("Values >",i,"and <=",i+bw,"involve a percent of",
            sum(TABLE$vari>i & TABLE$vari<=(i+bw))/nrow(TABLE)*100),"\n")

# Values > -1 and <= 1 involve a percent of 18.75 
# Values > 1 and <= 3 involve a percent of 25 
# Values > 3 and <= 5 involve a percent of 31.25 
# Values > 5 and <= 7 involve a percent of 18.75 
# Values > 7 and <= 9 involve a percent of 6.25

- MarinaGA


- Rorschach · Accepted Answer

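Calculate from stats

You can scale by group using the special stat variables group and count, with group selecting the subset of count that belongs to each group.

If you have ggplot2 3.3.0 or newer, you can use the after_stat() function to access these special variables: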

ggplot(data, aes(carat, fill=color)) +
  geom_histogram(
    aes(y=after_stat(c(
      count[group==1]/sum(count[group==1]),
      count[group==2]/sum(count[group==2])
    )*100)),
    position='dodge',
    binwidth=0.5
  ) +
  ylab("Percentage") + xlab("Carat")

[Figure: ggplot of Percentage against Carat, with two sets of bars, each showing percentages within its own color]
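The expression above lists each group by hand. A possible generalization, offered as a sketch rather than as part of the original answer, is to let base R's ave() compute the per-group totals so that the same mapping works for any number of groups. It assumes ggplot2 3.3.0 or newer (for after_stat()); the names d, p, and built are illustrative.

```r
library(ggplot2)

# Same D/E subset as in the question.
d <- diamonds[diamonds$color %in% c("D", "E"), c("carat", "color")]

p <- ggplot(d, aes(carat, fill = color)) +
  geom_histogram(
    # ave() returns, for every bin, the total count of that bin's group,
    # so each group is scaled by its own total rather than by the overall total.
    aes(y = after_stat(count / ave(count, group, FUN = sum) * 100)),
    position = "dodge",
    binwidth = 0.5
  ) +
  ylab("Percentage") + xlab("Carat")

# Check: within each group the bar heights add up to 100.
built <- ggplot_build(p)$data[[1]]
tapply(built$y, built$group, sum)
```

Unlike the explicit c(...) form, this does not depend on the rows of the computed data being ordered by group.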

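Using older versions of ggplot2

In earlier versions this is more cumbersome. If you have at least version 3.0, you can wrap the stat() function around each individual variable reference. In versions before 3.0, you have to surround each variable name with double dots instead: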

aes(y=c(
  ..count..[..group..==1]/sum(..count..[..group..==1]),
  ..count..[..group..==2]/sum(..count..[..group..==2])
)*100),

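What exactly do these stats mean?

As for where these variables come from: the summary statistics are documented together with the stat function being used. For example, stat_bin(), the default stat for geom_histogram(), has this "Computed variables" section: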

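Computed variables:

count: number of points in bin

density: density of points in bin, scaled to integrate to 1

ncount: count, scaled to a maximum of 1

ndensity: density, scaled to a maximum of 1

width: width of the bins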
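To make the relationships between these variables concrete, here is a small base R sketch that recomputes them by hand. It follows the definitions listed above as I read them and uses hist() rather than ggplot2 itself; the toy vector x and the bin edges are assumptions chosen for illustration.

```r
# Toy data and bin edges (assumptions for illustration).
x <- c(0, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7, 7, 8)
edges <- seq(-1, 9, by = 2)

# hist() uses right-closed bins by default.
h <- hist(x, breaks = edges, plot = FALSE)

count    <- h$counts
width    <- diff(h$breaks)
density  <- count / (sum(count) * width)  # bar areas sum to 1
ncount   <- count / max(count)            # tallest bar has height 1
ndensity <- density / max(density)        # tallest bar has height 1

data.frame(count, width, density, ncount, ndensity)

all.equal(density, h$density)  # matches the density computed by hist()
sum(density * width)           # total area under the bars
```

With equal-width bins, count / sum(count) equals density * width, which is why density can be read as a proportion only when the bin width is 1.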

Beyond that, you can use ggplot_build() to inspect all the stats generated for any given plot:

> p = ggplot(data, [...etc...])
> ggplot_build(p)
$data
$data[[1]]
        fill           y count      x  xmin xmax      density       ncount
1  #440154FF  1.50553506   102 -0.125 -0.25 0.00 0.0301107011 0.0224323730
2  #440154FF 67.11439114  4547  0.375  0.25 
[...snip...]
       ndensity flipped_aes PANEL group ymin        ymax colour size linetype
1  0.0224323730       FALSE     1     1    0  1.50553506     NA  0.5        1
2  1.0000000000       FALSE     1     1    0 67.11439114     NA  0.5        1
[...snip...]