使用ggplot2为矩阵相关热图添加显著性水平

31
我想知道如何在矩阵相关热图中添加另一层重要且必要的复杂性,例如在R2值(-1至1)之外以显著水平星号方式加入p值。请注意,在本问题中并不打算将显著性水平星号或p值作为文本放置在矩阵的每个方格中,而是通过图形化的方法来显示矩阵每个方格中的显著性水平,这需要具备创新思维才能找到最佳解决方案。我搜索了很多,但从未见过一种适当或者说“眼睛友好”的方法来表示显著性水平和反映R系数的标准色调。可重现的数据集在此处找到:http://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/。以下是R代码:
library(ggplot2)
library(plyr) # might be not needed here anyway it is a must-have package I think in R 
library(reshape2) # to "melt" your dataset
library (scales) # it has a "rescale" function which is needed in heatmaps 
library(RColorBrewer) # for convenience of heatmap colors, it reflects your mood sometimes
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba <- as.data.frame(cor(nba[2:ncol(nba)])) # convert the matrix correlations to a dataframe 
nba.m <- data.frame(row=rownames(nba),nba) # create a column called "row"
rownames(nba) <- NULL #get rid of row names
nba <- melt(nba)
nba.m$value<-cut(nba.m$value,breaks=c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),include.lowest=TRUE,label=c("(-0.75,-1)","(-0.5,-0.75)","(-0.25,-0.5)","(0,-0.25)","(0,0.25)","(0.25,0.5)","(0.5,0.75)","(0.75,1)")) # this can be customized to put the correlations in categories using the "cut" function with appropriate labels to show them in the legend, this column now would be discrete and not continuous
nba.m$row <- factor(nba.m$row, levels=rev(unique(as.character(nba.m$variable)))) # reorder the "row" column which would be used as the x axis in the plot after converting it to a factor and ordered now
#now plotting
ggplot(nba.m, aes(row, variable)) +
geom_tile(aes(fill=value),colour="black") +
scale_fill_brewer(palette = "RdYlGn",name="Correlation")  # here comes the RColorBrewer package, now if you ask me why did you choose this palette colour I would say look at your battery charge indicator of your mobile for example your shaver, won't be red when gets low? and back to green when charged? This was the inspiration to choose this colour set.

矩阵相关性热图应该长这样:
enter image description here

提示和想法以增强解决方案:
- 此代码可能有用,可从此网站获取关于显著性水平星号的想法:
http://ohiodata.blogspot.de/2012/06/correlation-tables-in-r-flagged-with.html
R 代码:
mystars <- ifelse(p < .001, "***", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " "))) # so 4 categories  

- 可以将显著性水平添加为每个正方形的颜色强度,就像透明度美学一样,但我认为这可能不容易解释和捕捉
- 另一个想法是有4种不同大小的正方形对应于星星,当然给非显著性最小的,并增加到最高星星的全尺寸正方形
- 另一个想法是在那些显著的正方形内部包含一个圆圈,圆圈的线条粗细与显著性水平相对应(剩下的3个类别),都是同一种颜色
- 与上述相同,但固定线条厚度,同时为剩余的3个显著级别提供3种颜色
- 也许你会想出更好的主意,谁知道呢?


2
你的代码启发了我使用ggplot2重写arm::corrplot函数:http://rpubs.com/briatte/ggcorr - Fr.
它运行得很好!您能否将此函数扩展以使那些不显著的相关性(例如<0.05)消失,同时保持那些等于或高于该值的相关性。在这里,应该使用另一个矩阵BUT与p值一起提供给函数。我与您分享了这个函数,它可以帮助您获得p矩阵(您可以使用:cor.prob.all() cor.prob.all <- function(X,dfr = nrow(X) - 2){R <- cor(X,use =“pairwise.complete.obs”,method =“spearman”)r2 <- R ^ 2 Fstat <- r2 * dfr /(1-r2)R <- 1-pf(Fstat,1,dfr)R [row(R)== col(R)] <- NA R})。 - doctorate
感谢您的反馈。我对此处(以及其他地方)使用$p$值持怀疑态度,但我会尝试想出一些方法来标记不显著的系数。 - Fr.
上述函数现在成为GGally包的一部分,由包的维护者进行了修正和补充。 - Fr.
(-1,-0.75)这个颜色在哪里?使用c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),我们应该有8个间隔和8种颜色,而不是7种... - Gildas
3个回答

31
这只是一个朝着最终解决方案的尝试,我在这里绘制了星星作为解决方案的指示器,但正如我所说,目标是找到一个比星星更能表达的图形解决方案。我只是使用geom_point和alpha来表示显著性水平,但问题是NAs(包括非显著值)会显示为三颗星的显著性水平,如何修复这个问题?我认为使用一种颜色可能更加友好,当使用多种颜色时,避免给图形负担过多的细节以供眼睛解析。提前感谢。
这是我第一次尝试的图:
enter image description here

或者也许这个更好?!
enter image description here

我认为到目前为止最好的是下面这个,直到你想出更好的东西为止! enter image description here

根据要求,以下代码是用于最后一个热图:
# Function to get the probability into a whole matrix not half, here is Spearman you can change it to Kendall or Pearson
cor.prob.all <- function (X, dfr = nrow(X) - 2) {
R <- cor(X, use="pairwise.complete.obs",method="spearman")
r2 <- R^2
Fstat <- r2 * dfr/(1 - r2)
R<- 1 - pf(Fstat, 1, dfr)
R[row(R) == col(R)] <- NA
R
}
# Change matrices to dataframes
nbar<- as.data.frame(cor(nba[2:ncol(nba)]),method="spearman") # to a dataframe for r^2
nbap<- as.data.frame(cor.prob.all(nba[2:ncol(nba)])) # to a dataframe for p values
# Reset rownames
nbar <- data.frame(row=rownames(nbar),nbar) # create a column called "row" 
rownames(nbar) <- NULL
nbap <- data.frame(row=rownames(nbap),nbap) # create a column called "row" 
rownames(nbap) <- NULL
# Melt
nbar.m <- melt(nbar)
nbap.m <- melt(nbap)
# Classify (you can classify differently for nbar and for nbap also)         
nbar.m$value2<-cut(nbar.m$value,breaks=c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),include.lowest=TRUE, label=c("(-0.75,-1)","(-0.5,-0.75)","(-0.25,-0.5)","(0,-0.25)","(0,0.25)","(0.25,0.5)","(0.5,0.75)","(0.75,1)")) # the label for the legend
nbap.m$value2<-cut(nbap.m$value,breaks=c(-Inf, 0.001, 0.01, 0.05),label=c("***", "** ", "*  ")) 
nbar.m<-cbind.data.frame(nbar.m,nbap.m$value,nbap.m$value2) # adding the p value and its cut to the first dataset of R coefficients
names(nbar.m)[5]<-paste("valuep") # change the column names of the dataframe 
names(nbar.m)[6]<-paste("signif.")
nbar.m$row <- factor(nbar.m$row, levels=rev(unique(as.character(nbar.m$variable)))) # reorder the variable factor
# Plotting the matrix correlation heatmap
# Set options for a blank panel
po.nopanel <-list(theme(panel.background=theme_blank(),panel.grid.minor=theme_blank(),panel.grid.major=theme_blank()))
pa<-ggplot(nbar.m, aes(row, variable)) +
geom_tile(aes(fill=value2),colour="white") +
scale_fill_brewer(palette = "RdYlGn",name="Correlation")+ # RColorBrewer package
theme(axis.text.x=theme_text(angle=-90))+
po.nopanel
pa # check the first plot
# Adding the significance level stars using geom_text 
pp<- pa +
geom_text(aes(label=signif.),size=2,na.rm=TRUE) # you can play with the size
# Workaround for the alpha aesthetics if it is good to represent significance level, the same workaround can be applied for size aesthetics in ggplot2 as well. Applying the alpha aesthetics to show significance is a little bit problematic, because we want the alpha to be low while the p value is high, and vice verse which can't be done without a workaround
nbar.m$signif.<-rescale(as.numeric(nbar.m$signif.),to=c(0.1,0.9)) # I tried to use to=c(0.1,0.9) argument as you might expect, but to avoid problems with the next step of reciprocal values when dividing over one, this is needed for the alpha aesthetics as a workaround
nbar.m$signif.<-as.factor(0.09/nbar.m$signif.) # the alpha now behaves as wanted  except for the NAs values stil show as if with three stars level, how to fix that?
# Adding the alpha aesthetics in geom_point in a shape of squares (you can improve here)
pp<- pa +
geom_point(data=nbar.m,aes(alpha=signif.),shape=22,size=5,colour="darkgreen",na.rm=TRUE,legend=FALSE) # you can remove this step, the result of this step is seen in one of the layers in the above green heatmap, the shape used is 22 which is again a square but the size you can play with it accordingly  

希望这能成为向前迈进的一步!请注意:
- 有人建议以不同方式分类或切割R^2,当然可以这样做,但我们仍然希望以图形方式向观众展示显著性水平,而不是用星级来困扰他们的眼睛。原则上我们能实现这一点吗?
- 有人建议以不同方式切割p值,如果在不困扰眼睛的情况下无法显示三个显著性水平,那么这可能是一个选择。然后最好显示显著/非显著而不带有水平。
- 对于上述在ggplot2中关于alpha和size美学的解决方法,你可能会想到更好的主意,希望尽快听到你的回复!
- 这个问题还没有得到答案,期待创新的解决方案!
- 有趣的是,“corrplot”包可以做到!我通过这个包得到了下面的图形,附注:交叉的方块表示不显著,显著水平为0.05。但我们如何将其转换为ggplot2呢,我们能做到吗?

enter image description here

-或者你可以做圆圈并隐藏那些非重要的内容?如何在ggplot2中实现这个?

1
精美的图表,但是人们应该意识到,在相关系数热力图的这种情况下,p值可能并不意味着他们期望的(或任何)东西。如果你有这么多可能的相关性($n^2-n/2$),即使是>.99的p值也开始变得有些可能了。 在这种情况下过度依赖p值可能被认为是另一种名为“p-hacking”的行为。这个xkcd非常好地解释了它。 - maxheld
opts() 已经很久以前被弃用。请使用 theme() 选项。 - Shicheng Guo
@ShichengGuo 好的,已经编辑好了,谢谢。 - doctorate

2
library("corrplot")
nba <- as.matrix(read.csv("https://raw.githubusercontent.com/Shicheng-Guo/Shicheng-Guo.Github.io/master/data/ppg2008.csv")[-1])
res1 <- cor.mtest(nba, conf.level = .95)
par(mfrow=c(2,2))

# correlation and P-value
corrplot(cor(nba), p.mat = res1$p, insig = "label_sig",sig.level = c(.001, .01, .05), pch.cex = 0.8, pch.col = "white",tl.cex=0.8)

# correlation and hclust
corrplot(cor(nba), method = "shade", outline = T, addgrid.col = "darkgray", order="hclust", 
         mar = c(4,0,4,0), addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", 
         tl.cex = 0.8, cl.cex = 0.8)

1
为了表示估计相关系数的显著性,您可以通过改变着色量来实现 - 可以使用 alpha 或仅填充每个瓷砖的子集:
# install.packages("fdrtool")
# install.packages("data.table")
library(ggplot2)
library(data.table)

#download dataset
nba <- as.matrix(read.csv("http://datasets.flowingdata.com/ppg2008.csv")[-1])
m <- ncol(nba)
# compute corellation and p.values for all combinations of columns
dt <- CJ(i=seq_len(m), j=seq_len(m))[i<j]
dt[, c("p.value"):=(cor.test(nba[,i],nba[,j])$p.value), by=.(i,j)]
dt[, c("corr"):=(cor(nba[,i],nba[,j])), by=.(i,j)]

# estimate local false discovery rate
dt[,lfdr:=fdrtool::fdrtool(p.value, statistic="pvalue")$lfdr]

dt <- rbind(dt, setnames(copy(dt),c("i","j"),c("j","i")), data.table(i=seq_len(m),j=seq_len(m), corr=1, p.value=0, lfdr=0))


#use alpha
ggplot(dt, aes(x=i,y=j, fill=corr, alpha=1-lfdr)) + 
  geom_tile()+
  scale_fill_distiller(palette = "RdYlGn", direction=1, limits=c(-1,1),name="Correlation") +
  scale_x_continuous("variable", breaks = seq_len(m), labels = colnames(nba)) +
  scale_y_continuous("variable", breaks = seq_len(m), labels = colnames(nba), trans="reverse") +
  coord_fixed() +
  theme(axis.text.x=element_text(angle=90, vjust=0.5),
        panel.background=element_blank(),
        panel.grid.minor=element_blank(),
        panel.grid.major=element_blank(),
  )

alpha

#use area
ggplot(dt, aes(x=i,y=j, fill=corr,  height=sqrt(1-lfdr),  width=sqrt(1-lfdr))) + 
  geom_tile()+
  scale_fill_distiller(palette = "RdYlGn", direction=1, limits=c(-1,1),name="Correlation") +
  scale_color_distiller(palette = "RdYlGn", direction=1, limits=c(-1,1),name="Correlation") +
  scale_x_continuous("variable", breaks = seq_len(m), labels = colnames(nba)) +
  scale_y_continuous("variable", breaks = seq_len(m), labels = colnames(nba), trans="reverse") +
  coord_fixed() +
  theme(axis.text.x=element_text(angle=90, vjust=0.5),
        panel.background=element_blank(),
        panel.grid.minor=element_blank(),
        panel.grid.major=element_blank(),
  )

area

关键在于对p值进行缩放:为了获得易于解释的值,只在相关区域显示大的变化,我使用fdrtools提供的局部虚假发现(lfdr)的上限估计。

也就是说,一个图块的α值可能小于或等于该相关性与0不同的概率。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接