glm and stat_smooth in ggplot2 give different predicted values for logistic regression

Question


14

I am trying to draw this logistic regression plot with ggplot2.

df <- structure(list(y = c(2L, 7L, 776L, 19L, 12L, 26L, 7L, 12L, 8L,
24L, 20L, 16L, 12L, 10L, 23L, 20L, 16L, 12L, 18L, 22L, 23L, 22L,
13L, 7L, 20L, 12L, 13L, 11L, 11L, 14L, 10L, 8L, 10L, 11L, 5L,
5L, 1L, 2L, 1L, 1L, 0L, 0L, 0L), n = c(3L, 7L, 789L, 20L, 14L,
27L, 7L, 13L, 9L, 29L, 22L, 17L, 14L, 11L, 30L, 21L, 19L, 14L,
22L, 29L, 28L, 28L, 19L, 10L, 27L, 22L, 18L, 18L, 14L, 23L, 18L,
12L, 19L, 15L, 13L, 9L, 7L, 3L, 1L, 1L, 1L, 1L, 1L), x = c(18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L,
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L,
45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 59L,
62L, 63L, 66L)), .Names = c("y", "n", "x"), class = "data.frame", row.names = c(NA,
-43L))


mod.fit <- glm(formula = y/n ~ x, data = df, weights = n, family = binomial(link = logit),
        na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = TRUE))
summary(mod.fit)

Pi <- c(0.25, 0.5, 0.75)
LD <- (log(Pi /(1-Pi))-mod.fit$coefficients[1])/mod.fit$coefficients[2]
LD.summary <- data.frame(Pi , LD)
LD.summary


plot(df$x, df$y/df$n, xlab = "x", ylab = "Estimated probability")

lin.pred <- predict(mod.fit)
pi.hat <- exp(lin.pred)/(1 + exp(lin.pred))
lines(df$x, pi.hat, lty = 1, col = "red")


segments(x0 = LD.summary$LD, y0 = -0.1, x1 = LD.summary$LD, y1 = LD.summary$Pi,
         lty=2, col=c("darkblue","darkred","darkgreen"))
segments(x0 = 15, y0 = LD.summary$Pi, x1 = LD.summary$LD, y1 = LD.summary$Pi,
         lty=2, col=c("darkblue","darkred","darkgreen"))
legend("bottomleft", legend=c("LD25", "LD50", "LD75"), lty=2, col=c("darkblue","darkred","darkgreen"), bty="n", cex=0.75)

[Image: base R plot of the fitted curve with the LD25, LD50 and LD75 segments]

Here is my attempt with ggplot2.

library(ggplot2)

p <- ggplot(data = df, aes(x = x, y = y/n)) +
            geom_point() +
            stat_smooth(method = "glm", method.args = list(family = "binomial"))  # current ggplot2 takes family via method.args

p <- p + geom_segment(aes(
                            x = LD.summary$LD
                          , y = 0
                          , xend = LD.summary$LD
                          , yend = LD.summary$Pi
                         )
                         , colour="red"
                       )

p <- p + geom_segment(aes(
                            x = 0
                          , y = LD.summary$Pi
                          , xend = LD.summary$LD
                          , yend = LD.summary$Pi
                         )
                         , colour="red"
                       )

print(p)

[Image: ggplot2 version of the plot]

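Questions

The predicted values from glm and stat_smooth look different. Do these two methods produce different results, or am I missing something?
My ggplot2 graph is not exactly the same as the base R graph.
How can I use different colours for the line segments in ggplot2?
How can I add a legend in ggplot2?

Many thanks for your help and time.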

- MYaseen208

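Your base R picture has no legend (the command itself is fine); I will update it to avoid confusion. - mathematical.coffee

@mathematical.coffee: Thanks for your comment. Please see the legend in the bottom-left corner. - MYaseen208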

1

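Yes, that is because I updated the image to include the legend. - mathematical.coffee

In the assignment Pi <- c(0.25, 0.5, 0.75), why is the variable called "Pi"? What does "Pi" stand for? The same question applies to "LD". - Erdogan CEVHER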

2 Answers

6

Modify your LD.summary to add a new column labelled group (or whatever label is appropriate).

LD.summary$group <- c('LD25','LD50','LD75')

Then modify your geom_segment commands to use col=LD.summary$group (and remove colour="red"). This draws each segment in a different colour and adds a legend:

geom_segment( aes(...,col=LD.summary$group) )

Also, to avoid writing LD.summary$xxx all the time, pass data=LD.summary to geom_segment:

geom_segment(data=LD.summary, aes(x=0, y=Pi,xend=LD, yend=Pi, colour=group) )

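As for why the two graphs are not exactly the same: in the base R graph the x axis starts at about 20, whereas in ggplot it starts at zero. This is because your second geom_segment starts at x=0. To fix it, change x=0 to x=min(df$x).

To get the y axis label, use + scale_y_continuous('Estimated probability').

In summary: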

LD.summary$group <- c('LD25','LD50','LD75')
p <- ggplot(data = df, aes(x = x, y = y/n)) +
            geom_point() +
            stat_smooth(method = "glm", method.args = list(family = "binomial")) +  # current ggplot2 takes family via method.args
            scale_y_continuous('Estimated probability')    # <-- add y label
p <- p + geom_segment(data=LD.summary, aes( # <-- data=LD.summary
                            x = LD
                          , y = 0
                          , xend = LD
                          , yend = Pi
                          , col = group     # <- colours
                         )
                       )    
p <- p + geom_segment(data=LD.summary, aes( # <-- data=LD.summary
                            x = min(df$x)   # <-- don't plot all the way to x=0
                          , y = Pi
                          , xend = LD
                          , yend = Pi
                          , col = group     # <- colours
                         )
                       )
print(p)

This produces the following result:

[Image: resulting ggplot2 plot with coloured segments and a legend]

- mathematical.coffee

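@mathematical.coffee: Thanks for your elegant answer. One observation: why do LD25 and LD50 not touch the prediction line the way they do in the base R plot? Any ideas? Thanks. - MYaseen208

@MYaseen208 That has to do with stat_smooth, which produces different numbers from your pi.hat formula: try plotting the first p and then running lines(x,pi.hat,lty=1,col='red') to see what I mean. Unfortunately I know too little about statistics to help (i.e. whether your pi.hat calculation is wrong, or whether stat_smooth does some other calculation you are not aware of). The only thing I can suggest is to look at the online help for stat_smooth and see whether it says anything about how the smoother is calculated. http://had.co.nz/ggplot2/stat_smooth.html - mathematical.coffee

Although I believe it should be easy to adjust the existing answer, in its current form it does not answer the question. That is, the graph is not reproduced, because the end points of the segments do not lie on the curve. - mpiktas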

6

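This happens because stat_smooth is not using the same options as the glm call that produced mod.fit. In particular, the weight option is not being passed. Try adding weight=n to the aes in the ggplot call. - James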
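
A minimal sketch of the point about weights, using the data from the question. It assumes that a smoother given only y/n fits the same model as a glm call with no weights argument; the names fit_w, fit_u, on_weighted and on_unweighted are made up for this illustration. The ggplot2 part passes family through method.args, and only runs if ggplot2 is installed.

```r
# Data from the question: y successes out of n trials at each x.
df <- data.frame(
  y = c(2, 7, 776, 19, 12, 26, 7, 12, 8, 24, 20, 16, 12, 10, 23, 20, 16, 12,
        18, 22, 23, 22, 13, 7, 20, 12, 13, 11, 11, 14, 10, 8, 10, 11, 5, 5,
        1, 2, 1, 1, 0, 0, 0),
  n = c(3, 7, 789, 20, 14, 27, 7, 13, 9, 29, 22, 17, 14, 11, 30, 21, 19, 14,
        22, 29, 28, 28, 19, 10, 27, 22, 18, 18, 14, 23, 18, 12, 19, 15, 13, 9,
        7, 3, 1, 1, 1, 1, 1),
  x = c(18:56, 59, 62, 63, 66)
)

# Fit as in the question: proportions weighted by the number of trials.
fit_w <- glm(y / n ~ x, data = df, weights = n, family = binomial)

# Same formula without weights (assumed stand-in for the unweighted smoother).
# R warns about non-integer successes here, so the warning is suppressed.
fit_u <- suppressWarnings(glm(y / n ~ x, data = df, family = binomial))

# LD values computed from the weighted fit, as in the question.
Pi <- c(0.25, 0.5, 0.75)
b_w <- unname(coef(fit_w))
b_u <- unname(coef(fit_u))
LD <- (qlogis(Pi) - b_w[1]) / b_w[2]

# Height of each curve at the LD values: the weighted curve returns Pi,
# the unweighted one does not, so the segments miss it.
on_weighted   <- plogis(b_w[1] + b_w[2] * LD)
on_unweighted <- plogis(b_u[1] + b_u[2] * LD)
print(data.frame(Pi, LD, on_weighted, on_unweighted))

# Passing the weights to the smoother through the weight aesthetic.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = x, y = y / n, weight = n)) +
    geom_point() +
    stat_smooth(method = "glm", method.args = list(family = "binomial"))
  print(p)
}
```

If the on_unweighted column differs from Pi, that is the gap between the segment end points and the smoothed curve noted in the comments above.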


- joran · Accepted Answer

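A few small additions to @mathematical.coffee's answer. In general, geom_smooth is not meant to replace actual modelling, which is why it can be inconvenient when you want to use specific output from something like glm. But really, all we need to do is add the fitted values to the data frame: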

df$pred <- pi.hat
LD.summary$group <- c('LD25','LD50','LD75')

ggplot(df,aes(x = x, y = y/n)) + 
    geom_point() + 
    geom_line(aes(y = pred),colour = "black") + 
    geom_segment(data=LD.summary, aes(y = Pi,
                                      xend = LD,
                                      yend = Pi,
                                      col = group),x = -Inf,linetype = "dashed") + 
    geom_segment(data=LD.summary,aes(x = LD,
                                     xend = LD,
                                     yend = Pi,
                                     col = group),y = -Inf,linetype = "dashed")

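[Image: resulting ggplot2 plot with dashed segments reaching the plot edges]

The last little trick is the use of Inf and -Inf so that the dashed lines extend to the plot boundary.

The lesson here is that if all you want is to add a smooth curve to a plot, and nothing else in the plot depends on it, use geom_smooth. If you want to refer to the output of a fitted model, it is usually easier to fit the model outside ggplot and then plot.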
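
One way to follow that advice, sketched under a few assumptions: the model is refitted here so the block runs on its own, the fitted probabilities come from predict(..., type = "response") rather than the manual exp() formula, and the curve is evaluated on a finer grid than the observed x values. The grid data frame and its 200 points are choices made for this illustration, not part of the original answer.

```r
# Data from the question: y successes out of n trials at each x.
df <- data.frame(
  y = c(2, 7, 776, 19, 12, 26, 7, 12, 8, 24, 20, 16, 12, 10, 23, 20, 16, 12,
        18, 22, 23, 22, 13, 7, 20, 12, 13, 11, 11, 14, 10, 8, 10, 11, 5, 5,
        1, 2, 1, 1, 0, 0, 0),
  n = c(3, 7, 789, 20, 14, 27, 7, 13, 9, 29, 22, 17, 14, 11, 30, 21, 19, 14,
        22, 29, 28, 28, 19, 10, 27, 22, 18, 18, 14, 23, 18, 12, 19, 15, 13, 9,
        7, 3, 1, 1, 1, 1, 1),
  x = c(18:56, 59, 62, 63, 66)
)

# Fit the model outside ggplot, with the weights from the question.
mod.fit <- glm(y / n ~ x, data = df, weights = n, family = binomial)

# type = "response" applies the inverse logit, so this matches
# exp(lin.pred) / (1 + exp(lin.pred)) from the question.
df$pred <- predict(mod.fit, type = "response")

# Evaluate the curve on a fine grid so the line is smooth (grid size is arbitrary).
grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = 200))
grid$pred <- predict(mod.fit, newdata = grid, type = "response")

# Plot the points and the model curve; skipped if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = x, y = y / n)) +
    geom_point() +
    geom_line(data = grid, aes(y = pred))
  print(p)
}
```

Because the curve comes straight from mod.fit, anything else derived from the same model, such as the LD values, lines up with it.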