如何在 ggplot2 R 中为零膨胀负二项式函数添加置信区间?

3
我有一个数据集,它被建模为带有混合效应的零膨胀负二项分布。我想从模型预测中获得置信区间,并绘制模型的均值和置信区间。我已经尝试绘制了模型均值,但不知道如何将模型的置信区间绘制到 ggplot2 上。我想绘制数据的预测均值及其置信区间。我的基本图表尝试代码如下:
library(pscl)
library(lmtest)

df <- data.frame(
      fertilizer = c("N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P","N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P"), 
      level = c("low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","low"), 
      repro = c(0,90,2,4,0,80,1,90,2,33,56,0,99,100,66,80,1,0,2,33,0,0,1,2,90,5,2,2,5,8,0,1,90,2,4,66,0,0,0,0,1,2,90,5,2,5,8,55)
    )    

    model <- formula(repro ∼ fertilizer + level | fertilizer * level)
    modelzinb <- zeroinfl(model, dist = "negbin", link = "logit",data =df)
    summary(modelzinb)

    df$predict <- predict(modelzinb)

    ggplot(df, aes(x=fertilizer, y=predict, color = fertilizer)) + theme_bw() +  stat_summary(aes(color = fertilizer),fun.y = mean, geom = "point", size = 4, position = position_dodge(0.1)) +
      scale_x_discrete("Fertlizer") +
      facet_wrap(.~level)  

1
zeroinfl 是哪个包里的?请始终包括您使用过的任何非基础 R 包。 - Maurits Evers
抱歉,我现在已经包含它了。 - Rspacer
1个回答

4

我不确定您想绘制什么。但是我可以向您展示如何计算零膨胀负二项模型的预测置信区间。

请注意,我无法评论拟合的质量或拟合这种模型是否有意义。要回答这些问题需要具备领域特定的知识,而我没有。

  1. Let's start by fitting the model to your data.

    library(pscl)
    model <- formula(repro ~ fertilizer + level | fertilizer * level)
    modelzinb <- zeroinfl(model, dist = "negbin", link = "logit", data = df)
    
  2. We can use predict to get model estimates for the response.

    resp <- predict(modelzinb)
    

    Note that in your zero-inflated NB model, predict.zeroinfl (by default) returns the estimated mean response on the scale of the observed counts repro.

    Concerning confidence intervals (CIs), the "issue" here is that precict.zeroinfl does not have an interval argument that allows to calculate CIs directly. On a side note: It seems that pscl::zeroinfl used to include that functionality, see the documentation for version 0.54. Perhaps it would be worth contacting the package maintainer(s) about this.

    The bottom line is, we have to find a different way to calculate prediction CIs. To do so, we use bootstrapping. The R library boot provides all the necessary functions & tools to do that.

  3. To start, we can use the function boot::boot to generate bootstrap replicates of the predicted responses.boot::boot needs a function that generates the quantity of interest (here the predicted response) based on data. So we first define such a function stat. In this particular case stat must take two arguments: the first argument is the original data, the second argument is a vector of row indices (that define the random bootstrap sample of the data). The function will then use bootstrapped data to fit the model, and then use the model to predict the mean response based on the full data.

    stat <- function(df, inds) {
        model <- formula(repro ~ fertilizer + level | fertilizer * level)
        predict(
            zeroinfl(model, dist = "negbin", link = "logit", data = df[inds, ]),
            newdata = df)
    }
    

    In order to ensure reproducibility of results, we set a a fixed random seed and generate 100 bootstrap samples for the estimated mean response

    set.seed(2018)
    library(boot)
    res <- boot(df, stat, R = 100)
    

    The output object res is a list that contains the estimated mean response for the full data in element t0 (verify with all.equal(res$t0, predict(modelzinb))) and the estimated mean response for the bootstrap samples in element t (which is a matrix of dimension R x nrow(df), where R are the number of bootstrap samples).

  4. All that's left to do is to calculate a confidence interval from the estimated mean responses based on the models fitted to the bootstrap samples. To do so we can use the handy function boot::boot.ci. The function allows to calculate different types of CIs (basic, normal, BCa, etc.). Here we use "basic" for demonstration purposes. I do not claim that this is the best choice.

    boot::boot.ci takes an index argument which corresponds to the entry of the estimated mean response vector. The actual intervals are stored in the last 2 columns (column 4 and 5) of the matrix stored as a list element of the boot.ci return object.

    CI <- setNames(as.data.frame(t(sapply(1:nrow(df), function(row)
        boot.ci(res, conf = 0.95, type = "basic", index = row)$basic[, 4:5]))),
        c("lower", "upper"))
    
  5. Now we're done, and we can column-bind the CIs to the original data including predicted mean response values.

    df_all <- cbind.data.frame(df, response = predict(modelzinb), CI)
    #   fertilizer level repro response      lower    upper
    #1           N   low     0 29.72057   5.876731 46.48165
    #2           N   low    90 29.72057   5.876731 46.48165
    #3           N  high     2 21.99345 -15.228956 38.86421
    #4           N  high     4 21.99345 -15.228956 38.86421
    #5           N   low     0 29.72057   5.876731 46.48165
    #6           N   low    80 29.72057   5.876731 46.48165
    #7           N  high     1 21.99345 -15.228956 38.86421
    #8           N  high    90 21.99345 -15.228956 38.86421
    #9           N   low     2 29.72057   5.876731 46.48165
    #10          N   low    33 29.72057   5.876731 46.48165
    #11          N  high    56 21.99345 -15.228956 38.86421
    #12          N  high     0 21.99345 -15.228956 38.86421
    #13          P   low    99 24.22626  -9.225827 46.17656
    #14          P   low   100 24.22626  -9.225827 46.17656
    #15          P  high    66 22.71826   2.595246 39.88333
    #16          P  high    80 22.71826   2.595246 39.88333
    #17          P   low     1 24.22626  -9.225827 46.17656
    #18          P   low     0 24.22626  -9.225827 46.17656
    #19          P  high     2 22.71826   2.595246 39.88333
    #20          P  high    33 22.71826   2.595246 39.88333
    #21          P   low     0 24.22626  -9.225827 46.17656
    #22          P   low     0 24.22626  -9.225827 46.17656
    #23          P  high     1 22.71826   2.595246 39.88333
    #24          P  high     2 22.71826   2.595246 39.88333
    #25          N   low    90 29.72057   5.876731 46.48165
    #26          N   low     5 29.72057   5.876731 46.48165
    #27          N  high     2 21.99345 -15.228956 38.86421
    #28          N  high     2 21.99345 -15.228956 38.86421
    #29          N   low     5 29.72057   5.876731 46.48165
    #30          N   low     8 29.72057   5.876731 46.48165
    #31          N  high     0 21.99345 -15.228956 38.86421
    #32          N  high     1 21.99345 -15.228956 38.86421
    #33          N   low    90 29.72057   5.876731 46.48165
    #34          N   low     2 29.72057   5.876731 46.48165
    #35          N  high     4 21.99345 -15.228956 38.86421
    #36          N  high    66 21.99345 -15.228956 38.86421
    #37          P   low     0 24.22626  -9.225827 46.17656
    #38          P   low     0 24.22626  -9.225827 46.17656
    #39          P  high     0 22.71826   2.595246 39.88333
    #40          P  high     0 22.71826   2.595246 39.88333
    #41          P   low     1 24.22626  -9.225827 46.17656
    #42          P   low     2 24.22626  -9.225827 46.17656
    #43          P  high    90 22.71826   2.595246 39.88333
    #44          P  high     5 22.71826   2.595246 39.88333
    #45          P   low     2 24.22626  -9.225827 46.17656
    #46          P   low     5 24.22626  -9.225827 46.17656
    #47          P  high     8 22.71826   2.595246 39.88333
    #48          P   low    55 24.22626  -9.225827 46.17656
    #...
    

示例数据

df <- data.frame(
    fertilizer = c("N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P","N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P"),
    level = c("low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","low"),
    repro = c(0,90,2,4,0,80,1,90,2,33,56,0,99,100,66,80,1,0,2,33,0,0,1,2,90,5,2,2,5,8,0,1,90,2,4,66,0,0,0,0,1,2,90,5,2,5,8,55))

谢谢!这可能很琐碎,但是您已经为每个点计算了置信区间。我想要预测均值和使用ggplot2绘图的CI。我是否需要对CI进行平均以获得预测均值的CI? - Rspacer
@Biotechgeek 我计算了预测平均响应的 置信区间(而不是“每个点都有置信区间”,无论那是什么;-)。这正是自助法的整个重点。我不明白你试图绘制什么。fertilizer 是一个分类变量,而你在原始帖子中展示的没有置信区间的图形对我来说没有太多意义。但可能是因为我不知道你所测量的背景、你为什么要建模以及你想展示什么。我能做的就是向你展示如何计算 预测置信区间 - Maurits Evers
@Biotechgeek 你是在询问如何计算和绘制参数估计的置信区间吗? - Maurits Evers
我对inds参数不太清楚。从boot函数的文档中可以看到:"...statistic至少需要接受两个参数。传入的第一个参数将始终是原始数据,第二个参数将是定义引导样本的索引、频率或权重的向量。"但是,boot函数实际上如何选择用作inds的内容呢? - spops

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接