Is there a Python function or library for calculating binomial confidence intervals?


python statistics

28

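I need to calculate binomial confidence intervals for a large amount of data within a Python script. Do you know of any Python function or library that can do this?

Ideally, I would like something like the calculator at http://statpages.org/confint.html, implemented as a Python function.

Thanks for your time.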

- Geparada

4

Have you looked at Scipy, statsmodels and Pandas? (This is only a suggestion; I don't know whether they actually have what you want.) - Fred Foo

1

Does this help? http://math.stackexchange.com/questions/27518/how-to-calculate-a-confidence-interval-for-a-binomial-given-a-specific-prior - favoretti

@favoretti, I had found that post before, and I believe there are many ways to do it in R, but first I want to know whether there is any way to do it in Python. - Geparada

This paper may be of some help... - Andy Hayden

@hayden I already have it, thanks! - Geparada

In Python, you want to use statsmodels. Most of the functions you are looking for can be found here: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportion_confint.html - cgnorthcutt

9 Answers

11

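If you have the choice, I would say R (or another statistics package) would probably serve you better. That said, if you only need the binomial confidence interval, you probably don't need an entire library. Here is the function in my most naive translation from JavaScript.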

def binP(N, p, x1, x2):
    p = float(p)
    q = p/(1-p)
    k = 0.0
    v = 1.0
    s = 0.0
    tot = 0.0

    while k <= N:
        tot += v
        if k >= x1 and k <= x2:
            s += v
        # rescale the running sums to avoid floating-point overflow
        if tot > 10**30:
            s = s/10**30
            tot = tot/10**30
            v = v/10**30
        k += 1
        v = v*q*(N+1-k)/k
    return s/tot

def calcBin(vx, vN, vCL = 95):
    '''
    Calculate the exact confidence interval for a binomial proportion

    Usage:
    >>> calcBin(13,100)    
    (0.07107391357421874, 0.21204372406005856)
    >>> calcBin(4,7)   
    (0.18405151367187494, 0.9010086059570312)
    ''' 
    vx = float(vx)
    vN = float(vN)
    #Set the confidence bounds
    vTU = (100 - float(vCL))/2
    vTL = vTU

    vP = vx/vN
    if vx == 0:
        dl = 0.0
    else:
        # bisect for the lower limit
        v = vP/2
        vsL = 0
        vsH = vP
        p = vTL/100

        while (vsH-vsL) > 10**-5:
            if binP(vN, v, vx, vN) > p:
                vsH = v
                v = (vsL+v)/2
            else:
                vsL = v
                v = (v+vsH)/2
        dl = v

    if vx == vN:
        ul = 1.0
    else:
        # bisect for the upper limit
        v = (1+vP)/2
        vsL = vP
        vsH = 1
        p = vTU/100
        while (vsH-vsL) > 10**-5:
            if binP(vN, v, 0, vx) < p:
                vsH = v
                v = (vsL+v)/2
            else:
                vsL = v
                v = (v+vsH)/2
        ul = v
    return (dl, ul)

- Curt

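Thank you very much... I still haven't found any Python library that does this, so I'll use this code or R. Thanks! - Geparada

@Kurtis: Which JavaScript library did you originally use that provides this nice functionality? - Ahmed Fasih

I took it from the JavaScript code on the page linked in the question (i.e. http://statpages.org/confint.html). It isn't a whole library, just a function on that page. - Curt

You can use this Python package: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportion_confint.html - cgnorthcutt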
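For comparison, here is a minimal standard-library sketch of the same exact (Clopper-Pearson) interval, assuming Python 3.8+ for math.comb. The names binom_prob, solve and clopper_pearson are illustrative and do not come from the answer above. The binomial tail is summed directly, so this is only practical for a modest number of trials (a few hundred); very large n overflows when the binomial coefficient is converted to float.

```python
from math import comb


def binom_prob(n, p, lo, hi):
    """P(lo <= X <= hi) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))


def solve(f, target, increasing, iters=60):
    """Bisect on [0, 1] for the p where the monotone function f(p) equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        # For an increasing f, a value below target means the root lies above mid;
        # for a decreasing f it is the other way round.
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, conf=0.95):
    """Exact two-sided interval for x successes in n trials."""
    alpha = 1.0 - conf
    # Lower limit: the p at which P(X >= x) = alpha/2 (this tail increases with p).
    lower = 0.0 if x == 0 else solve(lambda p: binom_prob(n, p, x, n), alpha / 2, True)
    # Upper limit: the p at which P(X <= x) = alpha/2 (this tail decreases with p).
    upper = 1.0 if x == n else solve(lambda p: binom_prob(n, p, 0, x), alpha / 2, False)
    return lower, upper


print(clopper_pearson(13, 100))
print(clopper_pearson(4, 7))
```

The results should agree with calcBin to within its 10**-5 stopping tolerance.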

4

I'm not an expert in statistics, but binomtest is built into SciPy and produces the same results as the accepted answer:

from scipy.stats import binomtest

binomtest(13, 100).proportion_ci()
Out[11]: ConfidenceInterval(low=0.07107304618545972, high=0.21204067708744978)

binomtest(4, 7).proportion_ci()
Out[25]: ConfidenceInterval(low=0.18405156764007, high=0.9010117215575631)

By default it uses the Clopper-Pearson exact method, which matches Curt's answer. For comparison, that answer gives these values:

    Usage:
    >>> calcBin(13,100)    
    (0.07107391357421874, 0.21204372406005856)
    >>> calcBin(4,7)   
    (0.18405151367187494, 0.9010086059570312)

It also has options for the Wilson method, with or without continuity correction, which matches TheBamf's astropy answer:

binomtest(4, 7).proportion_ci(method='wilson')
Out[32]: ConfidenceInterval(low=0.2504583645276572, high=0.8417801447485302)

binom_conf_interval(4, 7, 0.95, interval='wilson')
Out[33]: array([0.25045836, 0.84178014])

This also matches R's binom.test and statsmodels.stats.proportion.proportion_confint, according to cxrodgers' comment:

For 30 successes in 60 trials, with Clopper-Pearson, both R's binom.test and statsmodels.stats.proportion.proportion_confint give (.37, .63).

binomtest(30, 60).proportion_ci(method='exact')
Out[34]: ConfidenceInterval(low=0.3680620319424367, high=0.6319379680575633)

- endolith
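Since the question is about many data points, the call above can be wrapped in a small loop. This is only a sketch: exact_intervals is a hypothetical helper name, and it assumes SciPy 1.7 or later, where binomtest and proportion_ci(confidence_level=..., method=...) are available.

```python
from scipy.stats import binomtest


def exact_intervals(pairs, confidence_level=0.95):
    """Return a (low, high) Clopper-Pearson interval for each (successes, trials) pair."""
    results = []
    for k, n in pairs:
        ci = binomtest(k, n).proportion_ci(
            confidence_level=confidence_level, method="exact"
        )
        results.append((ci.low, ci.high))
    return results


data = [(13, 100), (4, 7), (30, 60)]
for (k, n), (low, high) in zip(data, exact_intervals(data, confidence_level=0.99)):
    print(f"{k}/{n}: [{low:.4f}, {high:.4f}]")
```

Passing method="wilson" or method="wilsoncc" instead selects the Wilson interval without or with continuity correction.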

3

Here is a simple implementation of the exact (Clopper-Pearson) confidence interval for a binomial proportion.

def binomial_ci(x, n, alpha=0.05):
    #x is number of successes, n is number of trials
    from scipy import stats
    if x==0:
        c1 = 0
    else:
        c1 = stats.beta.interval(1-alpha, x,n-x+1)[0]
    if x==n:
        c2=1
    else:
        c2 = stats.beta.interval(1-alpha, x+1,n-x)[1]
    return c1, c2

You can check the code with:

p1,p2 = binomial_ci(2,7)
from scipy import stats
assert abs(stats.binom.cdf(1,7,p1)-.975)<1E-5
assert abs(stats.binom.cdf(2,7,p2)-.025)<1E-5
assert abs(binomial_ci(0,7, alpha=.1)[0])<1E-5
assert abs((1-binomial_ci(0,7, alpha=.1)[1])**7-0.05)<1E-5
assert abs(binomial_ci(7,7, alpha=.1)[1]-1)<1E-5
assert abs((binomial_ci(7,7, alpha=.1)[0])**7-0.05)<1E-5

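I used the relationship between the binomial proportion confidence interval and the regularized incomplete Beta function, as described here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval.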

- Dani
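Because the Beta quantile function in SciPy broadcasts over arrays, the same relationship can be applied to a whole column of counts at once. The sketch below is one way to do that; clopper_pearson_vec is a hypothetical name, and the edge cases x == 0 and x == n are patched afterwards because beta.ppf returns nan for a zero shape parameter.

```python
import numpy as np
from scipy.stats import beta


def clopper_pearson_vec(x, n, alpha=0.05):
    """Vectorized exact interval; x and n are array-likes of successes and trials."""
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)
    # Lower limit: alpha/2 quantile of Beta(x, n - x + 1)
    lower = beta.ppf(alpha / 2, x, n - x + 1)
    # Upper limit: 1 - alpha/2 quantile of Beta(x + 1, n - x)
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x)
    # The Beta parameters are invalid at the boundaries, so set those limits explicitly.
    lower = np.where(x == 0, 0.0, lower)
    upper = np.where(x == n, 1.0, upper)
    return lower, upper


lower, upper = clopper_pearson_vec([13, 4, 0, 7], [100, 7, 7, 7])
print(lower)
print(upper)
```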

3

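Although the scipy.stats module has an .interval() method for computing equal-tailed intervals, it lacks a similar method for computing the highest density interval. Here is a rough way to do it using methods from scipy and numpy.

This solution also assumes that you want to use a Beta distribution as the prior. The hyperparameters a and b are set to 1, so that the default prior is a uniform distribution between 0 and 1.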

import numpy
from scipy.stats import beta
from scipy.stats import norm

def binomial_hpdr(n, N, pct, a=1, b=1, n_pbins=1e3):
    """
    Function computes the posterior mode along with the upper and lower bounds of the
    **Highest Posterior Density Region**.

    Parameters
    ----------
    n: number of successes 
    N: sample size 
    pct: the size of the confidence interval (between 0 and 1)
    a: the alpha hyper-parameter for the Beta distribution used as a prior (Default=1)
    b: the beta hyper-parameter for the Beta distribution used as a prior (Default=1)
    n_pbins: the number of bins to segment the p_range into (Default=1e3)

    Returns
    -------
    A tuple that contains the mode as well as the lower and upper bounds of the interval
    (mode, lower, upper)

    """
    # fixed random variable object for posterior Beta distribution
    rv = beta(n+a, N-n+b)
    # determine the mode and standard deviation of the posterior
    stdev = rv.stats('v')**0.5
    mode = (n+a-1.)/(N+a+b-2.)
    # compute the number of sigma that corresponds to this confidence
    # this is used to set the rough range of possible success probabilities
    n_sigma = numpy.ceil(norm.ppf( (1+pct)/2. ))+1
    # set the min and max values for success probability 
    max_p = mode + n_sigma * stdev
    if max_p > 1:
        max_p = 1.
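    min_p = mode - n_sigma * stdev
    # clamp the lower end of the range at 0 (probabilities cannot be negative)
    if min_p < 0:
        min_p = 0.
    # make the range of success probabilities
    # (linspace needs an integer count, so cast the float default n_pbins=1e3)
    p_range = numpy.linspace(min_p, max_p, int(n_pbins)+1)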
    # construct the probability mass function over the given range
    if mode > 0.5:
        sf = rv.sf(p_range)
        pmf = sf[:-1] - sf[1:]
    else:
        cdf = rv.cdf(p_range)
        pmf = cdf[1:] - cdf[:-1]
    # find the upper and lower bounds of the interval 
    sorted_idxs = numpy.argsort( pmf )[::-1]
    cumsum = numpy.cumsum( numpy.sort(pmf)[::-1] )
    j = numpy.argmin( numpy.abs(cumsum - pct) )
    upper = p_range[ (sorted_idxs[:j+1]).max()+1 ]
    lower = p_range[ (sorted_idxs[:j+1]).min() ]    

    return (mode, lower, upper)

- mtw729

2

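If I run binomial_confidence(0,0,0.5,a=2,b=2) (i.e. this distribution), it should output a symmetric interval around 0.5, but it outputs (0.0, 0.447). - Amelio Vazquez-Reina

For these parameters it also outputs the wrong mode, 0. The correct mode should be 0.5. - Amelio Vazquez-Reina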

2

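Thanks for the bug report. The problem was in the computation of the mode, which assumed that at least one of the variables (n, N, a, b) was a float. I now realize that this assumption should not be made, so I have changed the offending line accordingly. Now binomial_confidence(0, 0, 0.5, a=2, b=2) returns (0.5, 0.32648112494601628, 0.67351887505398356). - mtw729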

2

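Thanks @mtw729 - to clarify, does the code you wrote compute the highest posterior density region or the central credible interval? See the definitions here: https://dev59.com/n2Eh5IYBdhLWcg3wjEDn. - Amelio Vazquez-Reina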

2

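This is code to compute the HPD region (as stated in the description at the top of my answer). Sorry that the function name may not be very clear. - mtw729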
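For contrast with the HPD region, the central (equal-tailed) credible interval under the same Beta prior can be read straight off the posterior with .interval(). This is a minimal sketch; central_credible_interval is a hypothetical name, and the argument order simply mirrors the function in the answer.

```python
from scipy.stats import beta


def central_credible_interval(n, N, pct, a=1, b=1):
    """Equal-tailed credible interval for n successes in N trials, Beta(a, b) prior."""
    posterior = beta(n + a, N - n + b)
    # interval(pct) cuts (1 - pct)/2 of the posterior probability from each tail
    return posterior.interval(pct)


print(central_credible_interval(0, 0, 0.5, a=2, b=2))
```

For a symmetric posterior such as Beta(2, 2) the two kinds of interval coincide, so any difference from the HPD output in that case comes from the binning. For skewed posteriors the HPD region is the shorter of the two.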


3

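I just tried this myself. If it helps, here is my solution, which takes two lines of code and seems to give results equivalent to that JS page. This is the frequentist one-sided interval. I call the input argument the MLE (maximum likelihood estimate) of the binomial parameter theta, i.e. mle = number of successes / number of trials. I find the upper bound of the one-sided interval, so the alpha value used here is double the one the JS page uses for the upper bound.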

from scipy.stats import binom
from scipy.optimize import bisect

def binomial_ci( mle, N, alpha=0.05 ):
    """
    One sided confidence interval for a binomial test.

    If after N trials we obtain mle as the proportion of those
    trials that resulted in success, find c such that

    P(k/N < mle; theta = c) = alpha

    where k/N is the proportion of successes in the set of trials,
    and theta is the success probability for each trial. 
    """


    to_minimise = lambda c: binom.cdf(mle*N,N,c)-alpha
    return bisect(to_minimise,0,1)

To find the two-sided interval, call it with (1-alpha/2) and alpha/2 as the alpha argument.

- James Thorniley

1

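Interesting! However, I get slightly odd results with your method. For 30 successes in 60 trials, both R's binom.test and statsmodels.stats.proportion.proportion_confint give a confidence interval of (.37, .63) using Clopper-Pearson. Your method gives (.38, .63). The upper bound agrees to the 10th decimal place, but the lower bound is quite different. Any idea why? - cxrodgers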
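One likely explanation for the lower-bound discrepancy: solving binom.cdf(x, N, c) = 1 - alpha/2 is the same as solving P(X > x) = alpha/2, whereas the Clopper-Pearson lower limit is defined by P(X >= x) = alpha/2, which corresponds to binom.cdf(x - 1, N, c) = 1 - alpha/2. The sketch below applies that shift; two_sided_ci is a hypothetical name, and it takes the success count x directly rather than the proportion.

```python
from scipy.optimize import bisect
from scipy.stats import binom


def two_sided_ci(x, N, alpha=0.05):
    """Two-sided Clopper-Pearson interval for x successes in N trials."""
    if x == 0:
        lower = 0.0
    else:
        # P(X >= x; c) = alpha/2  <=>  cdf(x - 1; c) = 1 - alpha/2
        lower = bisect(lambda c: binom.cdf(x - 1, N, c) - (1 - alpha / 2), 0, 1)
    if x == N:
        upper = 1.0
    else:
        # P(X <= x; c) = alpha/2
        upper = bisect(lambda c: binom.cdf(x, N, c) - alpha / 2, 0, 1)
    return lower, upper


print(two_sided_ci(30, 60))
```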

2

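I needed to do this as well. I had been using R and wanted to learn a way to work it out for myself. I would not say it is strictly Pythonic.

The docstring explains most of it. It assumes that you have scipy installed.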

def exact_CI(x, N, alpha=0.95):
    """
    Calculate the exact confidence interval of a proportion 
    where there is a wide range in the sample size or the proportion.

    This method avoids the assumption that data are normally distributed. The sample size
    and proportion are described by a beta distribution.

    Parameters
    ----------

    x: the number of cases from which the proportion is calculated as a positive integer.

    N: the sample size as a positive integer.

    alpha : set at 0.95 for 95% confidence intervals.

    Returns
    -------
    The proportion with the lower and upper confidence intervals as a dict.

    """
    from scipy.stats import beta
    x = float(x)
    N = float(N)
    p = round((x/N)*100,2)

    intervals = [round(i,4)*100 for i in beta.interval(alpha,x,N-x+1)]
    intervals.insert(0,p)

    result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}

    return result

- John

1

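If you run exact_CI(7, 7), for example, the upper CI is less than 100%, which makes no sense. - Aaron Silverman

It does make sense: just because 7 out of 7 attempts succeeded does not mean you are 100% confident in the probability of success. Remember that exact_CI(7, 7) returns a 95% confidence interval. - quant_dev

@John Why did you use Beta(1,1) as the prior rather than Beta(1/2,1/2) (the Jeffreys prior)? - quant_dev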

2

Using the Wilson score interval and an approximation to the inverse normal CDF, you can compute a binomial confidence interval without numpy/scipy.

import math

def binconf(p, n, c=0.95):
  '''
  Calculate binomial confidence interval based on the number of positive and
  negative events observed.

  Parameters
  ----------
  p: int
      number of positive events observed
  n: int
      number of negative events observed
  c : optional, [0,1]
      confidence percentage. e.g. 0.95 means 95% confident the probability of
      success lies between the 2 returned values

  Returns
  -------
  theta_low  : float
      lower bound on confidence interval
  theta_high : float
      upper bound on confidence interval
  '''
  p, n = float(p), float(n)
  N    = p + n

  if N == 0.0: return (0.0, 1.0)

  p = p / N
  z = normcdfi(1 - 0.5 * (1-c))

  a1 = 1.0 / (1.0 + z * z / N)
  a2 = p + z * z / (2 * N)
  a3 = z * math.sqrt(p * (1-p) / N + z * z / (4 * N * N))

  return (a1 * (a2 - a3), a1 * (a2 + a3))


def erfi(x):
  """Approximation to inverse error function"""
  a  = 0.147  # MAGIC!!!
  a1 = math.log(1 - x * x)
  a2 = (
    2.0 / (math.pi * a)
    + a1 / 2.0
  )

  return (
    sign(x) *
    math.sqrt( math.sqrt(a2 * a2 - a1 / a) - a2 )
  )


def sign(x):
  if x  < 0: return -1
  if x == 0: return  0
  if x  > 0: return  1


def normcdfi(p, mu=0.0, sigma2=1.0):
  """Inverse CDF of normal distribution"""
  if mu == 0.0 and sigma2 == 1.0:
    return math.sqrt(2) * erfi(2 * p - 1)
  else:
    return mu + math.sqrt(sigma2) * normcdfi(p)

- duckworthd

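Great solution! No dependencies, and it implements the very useful Wilson score. In my tests, the approximation here seemed accurate to at least three decimal places, which is well worth the lack of dependencies. - SMX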
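On Python 3.8 and later, the standard library's statistics.NormalDist provides the normal quantile directly, so the hand-written inverse error function approximation can be dropped while still avoiding third-party dependencies. A minimal sketch follows; wilson_interval is a hypothetical name, and unlike binconf above it takes successes and total trials rather than positive and negative counts.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, trials, conf=0.95):
    """Wilson score interval (no continuity correction), standard library only."""
    if trials == 0:
        return (0.0, 1.0)
    # two-sided critical value, e.g. about 1.96 for conf=0.95
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return (centre - half_width, centre + half_width)


print(wilson_interval(4, 7))
```

For 4 successes in 7 trials this should agree with the Wilson values quoted from SciPy and astropy earlier in the thread, roughly (0.2505, 0.8418).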

1

Astropy provides such a function (although installing and importing astropy may be a bit excessive): astropy.stats.binom_conf_interval.

- TheBamf


- Danica · Accepted Answer

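Just noting, because it hasn't been posted elsewhere here, that statsmodels.stats.proportion.proportion_confint lets you get a binomial confidence interval with a variety of methods. It only does symmetric intervals, though.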
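A short usage sketch, assuming statsmodels is installed. The method names listed are ones documented for proportion_confint; method="beta" is the Clopper-Pearson interval.

```python
from statsmodels.stats.proportion import proportion_confint

# 30 successes out of 60 trials, 95% interval (alpha = 0.05)
for method in ("normal", "wilson", "agresti_coull", "jeffreys", "beta"):
    low, high = proportion_confint(count=30, nobs=60, alpha=0.05, method=method)
    print(f"{method:>14}: [{low:.4f}, {high:.4f}]")
```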