Weighted standard deviation in NumPy

119

The numpy.average() function has a weights option, but numpy.std() does not. Does anyone have a workaround?


2
As an aside, computing the weighted standard deviation is actually a fairly involved topic: there is more than one way to do it. See the excellent discussion here: https://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/ - JohnE
http://www.ccgalberta.com/pygeostat/statistics.html#weighted-statistics - e271p314
7 Answers

187
What do you think of the following short "manual calculation"?
import math
import numpy

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    The weights are in effect first normalized so that they
    sum to 1 (and so they must not all be 0).

    values, weights -- NumPy ndarrays with the same shape.
    """
    average = numpy.average(values, weights=weights)
    # Fast and numerically precise:
    variance = numpy.average((values - average) ** 2, weights=weights)
    return (average, math.sqrt(variance))
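A quick usage check (the function is repeated here so the snippet runs on its own; the sample values are made up): with uniform weights the result reduces to the ordinary mean and numpy.std, which computes the population standard deviation.

```python
import math
import numpy

def weighted_avg_and_std(values, weights):
    """Return the weighted average and standard deviation."""
    average = numpy.average(values, weights=weights)
    variance = numpy.average((values - average) ** 2, weights=weights)
    return (average, math.sqrt(variance))

values = numpy.array([1.0, 2.0, 3.0, 4.0])

# Uniform weights reduce to the ordinary mean and (population) std:
avg, std = weighted_avg_and_std(values, numpy.ones_like(values))
print(avg, std)  # 2.5 and the same value as numpy.std(values)
```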

    

6
Why not just use numpy.average again for the variance? - user2357112
5
Just want to point out that this yields the biased variance. For small sample sizes, you may want to rescale the variance (before taking the square root) to get the unbiased variance. See https://en.wikipedia.org/wiki/Weighted_variance#Weighted_sample_variance - Corey
1
Yes, the unbiased variance estimator would be slightly different. This answer gives the standard deviation, since the question asks for a weighted version of numpy.std(). - Eric O. Lebigot
2
np.sqrt() works, but since variance is a simple (NumPy) float (and not a NumPy array), math.sqrt() is more explicit and appropriate (and typically faster, if that matters). - Eric O. Lebigot
1
I wanted to contradict @EricO.Lebigot, but the facts are indeed surprising: %timeit math.sqrt(1.5) gives 61.6 ns ± 0.661 ns per loop, while %timeit np.sqrt(1.5) gives 554 ns ± 14.5 ns per loop. math.sqrt is almost 10x faster than numpy (Python 3.10.9, numpy v1.23.3). - olq_plo

54

statsmodels has a class that makes it easy to compute weighted statistics: statsmodels.stats.weightstats.DescrStatsW

Assume the following dataset and weights:

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

array = np.array([1,2,1,2,1,2,1,3])
weights = np.ones_like(array)
weights[3] = 100

You initialize the class (note that you have to pass in the correction factor, the delta degrees of freedom, at this point):

weighted_stats = DescrStatsW(array, weights=weights, ddof=0)

Then you can calculate:

  • .mean the weighted mean:

    >>> weighted_stats.mean
    1.97196261682243

  • .std the weighted standard deviation:

    >>> weighted_stats.std
    0.21434289609681711

  • .var the weighted variance:

    >>> weighted_stats.var
    0.045942877107170932

  • .std_mean the standard error of the weighted mean:

    >>> weighted_stats.std_mean
    0.020818822467555047


    If you are interested in the relation between the standard error and the standard deviation: the standard error is (for ddof == 0) calculated as the weighted standard deviation divided by the square root of the sum of the weights minus 1 (corresponding source for statsmodels version 0.9 on GitHub):

    standard_error = standard_deviation / sqrt(sum(weights) - 1)
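    The numbers above can be reproduced with plain NumPy, which makes the ddof == 0 relationship explicit (a sketch using the example data from this answer, not statsmodels itself):

```python
import numpy as np

array = np.array([1, 2, 1, 2, 1, 2, 1, 3])
weights = np.ones_like(array)
weights[3] = 100

mean = np.average(array, weights=weights)
var = np.average((array - mean) ** 2, weights=weights)
std = np.sqrt(var)
std_mean = std / np.sqrt(weights.sum() - 1)

print(mean)      # 1.97196261682243
print(std)       # 0.21434289609681711
print(std_mean)  # 0.020818822467555047
```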
    

    To easily calculate the weighted coefficient of variation with this approach, see this answer. - Asclepius

    42

    Here is one more option:

    np.sqrt(np.cov(values, aweights=weights))
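    For instance (a sketch with made-up numbers), this agrees with the "manual" weighted formula on 1-D data when np.cov is called with ddof=0:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([1.0, 1.0, 4.0, 1.0])

cov_std = np.sqrt(np.cov(values, aweights=weights, ddof=0))

# Manual weighted standard deviation for comparison:
mean = np.average(values, weights=weights)
manual_std = np.sqrt(np.average((values - mean) ** 2, weights=weights))

print(cov_std, manual_std)  # identical
```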
    

    This is the shortest and most elegant of the solutions given. - Itamar Mushkin
    1
    This is indeed clean. Worth noting, though, that this solution is limited to 1-D weights and values, so it is less general than the accepted answer. If np.cov is called with ddof=0, the two calculations agree on 1-D data. - Eric O. Lebigot

    6

    There doesn't appear to be such a function in numpy/scipy yet, but there is a ticket proposing this added functionality. Included there you will find Statistics.py, which implements weighted standard deviations.


    2

    There is a very good example proposed by gaborous:

    import pandas as pd
    import numpy as np
    # X is the dataset, as a Pandas DataFrame
    # w is the NumPy array of weights
    # Compute the weighted sample mean (fast, efficient and precise)
    mean = np.ma.average(X, axis=0, weights=w)

    # Convert to a Pandas Series (it's just aesthetic and more
    # ergonomic; no difference in computed values)
    mean = pd.Series(mean, index=list(X.keys()))
    xm = X - mean  # xm = X diff to mean
    # Fill NaN with 0: a variance of 0 is just void, but at least it
    # keeps the other covariance values computed correctly
    xm = xm.fillna(0)
    # Compute the unbiased weighted sample covariance
    sigma2 = 1. / (w.sum() - 1) * xm.mul(w, axis=0).T.dot(xm)
    

    Correct equation for weighted unbiased sample covariance, URL (version: 2016-06-28)
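    A small self-contained run of the recipe above, with made-up data; the diagonal of sigma2 holds the unbiased weighted variances:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})
w = np.array([1.0, 2.0, 1.0])

mean = pd.Series(np.ma.average(X, axis=0, weights=w), index=list(X.keys()))
xm = (X - mean).fillna(0)
sigma2 = 1. / (w.sum() - 1) * xm.mul(w, axis=0).T.dot(xm)

print(sigma2)
# Variance of column "a": sum(w * (a - 2)**2) / (4 - 1) = 2/3
```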


    The provenance of the last-line calculation could use some explanation. - Paidoo

    1

    This is a follow-up on the "sample" or "unbiased" standard deviation in the context of "frequency weights", since "weighted sample standard deviation python" leads to this post on a Google search.

    from math import sqrt

    def frequency_sample_std_dev(X, n):
        """
        Sample standard deviation for X and n,
        where X[i] is the quantity each person in group i has,
        and n[i] is the number of people in group i.
        See Equation 6.4 of:
        Montgomery, Douglas, C. and George C. Runger. Applied Statistics
         and Probability for Engineers, Enhanced eText. Available from:
          WileyPLUS, (7th Edition). Wiley Global Education US, 2018.
        """
        n_people = sum(n)
        lhs_numerator = sum([ni * Xi**2 for Xi, ni in zip(X, n)])
        rhs_numerator = sum([Xi * ni for Xi, ni in zip(X, n)])**2 / n_people
        denominator = n_people - 1
        var = (lhs_numerator - rhs_numerator) / denominator
        std = sqrt(var)
        return std
    

    Or, modifying @Eric's answer:

    import numpy as np
    from math import sqrt

    def weighted_sample_avg_std(values, weights):
        """
        Return the weighted average and weighted sample standard deviation.

        values, weights -- NumPy ndarrays with the same shape.

        Assumes that weights contains only integers (e.g. how many samples in each group).

        See also https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Frequency_weights
        """
        average = np.average(values, weights=weights)
        variance = np.average((values - average) ** 2, weights=weights)
        variance = variance * sum(weights) / (sum(weights) - 1)
        return (average, sqrt(variance))

    print(weighted_sample_avg_std(X, n))
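    As a sanity check (with made-up group data): frequency weights behave like repeating each value, so this formula matches np.std with ddof=1 on the expanded array:

```python
import numpy as np

X = np.array([3.0, 5.0, 8.0])  # value per person in each group
n = np.array([2, 3, 5])        # number of people in each group

avg = np.average(X, weights=n)
var = np.average((X - avg) ** 2, weights=n) * n.sum() / (n.sum() - 1)
std = np.sqrt(var)

# Expanding the groups gives the same sample standard deviation:
expanded = np.repeat(X, n)
print(std, np.std(expanded, ddof=1))  # identical
```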
    

    1
    Thanks for the great answer! However, regarding your second function weighted_sample_avg_std(): in the third line, where you apply the second part of the variance formula, the variance should not be multiplied by the ratio of the sums but by a ratio involving the number of nonzero weights (https://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf). - DouglasCoenen
    Hmm... that's a good point. Would you mind proposing an edit? I looked into this after your comment, but the actual change wasn't obvious to me. - Sterling

    0

    I was just searching for an API equivalent of numpy's np.std function that also allows the axis parameter to be set:

    (I only tested it with two dimensions, so feel free to improve it if something is wrong.)

    import numpy as np

    def std(values, weights=None, axis=None):
        """
        Return the weighted standard deviation.

        axis -- the axis for the std calculation
        values, weights -- NumPy ndarrays with the same shape along the given axis.
        """
        average = np.average(values, weights=weights, axis=axis)
        if axis is not None:
            # np.expand_dims does not accept axis=None
            average = np.expand_dims(average, axis=axis)
        # Fast and numerically precise:
        variance = np.average((values - average) ** 2, weights=weights, axis=axis)
        return np.sqrt(variance)
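    A standalone run for a quick check (the function is repeated here, with a small guard so axis=None also works; the sample array is made up). With uniform weights, the result matches np.std along the same axis:

```python
import numpy as np

def std(values, weights=None, axis=None):
    """Weighted standard deviation along the given axis."""
    average = np.average(values, weights=weights, axis=axis)
    if axis is not None:
        average = np.expand_dims(average, axis=axis)
    variance = np.average((values - average) ** 2, weights=weights, axis=axis)
    return np.sqrt(variance)

values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
weights = np.ones(3)

# With uniform weights this matches np.std along the same axis:
print(std(values, weights, axis=0))
print(np.std(values, axis=0))  # same values
```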
    

    Thanks to Eric O. Lebigot for the original answer.

