Fitting empirical distribution to theoretical ones with Scipy (Python)?

Question

194

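INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...], sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn't matter for this problem.

PROBLEM: Based on my distribution, I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, as you can see, the p-value for 0 would approach 1 and the p-value for higher numbers would tend to 0.

I don't know if I am right, but to determine probabilities I think I need to fit my data to the theoretical distribution that is most suitable to describe it. I assume that some kind of goodness-of-fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?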

- s_sherly

3

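So you have only discrete empirical values but want a continuous distribution? Do I understand that correctly? - Michael J. Barber

1

It seems nonsensical. What do the numbers represent? Measurements with limited precision? - Michael J. Barber

1

Michael, I explained what the numbers represent in my previous question: https://dev59.com/E2w15IYBdhLWcg3wbbRx - s_sherly

6

That's count data. It's not a continuous distribution. - Michael J. Barber

If you want to see what all the distributions look like or have an idea of how to access them, check out this answer. - tmthydvnprt

1

Please check the accepted answer to this question: https://dev59.com/BKjka4cB1Zd3GeqPAIil - Ahmad Senousi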

13 Answers

191

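There are more than 90 implemented distribution functions in SciPy v1.6.0. You can test how some of them fit your data using their fit() method. Check the code below for more details: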

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

size = 30000
x = np.arange(size)
# Synthetic integer data: von Mises samples scaled by 47 and rounded
y = np.round(scipy.stats.vonmises.rvs(5, size=size) * 47).astype(int)
h = plt.hist(y, bins=range(48))

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    # fit() returns (shape parameters..., loc, scale)
    params = dist.fit(y)
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    # Scale the PDF by the sample size so it is comparable to the count histogram
    pdf_fitted = dist.pdf(x, *arg, loc=loc, scale=scale) * size
    plt.plot(pdf_fitted, label=dist_name)
    plt.xlim(0, 47)
plt.legend(loc='upper right')
plt.show()

And here is a list with the names of all distribution functions available in Scipy 0.12.0 (VI):

dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy']

- Saullo G. P. Castro

9

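What if normed=True when plotting the histogram? You wouldn't multiply pdf_fitted by size, right? - aloha

4

If you want to see what all the distributions look like or have an idea of how to access them, check out this answer. - tmthydvnprt

@SaulloCastro What do the 3 values in param represent, in the output of dist.fit? - shaifali Gupta

5

To get distribution names: from scipy.stats._continuous_distns import _distn_names. You can then use something like getattr(scipy.stats, distname) for each distname in _distn_names. Useful because the distributions are updated with different SciPy versions. - Brad Solomon

1

@Luigi87 just use the rvs() function of each distribution, here in the code represented as the dist object. - Saullo G. P. Castro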

20

You can try the distfit library. If you have more questions, let me know; I am also the developer of this open-source library.

pip install distfit

import numpy as np

# Create 1000 random integers, values in [0, 49]
X = np.random.randint(0, 50, 1000)

# Retrieve P-value for y
y = [0,10,45,55,100]

# From the distfit library import the class distfit
from distfit import distfit

# Initialize.
# Set any properties here, such as alpha.
# The smoothing can be of use when working with integers. Otherwise your histogram
# may be jumping up-and-down, and getting the correct fit may be harder.
dist = distfit(alpha=0.05, smooth=10)

# Search for best theoretical fit on your empirical data
dist.fit_transform(X)

> [distfit] >fit..
> [distfit] >transform..
> [distfit] >[norm      ] [RSS: 0.0037894] [loc=23.535 scale=14.450] 
> [distfit] >[expon     ] [RSS: 0.0055534] [loc=0.000 scale=23.535] 
> [distfit] >[pareto    ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778] 
> [distfit] >[dweibull  ] [RSS: 0.0038202] [loc=24.535 scale=13.936] 
> [distfit] >[t         ] [RSS: 0.0037896] [loc=23.535 scale=14.450] 
> [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506] 
> [distfit] >[gamma     ] [RSS: 0.0037600] [loc=-175.505 scale=1.044] 
> [distfit] >[lognorm   ] [RSS: 0.0642364] [loc=-0.000 scale=1.802] 
> [distfit] >[beta      ] [RSS: 0.0021885] [loc=-3.981 scale=52.981] 
> [distfit] >[uniform   ] [RSS: 0.0012349] [loc=0.000 scale=49.000] 

# Best fitted model
best_distr = dist.model
print(best_distr)

# Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
> {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
>  'params': (0.0, 49.0),
>  'name': 'uniform',
>  'RSS': 0.0012349021241149533,
>  'loc': 0.0,
>  'scale': 49.0,
>  'arg': (),
>  'CII_min_alpha': 2.45,
>  'CII_max_alpha': 46.55}

# Ranking distributions
dist.summary

# Plot the summary of fitted distributions
dist.plot_summary()

# Make prediction on new datapoints based on the fit
dist.predict(y)

# Retrieve your pvalues with 
dist.y_pred
# array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
dist.y_proba
# array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])

# Or in one dataframe
dist.df

# The plot function will now also include the predictions of y
dist.plot()

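Note that in this case all points will be significant because of the uniform distribution. You can filter with dist.y_pred if required.

More detailed information and examples can be found in the documentation pages.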

- erdogant

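Are you the author? - jtlz2

2

For clarity, I added this to the answer. - erdogant

This helped me a lot! I tried it to get a quick look at my data's distribution and then confirmed with an MLE approach. By the way, do you plan to release an extension with an MLE method? - Fernando Barraza

Can you share your approach? Perhaps we can discuss it further here: https://github.com/erdogant/distfit/issues - erdogant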

19

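The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data can be determined in several different ways, such as:

1. The distribution that gives you the highest log likelihood.
2. The distribution that gives you the smallest AIC, BIC or BICc values (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion; these can basically be viewed as the log likelihood adjusted for the number of parameters, since a distribution with more parameters is expected to fit better).
3. The distribution that maximizes the Bayesian posterior probability (see wiki: http://en.wikipedia.org/wiki/Posterior_probability).

Of course, if you already have a distribution that should describe your data (based on the theories in your particular field) and want to stick to that, you will skip the step of identifying the best fit distribution. scipy does not come with a function to calculate log likelihood (although the MLE method is provided), but hard-coding one is easy: see Is the build-in probability density functions of `scipy.stat.distributions` slower than a user provided one?.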
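
As a rough sketch of criteria 1 and 2, the log likelihood can be obtained by summing logpdf at the fitted parameters, and AIC and BIC follow from it. The helper name information_criteria, the candidate list, and the synthetic gamma sample are assumptions made for illustration only; they are not part of the answer above.

```python
import numpy as np
import scipy.stats as st

def information_criteria(dist, data):
    """Fit `dist` by MLE and return (log-likelihood, AIC, BIC)."""
    params = dist.fit(data)                       # (shape parameters..., loc, scale)
    log_lik = np.sum(dist.logpdf(data, *params))  # log-likelihood at the fitted parameters
    k = len(params)                               # number of fitted parameters
    n = len(data)
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return log_lik, aic, bic

# Synthetic sample, used here only so the sketch runs on its own
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=3.0, size=2000)

# Hypothetical shortlist of candidates; replace with the distributions you care about
scores = {}
for name in ["gamma", "norm", "expon"]:
    scores[name] = information_criteria(getattr(st, name), sample)

best_by_aic = min(scores, key=lambda name: scores[name][1])
for name, (log_lik, aic, bic) in scores.items():
    print(f"{name:>6}: logL={log_lik:.1f}  AIC={aic:.1f}  BIC={bic:.1f}")
print("lowest AIC:", best_by_aic)
```

Lower AIC or BIC is better. Both penalize the number of fitted parameters, which is why they can rank candidates differently from the raw log likelihood.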

- CT Zhu

2

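How would I apply this method when the data has already been binned, that is, when it is already a histogram rather than something a histogram is generated from? - Pete

@pete, that would be a case of interval-censored data; there are maximum likelihood methods for it, but they are currently not implemented in scipy. - CT Zhu

Don't forget the evidence. - jtlz2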

8

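AFAIU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of the different values and normalizing them should be enough for your purposes. Here is an example to demonstrate this:

In []: import numpy as np
In []: values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
In []: counts = np.asarray(np.bincount(values), dtype=float)
In []: cdf = counts.cumsum() / counts.sum()

Thus, the probability of seeing values higher than 1 is simply (according to the complementary cumulative distribution function (ccdf)):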

In []: 1- cdf[1]
Out[]: 0.40000000000000002

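Please note that the ccdf is closely related to the survival function (sf), but it is also defined for discrete distributions, whereas sf is defined only for continuous distributions.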
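
The same idea can be wrapped in a small function that also handles values outside the observed range. This is a minimal sketch assuming non-negative integer data; the name tail_probability is made up for this example.

```python
import numpy as np

def tail_probability(values, x):
    """Empirical P(X > x) for a list of non-negative integer observations."""
    counts = np.bincount(values).astype(float)
    cdf = counts.cumsum() / counts.sum()
    if x < 0:
        return 1.0            # every observation is greater than a negative x
    if x >= len(cdf):
        return 0.0            # x is above the largest observed value
    return 1.0 - cdf[int(x)]  # complementary CDF at x

values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
for x in range(6):
    print(x, tail_probability(values, x))
```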

- eat

7

The following code is a version of the general answer, with corrections and clarifications.

import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
import matplotlib as mpl
import matplotlib.pyplot as plt
import math
import random

mpl.style.use("ggplot")

def danoes_formula(data):
    """
    DOANE'S FORMULA
    https://en.wikipedia.org/wiki/Histogram#Doane's_formula
    """
    N = len(data)
    skewness = st.skew(data)
    sigma_g1 = math.sqrt((6*(N-2))/((N+1)*(N+3)))
    num_bins = 1 + math.log(N,2) + math.log(1+abs(skewness)/sigma_g1,2)
    num_bins = round(num_bins)
    return num_bins

def plot_histogram(data, results, n):
    ## n first distribution of the ranking
    N_DISTRIBUTIONS = {k: results[k] for k in list(results)[:n]}

    ## Histogram of data
    plt.figure(figsize=(10, 5))
    plt.hist(data, density=True, ec='white', color=(63/235, 149/235, 170/235))
    plt.title('HISTOGRAM')
    plt.xlabel('Values')
    plt.ylabel('Frequencies')

    ## Plot n distributions
    for distribution, result in N_DISTRIBUTIONS.items():
        # print(i, distribution)
        sse = result[0]
        arg = result[1]
        loc = result[2]
        scale = result[3]
        x_plot = np.linspace(min(data), max(data), 1000)
        y_plot = distribution.pdf(x_plot, *arg, loc=loc, scale=scale)
        # Label each curve with the distribution's name and the first digits of its SSE
        plt.plot(x_plot, y_plot, label=distribution.name + ": " + str(sse)[0:6], color=(random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 1)))
    
    plt.legend(title='DISTRIBUTIONS', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.show()

def fit_data(data):
    ## st.frechet_r,st.frechet_l: are disabled in current SciPy version
    ## st.levy_stable: a lot of time of estimation parameters
    ALL_DISTRIBUTIONS = [        
        st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
        st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
        st.foldcauchy,st.foldnorm, st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
        st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
        st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
        st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,
        st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
        st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
        st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
        st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
    ]
    
    MY_DISTRIBUTIONS = [st.beta, st.expon, st.norm, st.uniform, st.johnsonsb, st.gennorm, st.gausshyper]

    ## Calculate Histogram
    num_bins = danoes_formula(data)
    frequencies, bin_edges = np.histogram(data, num_bins, density=True)
    central_values = [(bin_edges[i] + bin_edges[i+1])/2 for i in range(len(bin_edges)-1)]

    results = {}
    for distribution in MY_DISTRIBUTIONS:
        ## Get parameters of distribution
        params = distribution.fit(data)
        
        ## Separate parts of parameters
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]
    
        ## Calculate fitted PDF and error with fit in distribution
        pdf_values = [distribution.pdf(c, loc=loc, scale=scale, *arg) for c in central_values]
        
        ## Calculate SSE (sum of squared estimate of errors)
        sse = np.sum(np.power(frequencies - pdf_values, 2.0))
        
        ## Build results and sort by sse
        results[distribution] = [sse, arg, loc, scale]
        
    results = {k: results[k] for k in sorted(results, key=results.get)}
    return results
        
def main():
    ## Import data
    data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
    results = fit_data(data)
    plot_histogram(data, results, 5)

if __name__ == "__main__":
    main()

- Sebastian Jose

6

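While many of the above answers are completely valid, no one seems to answer your question completely, specifically this part: "I don't know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness-of-fit test is needed to determine the best model."

The parametric approach

This is the process you're describing: taking some theoretical distribution and fitting its parameters to your data. There are some excellent answers on how to do this.

The non-parametric approach

However, it is also possible to use a non-parametric approach to your problem, which means you do not assume any underlying distribution at all.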

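This uses the so-called empirical distribution function, which equals Fn(x) = SUM( I[X<=x] ) / n, that is, the proportion of values at or below x.
As was pointed out in one of the above answers, what you're interested in is the complementary CDF (cumulative distribution function), which is equal to 1-F(x). It can be shown that the empirical distribution function converges to whatever "true" CDF generated your data.
Furthermore, it is straightforward to construct a 1-alpha confidence interval by:

L(x) = max{Fn(x) - en, 0}
U(x) = min{Fn(x) + en, 1}
en = sqrt( (1/(2n)) * log(2/alpha) )

Then P(L(x) <= F(x) <= U(x)) >= 1-alpha for all x.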
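
The band can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not code from the answer: the function name ecdf_with_band and the synthetic uniform-integer data are made up, and the upper bound is capped at 1, which is the standard form of this (Dvoretzky-Kiefer-Wolfowitz) band.

```python
import numpy as np

def ecdf_with_band(data, x, alpha=0.05):
    """Return (Fn(x), L(x), U(x)) for a 1-alpha confidence band around the ECDF."""
    data = np.asarray(data)
    n = data.size
    fn = np.mean(data <= x)                          # Fn(x): share of observations <= x
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))   # en = sqrt((1/(2n)) * log(2/alpha))
    lower = max(fn - eps, 0.0)
    upper = min(fn + eps, 1.0)
    return fn, lower, upper

# Synthetic stand-in for the 30,000 integers in [0, 47] described in the question
rng = np.random.default_rng(0)
data = rng.integers(0, 48, size=30000)

fn, lower, upper = ecdf_with_band(data, 20)
print("Fn(20) =", fn, "band:", (lower, upper))
# The tail probability 1 - F(x) inherits the band with the bounds swapped
print("P(X > 20) estimate:", 1 - fn, "band:", (1 - upper, 1 - lower))
```

With n = 30,000 the half-width en is below 0.01, so the band around the estimated tail probability is quite tight.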

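I'm quite surprised that PierrOz's answer has not received any upvotes, since it is a completely valid answer to the question, using a non-parametric approach to estimating F(x).

Note that the issue you mention, P(X>=x)=0 for any x>47, is simply a matter of personal preference that might lead you to choose the parametric approach over the non-parametric one. Both approaches, however, are completely valid solutions to your problem.

For further details and proofs of the above statements, I would recommend having a look at "All of Statistics: A Concise Course in Statistical Inference" by Larry A. Wasserman, an excellent book on both parametric and non-parametric inference.

EDIT: Since you specifically asked for some Python examples, it can be done using numpy: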

import numpy as np

def empirical_cdf(data, x):
    # Fn(x): fraction of observations less than or equal to x
    return np.sum(data <= x) / len(data)

def p_value(data, x):
    return 1 - empirical_cdf(data, x)

# Generate some data for demonstration purposes
data = np.floor(np.random.uniform(low=0, high=48, size=30000))

print(empirical_cdf(data, 20))
print(p_value(data, 20))  # This is the value you're interested in

- Martin Skogholt

Isn't the Python code the opposite of Fn(x) = SUM( I[X<=x] ) / n? - nodesr

5

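The simplest way I found was to use the fitter module, which you can install with pip install fitter. All you have to do is import the dataset with pandas. It has a built-in function that searches all 80 distributions from SciPy and gets the best fit to the data by various methods. Example: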

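from fitter import Fitter

# `height` is the data column being fitted (loaded beforehand, e.g. with pandas)
f = Fitter(height, distributions=['gamma','lognorm', "beta","burr","norm"])
f.fit()
f.summary()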

Here the author has provided a list of distributions, since scanning all 80 can be time consuming.

f.get_best(method = 'sumsquare_error')

This will get you the 5 best distributions with their fit criteria:

            sumsquare_error    aic          bic       kl_div
chi2             0.000010  1716.234916 -1945.821606     inf
gamma            0.000010  1716.234909 -1945.821606     inf
rayleigh         0.000010  1711.807360 -1945.526026     inf
norm             0.000011  1758.797036 -1934.865211     inf
cauchy           0.000011  1762.735606 -1934.803414     inf

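You can also pass distributions=get_common_distributions(), which covers about 10 of the most commonly used distributions, and it fits and checks them for you.

It also has a bunch of other functions, such as plotting histograms. The complete documentation can be found here.

It is a seriously underrated module for scientists, engineers and general users alike.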

- user16116851

4

This sounds like a probability density estimation problem to me.

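import numpy as np
from scipy.stats import gaussian_kde

# Short stand-in for the full list [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
occurences = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 47]
values = range(0, 48)
# gaussian_kde expects an array of floats
kde = gaussian_kde(np.asarray(occurences, dtype=float))
# Evaluate the density on the integer grid and normalize it to sum to 1
p = kde(values)
p = p / sum(p)
print("P(x>=1) = %f" % sum(p[1:]))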

Also see http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.
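
Because a KDE is a continuous density, the tail probability can also be obtained by integrating it rather than summing over an integer grid. The sketch below uses gaussian_kde.integrate_box_1d for that; the binomial sample and the helper name kde_tail are assumptions made for illustration, not part of the answer.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic integer-valued data in [0, 47], standing in for the real observations
rng = np.random.default_rng(0)
occurrences = rng.binomial(47, 0.2, size=5000).astype(float)

kde = gaussian_kde(occurrences)

def kde_tail(x):
    """P(X > x) under the KDE: integral of the estimated density from x to infinity."""
    return kde.integrate_box_1d(x, np.inf)

for x in (0, 10, 20, 47, 60):
    print(x, kde_tail(x))
```

Unlike the purely empirical estimate, this gives a small non-zero probability just above the largest observed value, which decays smoothly with x.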

- emre

2

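For future readers: this solution (or at least the idea) provides the simplest answer to the OP's question ("what is the p-value"). It would be interesting to know how it compares with some of the more involved methods that fit a known distribution. - Greg

Does Gaussian kernel regression work for all distributions? - user7345804

@mikey As a general rule, no regression works for all distributions. They aren't bad, though. - TheEnvironmentalist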

4

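You could store your data in a dictionary where the keys are the numbers between 0 and 47 and the values are the number of occurrences of each key in your original list.

Thus the probability p(x) will be the sum of the values for all keys greater than x, divided by 30000.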
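
A minimal sketch of this dictionary approach, using collections.Counter. The short list and the function name tail_probability_from_counts are placeholders for illustration; with the real data the total would be the roughly 30,000 observations.

```python
from collections import Counter

def tail_probability_from_counts(counts, x):
    """P(X > x): occurrences of all keys greater than x, divided by the total count."""
    total = sum(counts.values())
    return sum(c for value, c in counts.items() if value > x) / total

# Small stand-in for the original list of integers between 0 and 47
data = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
counts = Counter(data)  # maps each value to its number of occurrences

for x in (0, 1, 4, 47):
    print(x, tail_probability_from_counts(counts, x))
```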

- pierroz

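In this case, p(x) will be the same (equal to 0) for any value greater than 47. I need a continuous probability distribution. - s_sherly

2

@s_sherly - It would probably be a good thing if you could edit and clarify your question, because the "probability of seeing greater values", as you put it, IS zero for values above the highest value in the pool. - mac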

- tmthydvnprt · Accepted Answer

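Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification of Saullo's answer that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and the error is determined. The distribution with the least error is returned.

All Distributions

Best Fit Distribution

Example Code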

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Best holders
    best_distributions = []

    # Estimate distribution parameters from data
    for ii, distribution in enumerate([d for d in _distn_names if not d in ['levy_stable', 'studentized_range']]):

        print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))

        distribution = getattr(st, distribution)

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')
                
                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                
                # if an axis was passed in, add this fit to the plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # identify if this distribution is better
                best_distributions.append((distribution, params, sse))
        
        except Exception:
            pass

    
    return sorted(best_distributions, key=lambda x:x[2])

def make_pdf(dist, params, size=10000):
    """Generate distribution's Probability Density Function"""

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams['axes.prop_cycle'])[1]['color'])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distributions = best_fit_distribution(data, 200, ax)
best_dist = best_distributions[0]

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')

# Make PDF with best params 
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)

param_names = (best_dist[0].shapes + ', loc, scale').split(', ') if best_dist[0].shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = '{}({})'.format(best_dist[0].name, param_str)

ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')
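
To tie a fitted distribution back to the original question, the p-value (the probability of seeing a greater value) is the survival function sf of the fitted distribution. The following is a self-contained sketch under stated assumptions: it fits st.gamma to a synthetic sample rather than reusing best_dist from the code above, and the helper name p_value is made up for illustration.

```python
import numpy as np
import scipy.stats as st

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
data = rng.gamma(shape=3.0, scale=4.0, size=5000)

# Fit one candidate distribution; in practice this would be the best-ranked one
dist = st.gamma
params = dist.fit(data)  # (shape, loc, scale)

def p_value(x):
    """P(X > x) under the fitted distribution (survival function, 1 - CDF)."""
    return dist.sf(x, *params)

for x in (0, 10, 30, 60):
    print(x, p_value(x))
```

With the results of best_fit_distribution, the equivalent call would be best_dist[0].sf(x, *best_dist[1]), since the parameter tuple is ordered as shape parameters, loc, scale.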