Finding the best-fitting distribution with a chi-squared goodness-of-fit test in Python

Question

5

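Given a set of data values, I am trying to find the theoretical distribution that best describes the data. After several days of research, I came up with the following Python code.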

import numpy as np
import csv
import pandas as pd
import scipy.stats as st
import math
import sys
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def fit_to_all_distributions(data):
    dist_names = ['fatiguelife', 'invgauss', 'johnsonsu', 'johnsonsb', 'lognorm', 'norminvgauss', 'powerlognorm', 'exponweib','genextreme', 'pareto']

    params = {}
    for dist_name in dist_names:
        try:
            dist = getattr(st, dist_name)
            param = dist.fit(data)

            params[dist_name] = param
        except Exception:
            print("Error occurred in fitting")
            params[dist_name] = "Error"

    return params 


def get_best_distribution_using_chisquared_test(data, params):

    histo, bin_edges = np.histogram(data, bins='auto')
    number_of_bins = len(bin_edges) - 1
    observed_values = histo

    dist_names = ['fatiguelife', 'invgauss', 'johnsonsu', 'johnsonsb', 'lognorm', 'norminvgauss', 'powerlognorm', 'exponweib','genextreme', 'pareto']

    dist_results = []

    for dist_name in dist_names:

        param = params[dist_name]
        if (param != "Error"):
            # Applying the chi-squared test
            arg = param[:-2]
            loc = param[-2]
            scale = param[-1]
            cdf = getattr(st, dist_name).cdf(bin_edges, loc=loc, scale=scale, *arg)
            expected_values = len(data) * np.diff(cdf)
            c , p = st.chisquare(observed_values, expected_values, ddof=number_of_bins-len(param))
            dist_results.append([dist_name, c, p])


    # select the best fitted distribution
    best_dist, best_c, best_p = None, sys.maxsize, 0

    for item in dist_results:
        name = item[0]
        c = item[1]
        p = item[2]
        if (not math.isnan(c)):
            if (c < best_c):
                best_c = c
                best_dist = name
                best_p = p

    # print the name of the best fit and its p value

    print("Best fitting distribution: " + str(best_dist))
    print("Best c value: " + str(best_c))
    print("Best p value: " + str(best_p))
    print("Parameters for the best fit: " + str(params[best_dist]))

    return best_dist, best_c, params[best_dist], dist_results

I then tested this code as follows:

a, m = 3., 2.
values = (np.random.pareto(a, 1000) + 1) * m
data = pd.Series(values)
params = fit_to_all_distributions(data)
best_dist_chi, best_chi, params_chi, dist_results_chi = get_best_distribution_using_chisquared_test(values, params)

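Because the data points were generated from a Pareto distribution, the code should return pareto as the best-fitting distribution, with a sufficiently large p-value (p > 0.05).

But this is the output I get: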

Best fitting distribution: genextreme
Best c value: 106.46087793622216
Best p value: 7.626303538461713e-24
Parameters for the best fit: (-0.7664124294696955, 2.3217378846757164, 0.3711562696710188)

Is there anything wrong with my implementation of the chi-squared goodness-of-fit test?

- Pasindu Tennage

What is st in your code? - Joe

The scipy.stats module - Pasindu Tennage

The arguments passed to cdf can be simplified to cdf(bin_edges, *param), as with the other scipy.stats functions such as ppf and pdf. - Kevin Zhu
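A minimal sketch of the simplification suggested in the comment above. The parameter tuple and bin edges are made-up values for illustration, assuming the (shape, loc, scale) ordering that fit returns for scipy.stats distributions:

```python
import numpy as np
import scipy.stats as st

# Hypothetical fitted parameters in the order returned by st.pareto.fit:
# (shape b, loc, scale). These numbers are illustrative only.
param = (3.0, 0.0, 2.0)
bin_edges = np.linspace(2.0, 10.0, 9)

# Long form, as written in the question.
arg, loc, scale = param[:-2], param[-2], param[-1]
cdf_long = st.pareto.cdf(bin_edges, *arg, loc=loc, scale=scale)

# Short form: loc and scale are accepted positionally after the shape parameters.
cdf_short = st.pareto.cdf(bin_edges, *param)

same = np.allclose(cdf_long, cdf_short)
print(same)
```

Both calls should give the same CDF values, so the short form is only a readability change.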

2 Answers

1

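The Pareto function you use to generate the random numbers is not the same as the one you use to fit the data.

The first one is from NumPy, whose documentation states:

Draw samples from a Pareto II or Lomax distribution with specified shape. The Lomax or Pareto II distribution is a shifted Pareto distribution. The classical Pareto distribution can be obtained from the Lomax distribution by adding 1 and multiplying by the scale parameter m.

The pareto function you use for fitting is from SciPy, and I guess it uses a different definition:

The probability density above is defined in the "standardized" form. To shift and/or scale the distribution use the loc and scale parameters.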
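To make the relationship between the two definitions concrete, here is a rough sketch that compares samples generated as in the question with scipy.stats.pareto, once in its standardized form and once with scale=m. The seed, sample size, and variable names are my own choices, not part of the original post:

```python
import numpy as np
import scipy.stats as st

a, m = 3.0, 2.0
rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily

# NumPy draws from the Lomax (Pareto II) distribution; adding 1 and
# multiplying by m gives the classical Pareto distribution.
values = (rng.pareto(a, 10_000) + 1) * m

# Kolmogorov-Smirnov distance to SciPy's pareto with scale=m ...
ks_scaled = st.kstest(values, st.pareto(a, scale=m).cdf).statistic

# ... and to the standardized form (loc=0, scale=1).
ks_standard = st.kstest(values, st.pareto(a).cdf).statistic

print(ks_scaled, ks_standard)
```

The distance to the scaled distribution should be small and the distance to the standardized one large, which suggests the generated samples correspond to st.pareto with shape a and scale m rather than to the standardized form.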

- Joe

If I use the same Pareto distribution, as follows,

b = 2.62
values = st.pareto.rvs(b, size=1000)

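it still gives a very small p-value. Is something wrong with my chi-squared test implementation? - Pasindu Tennage

From simple to complex :) Please write a very simple example using a normal distribution and compute its chi-squared value, as you did in your example. Then modify your code to draw numbers from a normal distribution and see whether it works. - Joe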
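One possible version of the simple normal-distribution check suggested above. This is only a sketch under my own assumptions: it uses equal-probability bins derived from the fitted distribution instead of np.histogram with bins='auto', so that every bin has the same expected count and the observed and expected totals match exactly. The seed, sample size, and bin count are arbitrary:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # arbitrary seed
sample = rng.normal(loc=10.0, scale=2.0, size=1000)

# Estimate the two parameters of the normal distribution from the data.
mu, sigma = st.norm.fit(sample)

# Build bins that each hold the same probability under the fitted model.
n_bins = 20
inner_probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
inner_edges = st.norm.ppf(inner_probs, loc=mu, scale=sigma)

# Count observations per bin; the outer bins extend to -inf and +inf.
observed = np.bincount(np.searchsorted(inner_edges, sample), minlength=n_bins)
expected = np.full(n_bins, len(sample) / n_bins)

# Two parameters were estimated, so ddof=2 and the test uses
# n_bins - 1 - 2 degrees of freedom.
chi2_stat, p_value = st.chisquare(observed, expected, ddof=2)
print(chi2_stat, p_value)
```

If the test is set up correctly, the p-value for data that really are normal should usually not be tiny. The same structure can then be reused with another distribution by swapping st.norm and adjusting ddof to the number of fitted parameters.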

- Pasindu Tennage · Accepted Answer

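The documentation for scipy.stats.chisquare (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) describes ddof ("Delta degrees of freedom") as an adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. The default value of ddof is 0.

The code in the question passes ddof=number_of_bins-len(param), which leaves only len(param) - 1 degrees of freedom. Since ddof should be the number of parameters estimated from the data, the code should be corrected as follows.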

c , p = st.chisquare(observed_values, expected_values, ddof=len(param))
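To illustrate how much the choice of ddof changes the p-value, here is a small sketch with made-up counts. The observed and expected arrays and the parameter count of 3 are illustrative assumptions, not values from the question:

```python
import numpy as np
import scipy.stats as st

# Made-up counts for k = 8 bins; both arrays sum to 160.
observed = np.array([18, 22, 25, 15, 20, 20, 21, 19])
expected = np.full(8, 20.0)

k = len(observed)
n_params = 3  # e.g. shape, loc and scale of a fitted distribution

# Corrected call: ddof is the number of estimated parameters,
# so the p-value uses k - 1 - n_params degrees of freedom.
stat, p_correct = st.chisquare(observed, expected, ddof=n_params)

# Call as written in the question: ddof = k - n_params,
# which leaves only n_params - 1 degrees of freedom.
_, p_question = st.chisquare(observed, expected, ddof=k - n_params)

# The same p-values computed directly from the chi-squared distribution.
p_correct_manual = st.chi2.sf(stat, k - 1 - n_params)
p_question_manual = st.chi2.sf(stat, n_params - 1)

print(stat, p_correct, p_question)
```

The statistic is identical in both calls; only the reference distribution changes, and with too few degrees of freedom the p-value comes out smaller than it should. Note also that recent SciPy versions raise an error when the observed and expected totals differ, so expected counts computed from CDF differences over the histogram range may need to be rescaled to the observed total first.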