How to remove insignificant categorical interaction terms in Python statsmodels

Question

python machine-learning linear-regression statsmodels

4

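In statsmodels, adding interaction terms is easy. However, not all of the interactions are significant. My question is how to drop the ones that are insignificant, for example the airports interaction for Kootenay.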

# -*- coding: utf-8 -*-
import pandas as pd
import statsmodels.formula.api as sm


if __name__ == "__main__":

    # Read data
    census_subdivision_without_lower_mainland_and_van_island = pd.read_csv('../data/augmented/census_subdivision_without_lower_mainland_and_van_island.csv')

    # Fit all data
    fit = sm.ols(formula="instagram_posts ~ airports * C(CNMCRGNNM) + ports_and_ferry_terminals + railway_stations + accommodations + visitor_centers + festivals + attractions + C(CNMCRGNNM) + C(CNSSSBDVS3)", data=census_subdivision_without_lower_mainland_and_van_island).fit()
    print(fit.summary())

- ZHU

It would be very helpful if you could provide a sample of the data and a description of the desired output. - vander

2 Answers

1

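I tried to recreate some data, focusing on the variables in the interaction. I'm not sure whether the goal is just to get the values or whether a specific format is needed, but here is an example that solves the problem with pandas (since you import pandas in the original post):

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
np.random.seed(2)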

df = pd.DataFrame()
df['instagram_posts'] = np.random.rand(50)
df['airports'] = np.random.rand(50)
df['CNMCRGNNM'] = np.random.choice(['Kootenay','Nechako','North Coast','Northeast','Thompson-Okanagan'],50)

fit = sm.ols(formula="instagram_posts ~ airports * C(CNMCRGNNM)",data=df).fit()
print(fit.summary())

This is the output:

==============================================================================================================
                                                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
Intercept                                      0.4594      0.159      2.885      0.006       0.138       0.781
C(CNMCRGNNM)[T.Nechako]                       -0.2082      0.195     -1.067      0.292      -0.602       0.186
C(CNMCRGNNM)[T.North Coast]                   -0.1268      0.360     -0.352      0.726      -0.854       0.601
C(CNMCRGNNM)[T.Northeast]                      0.0930      0.199      0.468      0.642      -0.309       0.495
C(CNMCRGNNM)[T.Thompson-Okanagan]              0.1439      0.245      0.588      0.560      -0.351       0.638
airports                                      -0.1616      0.277     -0.583      0.563      -0.722       0.398
airports:C(CNMCRGNNM)[T.Nechako]               0.7870      0.343      2.297      0.027       0.094       1.480
airports:C(CNMCRGNNM)[T.North Coast]           0.3008      0.788      0.382      0.705      -1.291       1.893
airports:C(CNMCRGNNM)[T.Northeast]            -0.0104      0.348     -0.030      0.976      -0.713       0.693
airports:C(CNMCRGNNM)[T.Thompson-Okanagan]    -0.0311      0.432     -0.072      0.943      -0.904       0.842

Change alpha to your preferred significance level:

alpha = 0.05
df = pd.DataFrame(data = [x for x in fit.summary().tables[1].data[1:] if float(x[4]) < alpha], columns = fit.summary().tables[1].data[0])

The DataFrame df now contains the rows of the original table that are significant at the chosen alpha. In this example, those are Intercept and airports:C(CNMCRGNNM)[T.Nechako].
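As a possible alternative to parsing the summary table, the fitted results object also exposes the p-values as a pandas Series (`fit.pvalues`), so the same filter can be written directly on it. The following is only a minimal sketch on freshly generated synthetic data; the helper name `significant_terms` and the `demo` variables are made up for illustration and are not part of statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def significant_terms(fit, alpha=0.05):
    """Hypothetical helper: coefficients and p-values of terms with p < alpha."""
    table = pd.DataFrame({"coef": fit.params, "p_value": fit.pvalues})
    return table[table["p_value"] < alpha]


# Synthetic data, similar in spirit to the example above (assumed, not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "instagram_posts": rng.random(50),
    "airports": rng.random(50),
    "CNMCRGNNM": rng.choice(["Kootenay", "Nechako", "North Coast"], 50),
})

demo_fit = smf.ols("instagram_posts ~ airports * C(CNMCRGNNM)", data=demo).fit()
sig = significant_terms(demo_fit, alpha=0.05)
print(sig)
```

Because the values stay numeric, there is no need to convert strings back with `float(...)`, and the result keeps the term names as its index.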

- vander


- Muriel · Accepted Answer

You might want to consider removing features one at a time (starting with the least significant one). This is because the significance of one feature depends on whether the other features are present or absent. The code below will do this for you (assuming you have already defined X and y):

import statsmodels.api as sm
import pandas as pd

def remove_most_insignificant(df, results):
    # label of the term with the largest p-value, i.e. the least significant one
    # (Series.iteritems() was removed in pandas 2.0, so use idxmax() instead)
    max_p_value = results.pvalues.idxmax()
    # this is the feature you want to drop:
    df.drop(columns = max_p_value, inplace = True)
    return df

insignificant_feature = True
while insignificant_feature:
    model = sm.OLS(y, X)
    results = model.fit()
    significant = [p_value < 0.05 for p_value in results.pvalues]
    if all(significant):
        insignificant_feature = False
    else:
        if X.shape[1] == 1:  # the only remaining feature is itself insignificant
            print('No significant features found')
            results = None
            insignificant_feature = False
        else:
            X = remove_most_insignificant(X, results)
# results is None when no significant feature was found
if results is not None:
    print(results.summary())