Variance inflation factor in Python

Question

62

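I am trying to calculate the variance inflation factor (VIF) for each column of a simple dataset in Python:

I have already done this in R using the vif function from the usdm library, which gives the following results: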

library(usdm)

a <- c(1, 1, 2, 3, 4)
b <- c(2, 2, 3, 2, 1)
c <- c(4, 6, 7, 8, 9)
d <- c(4, 3, 4, 5, 4)

df <- data.frame(a, b, c, d)
vif_df <- vif(df)
print(vif_df)

Variables   VIF
   a        22.95
   b        3.00
   c        12.95
   d        3.00

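However, when I do the same in Python using the statsmodels variance_inflation_factor function, my results are:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

a = [1, 1, 2, 3, 4]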
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

ck = np.column_stack([a, b, c, d])

vif = [variance_inflation_factor(ck, i) for i in range(ck.shape[1])]
print(vif)

Variables   VIF
   a        47.136986301369774
   b        28.931506849315081
   c        80.31506849315096
   d        40.438356164383549

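Even though the inputs are identical, the results are very different. In general, the results from the statsmodels VIF function seem to be wrong, but I am not sure whether that is because of the way I am calling it or because of an issue with the function itself.

I was hoping someone could help me figure out whether I am calling the statsmodels function incorrectly, or explain the discrepancy in the results. If it is an issue with the function, are there any VIF alternatives in Python?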

- Nizag

9 Answers

37

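I believe the reason is a difference in how the OLS regression is set up in Python. The OLS used in the Python variance inflation factor calculation does not add an intercept by default. You definitely want an intercept in there, however.

What you need to do is add one more column to your matrix ck, filled with ones, to represent a constant. This will be the intercept term of the equation. Once this is done, your values should match up properly.

Edited: replaced zeros with ones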
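The fix described above can be sketched as follows. This is a minimal illustration using the data from the question; the name ck_const is an assumption introduced here. The constant column is appended last, and VIFs are computed only for the four real predictors.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

ck = np.column_stack([a, b, c, d])

# Append a column of ones so every auxiliary regression includes an intercept.
# (ck_const is a hypothetical name used only for this sketch.)
ck_const = np.column_stack([ck, np.ones(ck.shape[0])])

# Compute the VIF for each real predictor; the constant column is skipped.
vif = [variance_inflation_factor(ck_const, i) for i in range(ck.shape[1])]
print(np.round(vif, 2))
```

With the constant in place, the values should line up with the R results quoted in the question (22.95, 3.00, 12.95, 3.00).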

- Drverzal

Subtracting the mean from every variable might work similarly. - Josef
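A rough sketch of the demeaning idea from this comment, assuming the data from the question: centering every column removes the need for an intercept in the auxiliary regressions, so no constant column is added. The name ck_centered is an assumption introduced for this example.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

ck = np.column_stack([a, b, c, d]).astype(float)

# Subtract each column's mean; with centered data, a regression without an
# intercept has the same residuals as an intercept regression on the raw data.
ck_centered = ck - ck.mean(axis=0)

vif_centered = [
    variance_inflation_factor(ck_centered, i) for i in range(ck_centered.shape[1])
]
print(np.round(vif_centered, 2))
```

On this dataset the centered version should agree with the constant-column approach, although adding an explicit constant is probably the less surprising option for other readers of the code.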

3

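Typo: the constant column should be filled with ones, not zeros. - Josef

Good catch on my typo, thanks for pointing it out. I have edited the original post and fixed the error. - Drverzal

Makes sense. Adding a column of ones solved the problem. Thanks! - Nizag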

26

For future readers like me:

import numpy as np

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

ck = np.column_stack([a, b, c, d])
# correlation matrix of the columns
cc = np.corrcoef(ck, rowvar=False)
# the diagonal of the inverse correlation matrix holds the VIFs
VIF = np.linalg.inv(cc)
VIF.diagonal()

This code gives

array([22.95,  3.  , 12.95,  3.  ])

[Edit]

In response to a comment, I tried to use a DataFrame as much as possible (numpy is still required to invert the matrix).

import pandas as pd
import numpy as np

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

df = pd.DataFrame({'a':a,'b':b,'c':c,'d':d})
df_cor = df.corr()
pd.DataFrame(np.linalg.inv(df.corr().values), index = df_cor.index, columns=df_cor.columns)

This code gives

       a            b           c           d
a   22.950000   6.453681    -16.301917  -6.453681
b   6.453681    3.000000    -4.080441   -2.000000
c   -16.301917  -4.080441   12.950000   4.080441
d   -6.453681   -2.000000   4.080441    3.000000

The diagonal elements give the VIFs.

- T_T

1

Could you add a solution that takes a DataFrame as input, rather than only supporting NumPy arrays? - steven

2

Looks good. To get the VIFs as a Series: vifs = pd.Series(np.linalg.inv(df.corr().values).diagonal(), index=df_cor.index) - Alexander

Are the VIFs the diagonal elements of the inverse correlation matrix? Yes, see this link: http://documentation.statsoft.com/STATISTICAHelp.aspx?path=glossary/GlossaryTwo/V/VarianceInflationFactorVIF - Shuai Liu
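To make the claim in this comment concrete, here is a small numerical check on the data from the question. It compares the diagonal of the inverse correlation matrix with 1 / (1 - R²) from explicit auxiliary regressions with an intercept. The helper vif_by_regression is a hypothetical name written for this sketch, not part of any library.

```python
import numpy as np

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
X = np.column_stack([a, b, c, d]).astype(float)

# Route 1: diagonal of the inverse correlation matrix.
vif_inv = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

def vif_by_regression(X, j):
    """Regress column j on the other columns plus an intercept (hypothetical helper)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    centered = y - y.mean()
    r_squared = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r_squared)

# Route 2: VIF = 1 / (1 - R^2) from each auxiliary regression.
vif_reg = np.array([vif_by_regression(X, j) for j in range(X.shape[1])])

print(np.round(vif_inv, 2))
print(np.round(vif_reg, 2))
```

Both routes should print the same four numbers. The matrix-inverse route is shorter, but it fails outright when the correlation matrix is singular, which is exactly the perfectly collinear case.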

17

If you would rather not deal with variance_inflation_factor and add_constant, consider the following two functions.

1. Using formulas in statsmodels:

import pandas as pd
import statsmodels.formula.api as smf

def get_vif(exogs, data):
    '''Return VIF (variance inflation factor) DataFrame

    Args:
    exogs (list): list of exogenous/independent variables
    data (DataFrame): the df storing all variables

    Returns:
    VIF and Tolerance DataFrame for each exogenous variable

    Notes:
    Assume we have a list of exogenous variable [X1, X2, X3, X4].
    To calculate the VIF and Tolerance for each variable, we regress
    each of them against other exogenous variables. For instance, the
    regression model for X3 is defined as:
                        X3 ~ X1 + X2 + X4
    And then we extract the R-squared from the model to calculate:
                    VIF = 1 / (1 - R-squared)
                    Tolerance = 1 - R-squared
    The cutoff to detect multicollinearity:
                    VIF > 10 or Tolerance < 0.1
    '''

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # create formula for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        formula = f"{exog} ~ {' + '.join(not_exog)}"

        # extract r-squared from the fit
        r_squared = smf.ols(formula, data=data).fit().rsquared

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif

2. Using LinearRegression from sklearn:

# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from sklearn.linear_model import LinearRegression

def sklearn_vif(exogs, data):

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif

Example:

import seaborn as sns

df = sns.load_dataset('car_crashes')
exogs = ['alcohol', 'speeding', 'no_previous', 'not_distracted']

[In] %%timeit -n 100
get_vif(exogs=exogs, data=df)

[Out]
                      VIF   Tolerance
alcohol          3.436072   0.291030
no_previous      3.113984   0.321132
not_distracted   2.668456   0.374749
speeding         1.884340   0.530690

69.6 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

[In] %%timeit -n 100
sklearn_vif(exogs=exogs, data=df)

[Out]
                      VIF   Tolerance
alcohol          3.436072   0.291030
no_previous      3.113984   0.321132
not_distracted   2.668456   0.374749
speeding         1.884340   0.530690

15.7 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

- steven

The cutoff to detect multicollinearity is VIF > 10 or Tolerance < 0.1, so you have to change Tolerance < 0.2 to Tolerance < 0.1. - Sandra Guerrero

@SandraGuerrero That was indeed a typo. - steven
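Since Tolerance is defined in the answer as 1 - R² and VIF as 1 / (1 - R²), the two cutoffs are the same condition stated twice, which is why 0.2 was inconsistent with VIF > 10. A small sketch, using the VIF values reported earlier in this thread as hard-coded inputs (the column name flagged is an assumption for this example):

```python
import pandas as pd

# VIF values for the question's data, as reported elsewhere in this thread.
vifs = pd.Series({"a": 22.95, "b": 3.0, "c": 12.95, "d": 3.0}, name="VIF")

# Tolerance is the reciprocal of VIF, so VIF > 10 is the same as Tolerance < 0.1.
report = pd.DataFrame({"VIF": vifs, "Tolerance": 1.0 / vifs})
report["flagged"] = report["VIF"] > 10

print(report)
```

The cutoff of 10 is the rule of thumb quoted in the answer; other answers in this thread use 5, so treat the threshold as a judgment call rather than a fixed rule.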

Thank you for the effort you put into explaining this. Much appreciated! - Sudheer Rao

3

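It is a bit late, but I would like to make some modifications to the answers given. If we use @Chef1075's solution to remove multicollinearity and arrive at the best dataset, we lose every correlated variable, when we only need to remove one of them. To do that, I adapted @steven's answer into the following solution: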

import pandas as pd
from sklearn.linear_model import LinearRegression

def sklearn_vif(exogs, data):
    '''
    This function calculates the variance inflation factor the sklearn way.
    It is a comparatively faster process.
    '''
    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif

df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4, 1],
     'b': [2, 2, 3, 2, 1, 3],
     'c': [4, 6, 7, 8, 9, 5],
     'd': [4, 3, 4, 5, 4, 6],
     'e': [8, 8, 14, 15, 17, 20]}
)

# Repeatedly drop the feature with the highest VIF until no VIF exceeds 5.
df_vif = sklearn_vif(exogs=df.columns, data=df).sort_values(by='VIF', ascending=False)
while (df_vif.VIF > 5).any():
    red_df_vif = df_vif.drop(df_vif.index[0])
    df = df[red_df_vif.index]
    df_vif = sklearn_vif(exogs=df.columns, data=df).sort_values(by='VIF', ascending=False)

print(df)

   d  c  b
0  4  4  2
1  3  6  2
2  4  7  3
3  5  8  2
4  4  9  1
5  6  5  3

- kasraful

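So in this case, columns d, c and b are the ones that do not cause multicollinearity, right? - AlSub

1

@AlvaroMartinez. Correct. - kasraful

@MdAsrafulKabir, may I ask why you do red_df_vif= df_vif.drop(df_vif.index[0])? So you compute the VIFs and sort them from highest to lowest; if the highest is greater than 5, you drop it and recompute everything? - Ciaran O Brien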

2

An example with the Boston data:

VIF is computed through auxiliary regressions, so it does not depend on the actual fit.

See below:

from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Break into left and right hand side; y and X
# (`boston` is assumed to be a DataFrame holding the Boston housing data, loaded beforehand)
y, X = dmatrices(formula="medv ~ crim + zn + nox + ptratio + black + rm ", data=boston, return_type="dataframe")

# For each Xi, calculate VIF
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Fit X to y
result = sm.OLS(y, X).fit()
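The point that VIF does not depend on the fit can be illustrated on a small, self-contained example. The sketch below reuses the four predictors from the question and invents two unrelated response columns, y1 and y2, purely for illustration; vif_from_formula is a hypothetical helper, not a statsmodels or patsy function.

```python
import pandas as pd
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "a": [1, 1, 2, 3, 4],
    "b": [2, 2, 3, 2, 1],
    "c": [4, 6, 7, 8, 9],
    "d": [4, 3, 4, 5, 4],
    # Two made-up responses; their values are arbitrary.
    "y1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y2": [5.0, 3.0, 8.0, 1.0, 2.0],
})

def vif_from_formula(formula, data):
    """Build the design matrix from a formula and return a VIF per column."""
    _, X = dmatrices(formula, data=data, return_type="dataframe")
    return pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )

# Same right-hand side, different responses: the VIFs only look at X.
vif_y1 = vif_from_formula("y1 ~ a + b + c + d", df)
vif_y2 = vif_from_formula("y2 ~ a + b + c + d", df)

print(vif_y1)
print(vif_y2)
```

Because patsy adds an Intercept column to the design matrix, the constant discussed in the other answers is already present here.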

- s_mj

2

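I wrote this function based on some posts I saw on Stack Overflow and Cross Validated. It shows the features that are above the threshold and returns a new DataFrame with those features removed.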

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def calculate_vif_(df, thresh=5):
    '''
    Calculates VIF for each feature in a pandas dataframe
    A constant must be added to variance_inflation_factor or the results will be incorrect

    :param df: the pandas dataframe containing only the predictor features, not the response variable
    :param thresh: the max VIF value before the feature is removed from the dataframe
    :return: dataframe with features removed
    '''
    const = add_constant(df)
    vif_df = pd.Series([variance_inflation_factor(const.values, i)
                        for i in range(const.shape[1])],
                       index=const.columns).to_frame()

    # sort by VIF, drop the constant, and keep only features above the threshold
    vif_df = vif_df.sort_values(by=0, ascending=False).rename(columns={0: 'VIF'})
    vif_df = vif_df.drop('const')
    vif_df = vif_df[vif_df['VIF'] > thresh]

    print('Features above VIF threshold:\n')
    print(vif_df)

    col_to_drop = list(vif_df.index)

    for i in col_to_drop:
        print('Dropping: {}'.format(i))
        df = df.drop(columns=i)

    return df

- Chef1075

7

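Simply dropping every variable whose VIF is above the threshold is wrong. The correct approach is to drop the variable with the highest VIF, recompute the VIFs of the remaining variables, and repeat until no remaining variable has a VIF above the threshold. For example, suppose x3 = x2 + x1; if we simply drop all variables with a high VIF, we would drop all of x1, x2 and x3, and might lose an important variable. - Huanfa Chen

Yes, I agree with Huanfa. @chef and others: if you simply drop every column above the VIF threshold from the initial run, you will drop more variables than you need to. As Huanfa mentioned, this needs to be done iteratively. - veg2020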

1

Here is the code using a pandas DataFrame:

Create the data

import numpy as np

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

Create the DataFrame

import pandas as pd
data = pd.DataFrame()
data["a"] = a
data["b"] = b
data["c"] = c
data["d"] = d

Compute the VIF

cc = np.corrcoef(data, rowvar=False)
VIF = np.linalg.inv(cc)
VIF.diagonal()

Result

array([22.95, 3., 12.95, 3.])

- Max Alonzo

0

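Another solution. The following code gives exactly the same VIF results as the R car package.

import pandas as pd
from sklearn.linear_model import LinearRegression

def calc_reg_return_vif(X, y):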
    """
    Utility function to calculate the VIF. This section calculates the linear
    regression inverse R squared.

    Parameters
    ----------
    X : DataFrame
        Input data.
    y : Series
        Target.

    Returns
    -------
    vif : float
        Calculated VIF value.

    """
    X = X.values
    y = y.values

    if X.shape[1] == 1:
        print("Note, there is only one predictor here")
        X = X.reshape(-1, 1)
    reg = LinearRegression().fit(X, y)
    vif = 1 / (1 - reg.score(X, y))

    return vif


def calc_vif_from_scratch(df):
    """
    Calculating VIF using function from scratch

    Parameters
    ----------
    df : DataFrame
        without target variable.

    Returns
    -------
    vif : DataFrame
        giving the feature - VIF value pair.

    """

    vif = pd.DataFrame()

    vif_list = []
    for feature in list(df.columns):
        y = df[feature]
        X = df.drop(feature, axis="columns")
        vif_list.append(calc_reg_return_vif(X, y))
    vif["feature"] = df.columns
    vif["VIF"] = vif_list
    return vif

I have tested this on the Titanic dataset. You can find the full example here: https://github.com/tulicsgabriel/Variance-Inflation-Factor-VIF-

- b0zg0r

- Alexander · Accepted Answer

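As mentioned by others, and by the author Josef Perktold in this post, the function variance_inflation_factor expects a constant to be present in the matrix of explanatory variables. You can use add_constant from statsmodels to add the required constant to the DataFrame before passing its values to the function.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor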
from statsmodels.tools.tools import add_constant

df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)

X = add_constant(df)
>>> pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)
const    136.875
a         22.950
b          3.000
c         12.950
d          3.000
dtype: float64

I believe you could also add the constant as the right-most column of the DataFrame using assign:

X = df.assign(const=1)
>>> pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)
a         22.950
b          3.000
c         12.950
d          3.000
const    136.875
dtype: float64

The source code itself is rather concise:

def variance_inflation_factor(exog, exog_idx):
    """
    exog : ndarray, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression
    exog_idx : int
        index of the exogenous variable in the columns of exog
    """
    k_vars = exog.shape[1]
    x_i = exog[:, exog_idx]
    mask = np.arange(k_vars) != exog_idx
    x_noti = exog[:, mask]
    r_squared_i = OLS(x_i, x_noti).fit().rsquared
    vif = 1. / (1. - r_squared_i)
    return vif

It is also rather simple to modify the code to return all of the VIFs as a Series:

import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant

def variance_inflation_factors(exog_df):
    '''
    Parameters
    ----------
    exog_df : dataframe, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression.

    Returns
    -------
    vif : Series
        variance inflation factors
    '''
    exog_df = add_constant(exog_df)
    vifs = pd.Series(
        [1 / (1. - OLS(exog_df[col].values, 
                       exog_df.loc[:, exog_df.columns != col].values).fit().rsquared) 
         for col in exog_df],
        index=exog_df.columns,
        name='VIF'
    )
    return vifs

>>> variance_inflation_factors(df)
const    136.875
a         22.950
b          3.000
c         12.950
d          3.000
Name: VIF, dtype: float64

Following @T_T's solution, one can also simply do the following:

vifs = pd.Series(np.linalg.inv(df.corr().to_numpy()).diagonal(), 
                 index=df.columns, 
                 name='VIF')