How to calculate the correlation between all columns and remove highly correlated ones using pandas?

Question

87

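Before machine learning modelling, it is advisable to remove highly correlated feature columns. How can I calculate the column-wise correlation and drop the columns or descriptors whose correlation exceeds a threshold of 0.8, while keeping the headers in the reduced dataset? Here is an example dataset: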

 GA      PN       PC     MBP      GR     AP   
0.033   6.652   6.681   0.194   0.874   3.177    
0.034   9.039   6.224   0.194   1.137   3.4      
0.035   10.936  10.304  1.015   0.911   4.9      
0.022   10.11   9.603   1.374   0.848   4.566    
0.035   2.963   17.156  0.599   0.823   9.406    
0.033   10.872  10.244  1.015   0.574   4.871     
0.035   21.694  22.389  1.015   0.859   9.259     
0.035   10.936  10.304  1.015   0.911   4.5

Please help.

- jax

Feature-engine has a built-in DropCorrelatedFeatures() transformer that does the heavy lifting for you and is compatible with sklearn. Its features_to_drop_ attribute shows which features it will drop. - kevin_theinfinityfund

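Related: this answer implements R's findCorrelation function in pandas. It identifies correlated columns and returns the labels of all but one of them. The existing answers here drop all correlated columns, which means too many columns get dropped. - undefined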
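To illustrate the distinction raised in the comment above, here is a minimal sketch of one common pattern that keeps one column out of each correlated pair by looking only at the upper triangle of the correlation matrix. It is not the R findCorrelation algorithm; the function name `drop_correlated` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Hypothetical helper: drop the later column of each highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so every pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy data (made up): "b" is nearly 2 * "a", "c" is only weakly related to either
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = drop_correlated(demo)
print(list(reduced.columns))  # ['a', 'c']
```

Because only the upper triangle is inspected, "a" survives and "b" is dropped; which member of a pair is kept depends purely on column order, and the column headers are preserved in the result.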

28 Answers

- Mehran · Answer 1

I believe this has to be done iteratively:

import numpy as np

uncorrelated_features = features.copy()

# Loop until there's nothing left to drop
while True:
    # Absolute correlation matrix for the remaining features
    cor = uncorrelated_features.corr().abs()

    # Square matrix of 1s everywhere except on the main diagonal
    # (parentheses are needed for the expression to span two lines)
    zero_main = (np.triu(np.ones(cor.shape), k=1) +
                 np.tril(np.ones(cor.shape), k=-1))

    # Use zero_main to mask out the main diagonal of the correlation matrix
    except_main = cor.where(zero_main.astype(bool))

    # For each column: (max correlation, mean correlation, column name)
    metrics = [(except_main[column].max(), except_main[column].mean(), column) for column in except_main.columns]

    # Sort so that the most suitable candidate to drop ends up at index 0
    metrics.sort(key=lambda x: (x[0], x[1]), reverse=True)

    # Drop the top candidate if it exceeds the threshold; otherwise stop
    if metrics[0][0] > 0.5:
        uncorrelated_features.drop(metrics[0][2], axis=1, inplace=True)
    else:
        break

It is worth mentioning that you may want to customise how I sort the metrics list and/or how I decide whether to drop a column.

- suhail · Answer 2

In my code I needed to drop the columns that have a low correlation with the dependent variable, and I ended up with this code.

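# Assumption (step not shown in the original): to_drop starts as df_h1.corr().abs()
to_drop = pd.DataFrame(to_drop).fillna(True)
# Columns whose correlation with SalePrice is below 0.4
to_drop = list(to_drop[to_drop['SalePrice'] < .4].index)
# drop() returns a new DataFrame, so assign the result
df_h1 = df_h1.drop(to_drop, axis=1)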

df_h1 is my DataFrame and SalePrice is the dependent variable... I think changing the value might make this work for other problems as well.
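Since the snippet above does not show where the correlations come from, here is a self-contained sketch of the same idea: drop the features whose absolute correlation with the target falls below a cutoff. The function name `drop_weak_predictors`, the 0.4 default, and the toy columns are assumptions for illustration only.

```python
import pandas as pd

def drop_weak_predictors(df: pd.DataFrame, target: str, min_corr: float = 0.4) -> pd.DataFrame:
    """Hypothetical helper: drop features weakly correlated with the target column."""
    # Absolute correlation of every feature with the target (target itself excluded)
    target_corr = df.corr()[target].abs().drop(target)
    # Features below the cutoff; NaN correlations compare as False and are kept
    to_drop = target_corr[target_corr < min_corr].index.tolist()
    return df.drop(columns=to_drop)

# Toy data (made up): "area" tracks the price closely, "noise" does not
demo = pd.DataFrame({
    "area": [50, 60, 80, 100, 120],
    "noise": [1, 3, 2, 3, 1],
    "SalePrice": [100, 125, 160, 210, 240],
})
kept = drop_weak_predictors(demo, target="SalePrice")
print(list(kept.columns))  # ['area', 'SalePrice']
```

Note that this addresses a different goal from the question: it filters on correlation with the target, not on correlation between features.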

- Muhammad Waseem · Answer 3

You can use the variance_inflation_factor function from statsmodels to detect multicollinearity in a DataFrame.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif(X):
    # One row per column of X, holding that column's variance inflation factor
    result = pd.DataFrame()
    result['Variables'] = X.columns
    result["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return result

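Here X is a DataFrame. As a common rule of thumb, columns involved in multicollinearity have a VIF greater than 10. A column that can be perfectly reproduced as a linear combination of the other available columns has an infinite VIF. So drop columns one at a time until no infinite or otherwise high VIF values remain.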
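The "drop one column at a time" loop described above can be sketched as follows. To stay dependency-free, this sketch does not call statsmodels; it relies on the identity that, for a regression with an intercept, the VIF of column j equals the j-th diagonal element of the inverse correlation matrix. Values can therefore differ from `variance_inflation_factor` when X has no constant column. The names `vif_from_corr` and `drop_high_vif` are hypothetical, and the sketch assumes no column is an exact linear combination of the others (the matrix inverse would fail in that case).

```python
import numpy as np
import pandas as pd

def vif_from_corr(X: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = X.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = vif_from_corr(X)
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            break
        X = X.drop(columns=worst)
    return X

# Toy data (made up): "c" is almost exactly a + b, so all three VIFs start out large
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": b, "c": a + b + 0.01 * rng.normal(size=200)})
reduced = drop_high_vif(demo)
print(reduced.columns.tolist())
```

Recomputing the VIFs after every drop matters: removing a single column from a collinear group usually brings the remaining VIFs back down, so dropping every column above the cutoff in one pass would remove too much.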

- amin sharifi · Answer 4

I wrote this in my own way, without any for loops, to drop highly correlated columns from a pandas DataFrame.

# Get the correlation matrix of the data (the names say "coVar", but these are correlations)
coVar = df.corr()  # or df.corr().abs() to also catch strong negative correlations
threshold = 0.5
"""
1. .where(coVar != 1.0)       set NaN where the correlation is exactly 1 (the diagonal)
2. .where(coVar >= threshold) set NaN where the correlation is below the threshold
3. .fillna(0)                 fill NaN with 0
4. .sum()                     collapse to a Series: per column, the sum of correlations at or above the threshold
5. > 0                        convert the Series to Boolean
"""

coVarCols = coVar.where(coVar != 1.0).where(coVar >= threshold).fillna(0).sum() > 0

# Negate the Boolean mask, because the columns above the threshold are the ones to delete
coVarCols = ~coVarCols

# Select the remaining columns
df[coVarCols[coVarCols].index]

I hope this helps you handle this with pandas' own functions, without any for loops, which can improve speed on large datasets.

- Chandan · Answer 5

correlatedColumns = []
corr = df.corr()
indices = corr.index
columns = corr.columns
posthreshold = 0.7
negthreshold = -0.7

for c in columns:
    for r in indices:
        # Skip the diagonal; record pairs beyond either threshold
        if c != r and (corr.loc[r, c] > posthreshold or corr.loc[r, c] < negthreshold):
            correlatedColumns.append({"column": c, "row": r, "val": corr.loc[r, c]})

print(correlatedColumns)
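The loop above reports every pair twice, once as (column, row) and once as (row, column). As a hedged alternative, the sketch below lists each pair only once by masking everything except the strict upper triangle before filtering. The function name `correlated_pairs` and the toy data are assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.7):
    """Hypothetical helper: list (row, column, correlation) once per correlated pair."""
    corr = df.corr()
    # True only above the main diagonal, so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Keep strong positive or negative correlations; NaN entries compare as False
    pairs = pairs[pairs.abs() > threshold]
    return [(row, col, val) for (row, col), val in pairs.items()]

# Toy data (made up): only "a" and "b" are strongly correlated
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
found = correlated_pairs(demo)
print(found)
```

Like the original loop, this only reports the pairs; deciding which member of each pair to drop is left to the caller.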

- Celso · Answer 6

This is the approach I used at work last month. It may not be the best or fastest way, but it works well. Here, df is my original pandas DataFrame:

dropvars = []
threshold = 0.95
df_corr = df.corr().stack().reset_index().rename(columns={'level_0': 'Var 1', 'level_1': 'Var 2', 0: 'Corr'})
df_corr = df_corr[(df_corr['Corr'].abs() >= threshold) & (df_corr['Var 1'] != df_corr['Var 2'])]
while len(df_corr) > 0:
    var = df_corr['Var 1'].iloc[0]
    df_corr = df_corr[((df_corr['Var 1'] != var) & (df_corr['Var 2'] != var))]
    dropvars.append(var)
df.drop(columns=dropvars, inplace=True)

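My idea is as follows: first, I create a DataFrame with the columns Var 1, Var 2 and Corr, keeping only the variable pairs whose correlation is greater than or equal to my threshold (in absolute value). Then I iteratively take the first variable in this correlation DataFrame (the Var 1 value), add it to the dropvars list, and remove every row in which it appears, until the correlation DataFrame is empty. Finally, I drop the columns in the dropvars list from the original DataFrame.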

- b-shields · Answer 7

I had a similar question today and came across this post. This is what I ended up with.

def uncorrelated_features(df, threshold=0.7):
    """
    Returns a subset of df columns with Pearson correlations
    below threshold.
    """

    corr = df.corr().abs()
    keep = []
    for i in range(len(corr.iloc[:,0])):
        above = corr.iloc[:i,i]
        if len(keep) > 0: above = above[keep]
        if len(above[above < threshold]) == len(above):
            keep.append(corr.columns.values[i])

    return df[keep]

- Karim Djedidi · Answer 8

You can use the following code:

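import numpy as np

l = []
corr_matrix = df.corr().abs()

# Collect every column that is highly correlated with at least one other column
for ci in corr_matrix.columns:
    for cj in corr_matrix.columns:
        if corr_matrix[ci][cj] > 0.8 and ci != cj:
            l.append(ci)

# Note: this drops both members of every correlated pair
l = np.array(l)
to_drop = np.unique(l)
df.drop(to_drop, axis=1, inplace=True)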