How do I calculate the correlation between all columns and remove highly correlated ones using pandas?

87

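Before building a machine learning model, it is recommended to remove highly correlated feature columns. How can I calculate the correlation between columns and remove the columns or descriptors whose correlation exceeds a threshold of 0.8, while keeping the headers in the reduced dataset? Here is an example dataset: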

 GA      PN       PC     MBP      GR     AP   
0.033   6.652   6.681   0.194   0.874   3.177    
0.034   9.039   6.224   0.194   1.137   3.4      
0.035   10.936  10.304  1.015   0.911   4.9      
0.022   10.11   9.603   1.374   0.848   4.566    
0.035   2.963   17.156  0.599   0.823   9.406    
0.033   10.872  10.244  1.015   0.574   4.871     
0.035   21.694  22.389  1.015   0.859   9.259     
0.035   10.936  10.304  1.015   0.911   4.5       

Please help....
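As a starting point, the sample in the question can be loaded into a DataFrame and its pairwise correlation matrix computed with DataFrame.corr(). This is only a sketch: the names `raw`, `df` and `corr_matrix` are illustrative, and the values are copied from the table in the question.

```python
import io

import pandas as pd

# Sample values copied from the table in the question
raw = """GA PN PC MBP GR AP
0.033 6.652 6.681 0.194 0.874 3.177
0.034 9.039 6.224 0.194 1.137 3.4
0.035 10.936 10.304 1.015 0.911 4.9
0.022 10.11 9.603 1.374 0.848 4.566
0.035 2.963 17.156 0.599 0.823 9.406
0.033 10.872 10.244 1.015 0.574 4.871
0.035 21.694 22.389 1.015 0.859 9.259
0.035 10.936 10.304 1.015 0.911 4.5
"""

# Whitespace-separated columns; the first row supplies the headers
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Absolute Pearson correlation between every pair of columns
corr_matrix = df.corr().abs()
print(corr_matrix.round(2))
```

The result is a square, symmetric frame labelled by the original headers, so any column selected from it can be dropped from `df` by name.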


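Feature-Engine has a built-in DropCorrelatedFeatures() transformer that does the heavy lifting for you and is sklearn-compatible. The features_to_drop_ attribute shows which features it will drop. - kevin_theinfinityfund
Related: this answer implements R's findCorrelation function in pandas. It identifies correlated columns and returns the labels of all but one of them. The existing answers here drop all of the correlated columns, which means too many columns are dropped.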
28 Answers

79

The method here worked well for me, in only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

import numpy as np

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features 
df.drop(to_drop, axis=1, inplace=True)

11
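Isn't this flawed? The first column is always dropped even though it might not be highly correlated with any other column. When the upper triangle is selected, none of the first column's values remain. - Sushant Kulkarni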
1
Did you print corr_matrix to see what it looks like? - Cherry Wu
3
I got an error when dropping the selected features; the following code worked for me: df.drop(to_drop, axis=1, inplace=True) - Ikbel
4
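As of writing this comment, this seems to work fine. I cross-checked across different thresholds using the methods given in other answers, and the results were identical. Thanks! - Sunit Gautam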
1
It should be corr_matrix.where((np.triu(np.ones(corr_matrix.shape), k=1) + np.tril(np.ones(corr_matrix.shape), k=-1)).astype(bool)). Your code doesn't consider the first column at all. - Mehran
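The concern raised in these comments can be checked by comparing the two masks on a small frame. The sketch below is an illustration rather than part of the answer; the helper name `columns_to_drop` and the toy data are assumptions. With the strictly upper triangle (k=1), the first column has no entries above the diagonal, so it is never selected. With the full off-diagonal mask (upper plus lower triangle), both members of a correlated pair are selected.

```python
import numpy as np
import pandas as pd


def columns_to_drop(corr, mask, threshold=0.95):
    """Return the columns with any masked correlation above the threshold."""
    visible = corr.where(mask)  # entries outside the mask become NaN
    return [col for col in visible.columns if (visible[col] > threshold).any()]


# Toy data: "a" and "b" are almost identical, "c" is unrelated to both
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "b": [1, 2, 3, 4, 5, 6, 7, 7, 7],
    "c": [1, 0, 1, 0, 1, 0, 1, 0, 1],
})
corr = df.corr().abs()

# Strictly upper triangle, as in the answer
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
# Upper plus lower triangle, i.e. everything except the diagonal
off_diagonal_mask = ~np.eye(len(corr), dtype=bool)

print(columns_to_drop(corr, upper_mask))         # only the later column of the pair
print(columns_to_drop(corr, off_diagonal_mask))  # both columns of the pair
```

On this data the upper-triangle mask keeps "a" and drops only "b", which is usually what is wanted when one representative of each correlated pair should survive.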

50

This is the approach I use:

def correlation(dataset, threshold):
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname] # deleting the column from the dataset

    print(dataset)

Hope this helps!


11
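I think this solution fails in the following general case: say you have columns c1, c2 and c3. c1 and c2 are correlated above the threshold, and so are c2 and c3. With this solution, both c2 and c3 are dropped even if c3 is not correlated with c1 above the threshold. I suggest changing "if corr_matrix.iloc[i, j] >= threshold:" to "if corr_matrix.iloc[i, j] >= threshold and (corr_matrix.columns[j] not in col_corr):" - vcovo
If c1 and c2 are correlated and c2 and c3 are correlated, then c1 and c3 are very likely correlated too. If that is not the case, though, I think your suggested change to the code is correct. - NISHA DAGA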
1
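They are most likely correlated, but not necessarily above the same threshold. This led to a significant difference in the columns removed for my use case: after adding the extra condition mentioned in the first comment, I ended up with 218 columns instead of 180. - vcovo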
3
Makes sense. I have updated the code as you suggested. - NISHA DAGA
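The c1/c2/c3 case from these comments can be reproduced on a hand-written correlation matrix. This is a minimal sketch: the function name `select_drops`, its flag, and the matrix values are made up for illustration, and the loop only mirrors the structure of the answer's nested loop.

```python
import pandas as pd


def select_drops(corr, threshold, skip_if_partner_dropped):
    """Collect the columns the nested loop would drop."""
    dropped = set()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i):
            if corr.iloc[i, j] < threshold:
                continue
            if skip_if_partner_dropped and cols[j] in dropped:
                continue  # the partner is already gone, so this pair is resolved
            dropped.add(cols[i])
    return dropped


# c1~c2 and c2~c3 exceed 0.8, but c1~c3 does not
names = ["c1", "c2", "c3"]
corr = pd.DataFrame(
    [[1.00, 0.85, 0.60],
     [0.85, 1.00, 0.85],
     [0.60, 0.85, 1.00]],
    index=names, columns=names,
)

print(select_drops(corr, 0.8, skip_if_partner_dropped=False))  # c2 and c3
print(select_drops(corr, 0.8, skip_if_partner_dropped=True))   # c2 only
```

With the extra condition, c3 survives because its only highly correlated partner, c2, has already been removed.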
2
Shouldn't you use the absolute value of the correlation matrix? - hipoglucido
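A short check shows why the absolute value matters: a strongly negative correlation never passes a ">= threshold" test on the signed matrix. The column names and values below are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "neg_x": [-1, -2, -3, -4, -5, -7],  # moves almost exactly opposite to "x"
    "other": [1, 0, 1, 0, 1, 0],
})

threshold = 0.8
signed = df.corr()       # values in [-1, 1]
absolute = signed.abs()  # values in [0, 1]

# The signed value is close to -1, so it fails the threshold test
missed = signed.loc["x", "neg_x"] >= threshold
# The absolute value is close to 1, so the pair is detected
caught = absolute.loc["x", "neg_x"] >= threshold
print(missed, caught)
```

Calling dataset.corr().abs() instead of dataset.corr() in the function above would treat such pairs as redundant as well.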

12

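Here is an Auto ML class I created to eliminate multicollinearity between features.

What makes my code unique is that, out of two features with high correlation, I drop the one that is least correlated with the target! I got the idea from this seminar by Mr. Vishal Patel: https://www.youtube.com/watch?v=ioXKxulmwVQ&feature=youtu.be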

import pandas as pd

#Feature selection class to eliminate multicollinearity
class MultiCollinearityEliminator():
    
    #Class Constructor
    def __init__(self, df, target, threshold):
        self.df = df
        self.target = target
        self.threshold = threshold

    #Method to create and return the feature correlation matrix dataframe
    def createCorrMatrix(self, include_target = False):
        #Checking we should include the target in the correlation matrix
        if (include_target == False):
            df_temp = self.df.drop([self.target], axis =1)
            
            #Setting method to Pearson to prevent issues in case the default method for df.corr() gets changed
            #Setting min_period to 30 for the sample size to be statistically significant (normal) according to 
            #central limit theorem
            corrMatrix = df_temp.corr(method='pearson', min_periods=30).abs()
        #Target is included for creating the series of feature to target correlation - Please refer the notes under the 
        #print statement to understand why we create the series of feature to target correlation
        elif (include_target == True):
            corrMatrix = self.df.corr(method='pearson', min_periods=30).abs()
        return corrMatrix

    #Method to create and return the feature to target correlation matrix dataframe
    def createCorrMatrixWithTarget(self):
        #After obtaining the list of correlated features, this method will help to view which variables 
        #(in the list of correlated features) are least correlated with the target
        #This way, out of the list of correlated features, we can ensure we eliminate the feature that is 
        #least correlated with the target
        #This not only helps to sustain the predictive power of the model but also helps in reducing model complexity
        
        #Obtaining the correlation matrix of the dataframe (along with the target)
        corrMatrix = self.createCorrMatrix(include_target = True)                           
        #Creating the required dataframe, then dropping the target row 
        #and sorting by the value of correlation with target (in ascending order)
        corrWithTarget = pd.DataFrame(corrMatrix.loc[:,self.target]).drop([self.target], axis = 0).sort_values(by = self.target)                    
        print(corrWithTarget, '\n')
        return corrWithTarget

    #Method to create and return the list of correlated features
    def createCorrelatedFeaturesList(self):
        #Obtaining the correlation matrix of the dataframe (without the target)
        corrMatrix = self.createCorrMatrix(include_target = False)                          
        colCorr = []
        #Iterating through the columns of the correlation matrix dataframe
        for column in corrMatrix.columns:
            #Iterating through the values (row wise) of the correlation matrix dataframe
            for idx, row in corrMatrix.iterrows():                                            
                if(row[column]>self.threshold) and (row[column]<1):
                    #Adding the features that are not already in the list of correlated features
                    if (idx not in colCorr):
                        colCorr.append(idx)
                    if (column not in colCorr):
                        colCorr.append(column)
        print(colCorr, '\n')
        return colCorr

    #Method to eliminate the least important features from the list of correlated features
    def deleteFeatures(self, colCorr):
        #Obtaining the feature to target correlation matrix dataframe
        corrWithTarget = self.createCorrMatrixWithTarget()                                  
        for idx, row in corrWithTarget.iterrows():
            print(idx, '\n')
            if (idx in colCorr):
                self.df = self.df.drop(idx, axis =1)
                break
        return self.df

    #Method to run automatically eliminate multicollinearity
    def autoEliminateMulticollinearity(self):
        #Obtaining the list of correlated features
        colCorr = self.createCorrelatedFeaturesList()                                       
        while colCorr != []:
            #Obtaining the dataframe after deleting the feature (from the list of correlated features) 
            #that is least correlated with the target
            self.df = self.deleteFeatures(colCorr)
            #Obtaining the list of correlated features
            colCorr = self.createCorrelatedFeaturesList()                                     
        return self.df

Can you provide an example of how to use this? - mjoy
@mjoy Here is an example: my_eliminator = MultiCollinearityEliminator(df, 'my_target', 0.95). You can then call: cleaned_df_no_multi_collinearity = my_eliminator.autoEliminateMulticollinearity(). Note: the dataframe df must contain the target column 'my_target'. - JejeBelfort
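The idea of keeping whichever feature of a correlated pair is more strongly related to the target can also be sketched as a single function. This is a simplified variation, not the class above: the function name `drop_weaker_of_pair` is hypothetical, it resolves the most correlated pair first, and it does not use min_periods.

```python
import numpy as np
import pandas as pd


def drop_weaker_of_pair(df, target, threshold=0.8):
    """Repeatedly drop, from the most correlated feature pair, the feature
    that is less correlated with the target."""
    features = df.drop(columns=[target])
    target_corr = features.corrwith(df[target]).abs()
    dropped = []
    while features.shape[1] > 1:
        corr = features.corr().abs()
        # Keep each pair once by looking only above the diagonal
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        pairs = upper.stack().dropna()
        if pairs.empty or pairs.max() <= threshold:
            break
        col_a, col_b = pairs.idxmax()
        weaker = col_a if target_corr[col_a] < target_corr[col_b] else col_b
        features = features.drop(columns=[weaker])
        dropped.append(weaker)
    return features.join(df[target]), dropped


# Toy data: "b" is nearly a copy of "a"; the target "y" equals "a"
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [1, 2, 3, 4, 5, 6, 7, 9],
    "c": [1, 0, 1, 0, 1, 0, 1, 0],
    "y": [1, 2, 3, 4, 5, 6, 7, 8],
})
reduced, dropped = drop_weaker_of_pair(df, target="y", threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

Here "a" and "b" form the only pair above the threshold, and "b" is removed because it tracks the target slightly less closely than "a".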

10

I found the answer provided by TomDobbs quite useful, but it does not work as intended. It has two problems:

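  • It misses the last pair of variables in each row/column of the correlation matrix.
  • It fails to remove one of each pair of collinear variables from the returned dataframe.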

My revised version below corrects these issues:

def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold. Removing collinear features can help a model 
        to generalize and improves the interpretability of the model.

    Inputs: 
        x: features dataframe
        threshold: features with correlations greater than this value are removed

    Output: 
        dataframe that contains only the non-highly-collinear features
    '''

    # Calculate the correlation matrix
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i+1):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)

            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])

    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    x = x.drop(columns=drops)

    return x

1
I really like it! I used it for a model I'm building and it is really easy to understand - thanks a ton. - SQLGIT_GeekInTraining

9

You can test the code below:

# Load libraries
import pandas as pd
import numpy as np
# Create feature matrix with two highly correlated features

X = np.array([[1, 1, 1],
          [2, 2, 0],
          [3, 3, 1],
          [4, 4, 0],
          [5, 5, 1],
          [6, 6, 0],
          [7, 7, 1],
          [8, 7, 0],
          [9, 7, 1]])

# Convert feature matrix into DataFrame
df = pd.DataFrame(X)

# View the data frame
df

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features 
df.drop(columns=to_drop)

3
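While this code may provide a solution to the question, it's better to add context explaining why/how it works. This helps future users learn and apply that knowledge to their own code. You are also likely to get positive feedback from users in the form of upvotes when the code is explained. - borchvm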

8
For a given dataframe df, you can use the following:
corr_matrix = df.corr().abs()
high_corr_var=np.where(corr_matrix>0.8)
high_corr_var=[(corr_matrix.columns[x],corr_matrix.columns[y]) for x,y in zip(*high_corr_var) if x!=y and x<y]

2
This did not work for me. Please consider rewriting your solution as a method. Error: "ValueError: too many values to unpack (expected 2)". - MyopicVisage
1
It should instead be high_corr_var=[(corr_matrix.index[x],corr_matrix.columns[y]) for x,y in zip(*high_corr_var) if x!=y and x<y] - Jeru Luke
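Following the suggestion to write this as a method, the pair listing can be wrapped in a function that also returns the correlation value for each pair. This is one possible sketch, assuming a purely numeric frame; the function name `correlated_pairs` and the toy data are illustrative.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return a Series of |correlation| values, indexed by (column, column),
    for every pair of columns above the threshold."""
    corr = df.corr().abs()
    # Keep only entries above the diagonal so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: "a", "b" and "c" move together (or oppositely); "d" does not
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [-1, -2, -3, -4, -5, -7],
    "d": [1, 0, 1, 0, 1, 0],
})
pairs = correlated_pairs(df, threshold=0.8)
print(pairs)
```

The index of the returned Series holds the column-name pairs, so `list(pairs.index)` gives the same kind of list as high_corr_var.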

6
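First, I'd suggest using a dimensionality-reduction method such as PCA, but if you have to roll your own, your question is insufficiently constrained. When two columns are correlated, which one do you want to remove? What if column A is correlated with column B, and column B with column C, but column A is not correlated with column C?
You can get a pairwise matrix of correlations by calling DataFrame.corr() (docs), which might help you develop your algorithm, but eventually you need to convert that into a list of columns to keep.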
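For comparison, a bare-bones PCA projection can be written with NumPy alone. This is only a sketch under simplifying assumptions: the columns are centered but not scaled, the function name `pca_project` is made up, and the resulting components replace the original headers, which the question wanted to keep.

```python
import numpy as np
import pandas as pd


def pca_project(df, n_components):
    """Project the centered columns of df onto the first principal components."""
    X = df.to_numpy(dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ vt[:n_components].T
    names = [f"PC{k + 1}" for k in range(n_components)]
    return pd.DataFrame(scores, index=df.index, columns=names)


# Hypothetical toy data: "b" is nearly a multiple of "a"
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
    "c": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})
components = pca_project(df, n_components=2)
print(components.cov().round(6))  # off-diagonal entries are numerically zero
```

The components are uncorrelated by construction, but each one mixes all of the original columns, so this does not answer which original column to keep.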

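While I totally agree with your reasoning, this does not really answer the question. PCA is a more advanced concept for dimensionality reduction. But note that using correlations does work, and the question is reasonable (though definitely lacking research effort, in my opinion). - cel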
1
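@Jamie bull Thanks for your kind reply. Before moving on to advanced techniques such as dimensionality reduction (e.g. PCA) or feature selection methods (e.g. tree-based or SVM-based feature elimination), it is always suggested to remove useless features with basic techniques (such as variance or correlation calculations), which I learned from various published works. As for the second part of your comment, "computing correlations by calling DataFrame.corr()" would be helpful in my case. - jax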
2
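@jax, "it is always suggested to remove useless features with basic techniques": this is not entirely true. There are various methods that do not require such preprocessing steps. - cel
@cel OK, I am actually following some published work, so they suggest the preprocessing steps. Could you recommend any method that doesn't need such preprocessing steps? Thanks. - jax
Here is a link to a discussion on whether correlated variables should be removed before PCA. The question is whether they are correlated because they affect each other or are affected by a third underlying feature, in which case there is a reason to remove one of them. Or they are correlated without being truly related, in which case there is a reason to keep both. This depends on an understanding of the variables, so it is not easy to do algorithmically. - Jamie Bull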
1
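@JamieBull Thanks for your reply. I had already visited the link you suggested before posting this. If you read the question carefully, you'll see that post covers only half of the answer, but I have already read a lot and hopefully will post my own answer soon. Thanks a lot for your support and interest. - jax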

5

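I took the liberty of modifying TomDobbs' answer. The bug reported in the comments is now fixed. In addition, the new function also filters out negative correlations.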

def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity, and aids in generalizing the model.
    Inputs:
          df: features df (x)
          corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        x = x.drop(col, axis=1)
    return x

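The loop here skips the first two columns of corr_matrix, so the correlation between col1 and col2 isn't considered; other than that, it looks fine. - Ryan
@Ryan How did you fix that? - poPYtheSailor
@poPYtheSailor Please see my posted solution. - Ryan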

3

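Plug your features dataframe into this function and set your correlation threshold. It drops the columns automatically, but it also prints a diagnostic of the columns it drops in case you want to do it manually.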

def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity, and aids in generalizing the model.
    Inputs:
          df: features df (x)
          corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if val >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        df = x.drop(col, axis=1)

    return df

8
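It doesn't seem to work for me. The correlations are found and the pairs matching the threshold are printed, but the resulting dataframe is missing only one variable (the first one) that has a high correlation. - n1k31t4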
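The behaviour described in this comment is consistent with the final loop of the function, which assigns `df = x.drop(...)` from the unchanged `x` on every pass, so only the last drop survives. A minimal sketch with invented column names shows the difference between that pattern and an accumulating one.

```python
import pandas as pd

x = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
to_remove = ["b", "c"]

# Pattern used in the function: every pass starts again from the full x,
# so only the column dropped in the last pass is actually missing
for col in to_remove:
    df = x.drop(col, axis=1)
print(list(df.columns))  # "b" is still present

# Accumulating pattern: every pass drops from the previous result
reduced = x
for col in to_remove:
    reduced = reduced.drop(col, axis=1)
print(list(reduced.columns))  # both "b" and "c" are gone
```

Passing the whole list at once, as in x.drop(columns=to_remove), gives the same result as the accumulating loop.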

3

I know there are already a lot of answers to this, but I found the following approach very simple and short:


# Get correlation matrix 
corr = X.corr()

# Create a mask for values above 90% 
# But also below 100%, to exclude each variable's correlation with itself
mask = (X.corr() > 0.9) & (X.corr() < 1.0)
high_corr = corr[mask]

# Create a new column mask using any() and ~
col_to_filter_out = ~high_corr[mask].any()

# Apply new mask
X_clean = X[high_corr.columns[col_to_filter_out]]

# Visualize cleaned dataset
X_clean
