用Python计算Pearson相关性和显著性

Question

用Python计算Pearson相关性和显著性

pythonnumpystatisticsscipy

224

我正在寻找一个函数，它以两个列表作为输入，并返回Pearson相关系数和相关性的显著性。

- ariel

19个回答

120

皮尔逊相关性可以使用 NumPy的 corrcoef来计算。

import numpy
numpy.corrcoef(list1, list2)[0, 1]

- winerd

1

输出看起来很混乱，但实际上非常简单。请查看此解释 https://dev59.com/EHA75IYBdhLWcg3wMmAZ#3425548 - linqu

8

这并没有产生所需的相关性显著性对吧？ - Olivier

正如其他人所说，如果没有显著性，计算相关系数是没有意义的，而且目前的numpy库也不会返回相关系数的显著性。 - undefined

61

另一种选择是使用SciPy中的原生函数linregress，它可以计算以下内容：

slope：回归线的斜率

intercept：回归线的截距

r-value：相关系数

p-value：双侧p值，用于检验斜率是否为零的假设

stderr：估计的标准误差

以下是一个示例：

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

会给你返回：

将返回给您：

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

- Salvador Dali

2

非常好的答案 - 迄今为止最具信息量。也适用于两行pandas.DataFrame：lineregress(two_row_df)。 - dmeu

很棒的答案。如果你想一想，它也非常直观。 - pseudomonas

1

请注意：stderr是斜率的标准错误，而不是Pearson相关系数的标准错误。 - MRule

就我个人而言，rvalue 应该是皮尔逊系数。 - undefined

39

如果你不想安装SciPy，我使用了这个快速的技巧，稍微修改自Programming Collective Intelligence。

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

- Jeff Hammerbacher

2

我很惊讶地发现这与Excel、NumPy和R不一致。请参见https://dev59.com/qG865IYBdhLWcg3wLbar#7939259。 - dfrankow

2

正如另一位评论者指出的那样，这里存在一个浮点/整数错误。我认为对于整数来说，sum_y/n是整数除法。如果您使用sum_x = float(sum(x))和sum_y = float(sum(y)),它就可以工作了。 - dfrankow

@dfrankow 我认为这是因为imap无法处理浮点数。Python在num = psum - (sum_x * sum_y/n)处给出了一个TypeError: unsupported operand type(s) for -: 'itertools.imap' and 'float'的错误提示。 - alvas

4

作为一种风格注意事项，Python 不赞成这种不必要的 map 用法（而更倾向于使用列表推导式）。 - Maxim Khesin

16

就这条评论而言，需要考虑到像Scipy等库的开发人员都非常懂得数值分析。这可以避免您遇到许多常见陷阱（例如，X或Y中有非常大和非常小的数字可能会导致灾难性的计算误差）。 - finiteautomata

31

以下代码是对定义的直接解释：

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

测试：

print pearson_def([1,2,3], [1,5,7])

返回

0.981980506062

这与Excel、该计算器、SciPy（也包括NumPy）的结果一致，它们分别返回0.981980506、0.9819805060619657和0.98198050606196574。

R：

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

编辑：修复了一位评论者指出的错误。

- dfrankow

4

注意变量的类型！你遇到了一个整型和浮点型的问题。在sum(x) / len(x)中，你除的是整数而不是浮点数。因此，根据整数除法，sum([1,5,7]) / len([1,5,7]) = 13 / 3 = 4（而你想要 13. / 3. = 4.33...）。为了解决这个问题，将此行重写为 float(sum(x)) / float(len(x))（只需要一个float即可，因为Python会自动转换）。 - Piotr Migdal

你的代码无法处理以下情况：[10,10,10]，[0,0,0]或[10,10]，[10,0]。甚至是[10,10]，[10,10]。 - madCode

4

对于这些情况，相关系数未被定义。将它们输入R中会返回所有三个的"NA"。 - dfrankow

28

你也可以使用pandas.DataFrame.corr 来完成这个操作：

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

这给出了

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

- Martin Thoma

11

这只是相关性，没有显著性。 - Ivelin

11

使用Python中的Pandas计算Pearson系数：

我建议尝试这种方法，因为你的数据包含列表。通过控制台与数据进行交互和操作将变得非常容易，因为你可以可视化数据结构并按需更新它。你还可以导出数据集并保存它，从Python控制台外添加新数据以供后续分析。这段代码更简单，只包含较少的代码行。我假设你需要几行快速的代码来筛选数据以供进一步分析。

示例：

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}

import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats # For in-built method to get PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results

然而，您并没有发布数据供我查看数据集的大小或在分析之前可能需要进行的转换。

- Web Ster

你好，欢迎来到StackOverflow！尝试在你的回答开头添加一个简短的描述，说明你为什么选择这段代码以及它在这种情况下如何应用！ - Tristo

10

不是依赖于NumPy或者SciPy，我认为我的答案应该是最容易编码和理解计算皮尔逊相关系数（PCC）的步骤的方法。

import math

# Calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x)

# Calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# Calculates the PCC using both the two functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x:
        scorex.append((i - mean(x))/sampleStandardDeviation(x))

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# Multiplies both lists together into 1 list (hence zip) and sums the whole list
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

PCC的意义基本上是展示两个变量/列表之间的强相关性有多大。

需要注意的是，PCC值的范围从-1到1。 0到1之间的值表示正相关。值为0表示最高变异（完全没有相关性）。 -1到0之间的值表示负相关。

- compski

3

请注意，Python内置了sum函数。 - bfontaine

5

它在包含500多个值的2个列表上表现出令人惊奇的复杂性和缓慢的性能。 - Nikolay Fominyh

99%的答案只是浪费资源。你忽略了重要的检验。 - undefined

6

嗯，很多这些回复都有很长且难以阅读的代码...

我建议在处理数组时使用NumPy，它具有很多方便的功能。

import numpy as np

def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Normalise X and Y
   X -= X.mean(0)
   Y -= Y.mean(0)
   # Standardise X and Y
   X /= X.std(0)
   Y /= Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)

- Martin F Thomsen

虽然我非常喜欢这个答案，但我建议在函数内部复制/克隆X和Y。否则，两者都会被更改，这可能不是期望的行为。 - antonimmo

5

这是对mkh的答案的一个变体，比它和scipy.stats.pearsonr更快，使用Numba。

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)

- Turtles Are Cute

这里没有任何一个叫做"mkh'"的用户。它指的是哪个回答？ - Peter Mortensen

这里没有任何一个叫做"mkh'"的用户。它指的是什么答案？ - undefined

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Sacha · Accepted Answer

你可以看一下scipy.stats：

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

输出：

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation