Pandas - Compute z-score for all columns

Question

79

I have a dataframe containing a single column of IDs; all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it:

ID      Age    BMI    Risk Factor
PT 6    48     19.3    4
PT 8    43     20.9    NaN
PT 2    39     18.1    3
PT 9    41     19.5    NaN

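Some of my columns contain NaN values which I do not want to include in the z-score calculations, so I intend to use a solution offered to this question: how to zscore normalize pandas column with nans?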

df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)

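I'm interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using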

df2.to_excel("Z-Scores.xlsx")

So basically: how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?

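SIDENOTE: there is a concept in pandas called "indexing" which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.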

- Slavatron

9 Answers

94

Build a list from the columns and remove the column you don't want to calculate the z-score for:

In [66]:
cols = list(df.columns)
cols.remove('ID')
df[cols]

Out[66]:
   Age  BMI  Risk  Factor
0    6   48  19.3       4
1    8   43  20.9     NaN
2    2   39  18.1       3
3    9   41  19.5     NaN
In [68]:
# now iterate over the remaining columns and create a new zscore column
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
df
Out[68]:
   ID  Age  BMI  Risk  Factor  Age_zscore  BMI_zscore  Risk_zscore  \
0  PT    6   48  19.3       4   -0.093250    1.569614    -0.150946   
1  PT    8   43  20.9     NaN    0.652753    0.074744     1.459148   
2  PT    2   39  18.1       3   -1.585258   -1.121153    -1.358517   
3  PT    9   41  19.5     NaN    1.025755   -0.523205     0.050315   

   Factor_zscore  
0              1  
1            NaN  
2             -1  
3            NaN
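
The question also asks for a separate dataframe rather than extra columns on the original. A minimal sketch of that variant follows; the sample frame is rebuilt by hand from the question's subset and is only an assumption about the real data. It relies on the fact that pandas `mean()` and `std()` skip NaN by default, so missing values stay NaN in the result without affecting the other rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the subset shown in the question
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

# Every column except the ID column
cols = df.columns.drop("ID")

# mean() and std() ignore NaN by default; ddof=0 gives the population std
df2 = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
df2 = df2.add_suffix("_zscore")

# Put the ID column back in front so rows stay identifiable
df2.insert(0, "ID", df["ID"])
```

From here, `df2.to_excel("Z-Scores.xlsx")` from the question should work as long as an Excel writer engine is installed.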

- EdChum

29

If you want to calculate the z-score for all of the columns, you can just use the following:

df_zscore = (df - df.mean())/df.std()

- Joe Bathelt

6

@pitosalas: @ascripter, you are right. Using df.std(ddof=0) gives the same result as df.apply(scipy.stats.zscore). - roob
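
To make the `ddof` point concrete, here is a small sketch on made-up numbers (the frame is purely illustrative). pandas `std()` defaults to `ddof=1` (sample standard deviation), while `scipy.stats.zscore` defaults to `ddof=0` (population standard deviation), so the two approaches only agree once `ddof=0` is passed to pandas.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up numeric data, purely for illustration
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

sample = (df - df.mean()) / df.std()            # pandas default: ddof=1
population = (df - df.mean()) / df.std(ddof=0)  # matches scipy's default
scipy_result = df.apply(zscore)

print(np.allclose(population, scipy_result))
print(np.allclose(sample, scipy_result))
```

With n rows the two versions differ by a constant factor of sqrt(n / (n - 1)), which matters for small samples and becomes negligible for large ones.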

8

Here's another way of getting z-scores, using a custom function:

In [6]: import pandas as pd; import numpy as np

In [7]: np.random.seed(0) # Fixes the random seed

In [8]: df = pd.DataFrame(np.random.randn(5,3), columns=["randomA", "randomB","randomC"])

In [9]: df # watch output of dataframe
Out[9]:
    randomA   randomB   randomC
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

## Create custom function to compute Zscore 
In [10]: def z_score(df):
   ....:         # standardize first, then rename, so the caller's DataFrame keeps its column names
   ....:         return ((df - df.mean())/df.std(ddof=0)).add_suffix("_zscore")
   ....:

## make sure you filter or select columns of interest before passing dataframe to function
In [11]: z_score(df) # compute Zscore
Out[11]:
   randomA_zscore  randomB_zscore  randomC_zscore
0        0.798350       -0.106335        0.731041
1        1.505002        1.939828       -1.577295
2       -0.407899       -0.875374       -0.545799
3       -1.207392       -0.463464        1.292230
4       -0.688061       -0.494655        0.099824

Results reproduced using scipy.stats zscore

In [12]: from scipy.stats import zscore

In [13]: df.apply(zscore) # (Credit: Manuel)
Out[13]:
    randomA   randomB   randomC
0  0.798350 -0.106335  0.731041
1  1.505002  1.939828 -1.577295
2 -0.407899 -0.875374 -0.545799
3 -1.207392 -0.463464  1.292230
4 -0.688061 -0.494655  0.099824

- Surya

7

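For z-scores, we can stick to the documentation instead of using the 'apply' function.

from scipy.stats import zscore
# cols: list of numeric column names; axis=0 standardizes each column (axis=1 would standardize each row)
df_zscore = zscore(df[cols].to_numpy(), axis=0)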
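
Since the question's data contains NaN, it is worth seeing how `scipy.stats.zscore` behaves on an array with missing values. The sketch below uses a hand-built array based on the question's sample rows (an assumption about the real data). By default NaN propagates, so a column containing any NaN comes back entirely NaN; `nan_policy="omit"` computes the mean and standard deviation from the non-NaN values only.

```python
import numpy as np
from scipy.stats import zscore

# Columns: Age, BMI, Risk Factor (values taken from the question's sample)
values = np.array([
    [48, 19.3, 4.0],
    [43, 20.9, np.nan],
    [39, 18.1, 3.0],
    [41, 19.5, np.nan],
])

# Default nan_policy="propagate": the third column becomes all NaN
propagated = zscore(values, axis=0)

# nan_policy="omit": NaN entries stay NaN, the other rows are standardized
omitted = zscore(values, axis=0, nan_policy="omit")

print(propagated[:, 2])
print(omitted[:, 2])
```

The result is a plain numpy array, so column names and the row index have to be reattached if a DataFrame is wanted.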

- ibozkurt79

4

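Almost a one-liner solution:

# .ix was removed in pandas 1.0; .iloc selects every column after the first by position
df2 = (df.iloc[:, 1:] - df.iloc[:, 1:].mean()) / df.iloc[:, 1:].std()
df2['ID'] = df['ID']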

- Josh Chartier

2

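`stats.zscore` from scipy

stats.zscore (mentioned in Manuel's answer) works on a DataFrame / 2D array, so there is no need to call it through apply() (apply is syntactic sugar for a Python for-loop, so it becomes noticeably slower when there are many columns¹). Syntax-wise, simply call zscore on the DataFrame.

import numpy as np
import pandas as pd
from scipy import stats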
df = pd.DataFrame([[0,1,2],[3,3,5],[5,6,100]]).add_prefix('col')
zscore_df = stats.zscore(df)

If only some of the columns need to be standardized, select those columns and compute the z-scores.

stats.zscore(df[['col0', 'col2']])

You can verify that this indeed returns the same DataFrame as applying zscore to each column and as the manual computation ((df - df.mean())/df.std(ddof=0)).

x = stats.zscore(df)
y = df.apply(stats.zscore)
z = (df - df.mean()) / df.std(ddof=0)
np.allclose(x, y) and np.allclose(x, z)  # True

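`StandardScaler` from `scikit-learn`

Another option is to use StandardScaler() from scikit-learn. Simply instantiate StandardScaler and call fit_transform with the relevant columns as input. The result is a numpy array, which you can assign back to the dataframe as new columns (or work on the array itself, etc.).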

from sklearn.preprocessing import StandardScaler

cols = ['col1', 'col2']
new_cols = [f"{c}_zscore" for c in cols]

sc = StandardScaler()
df[new_cols] = sc.fit_transform(df[cols])
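
If a separate dataframe is preferred over new columns, the array can be wrapped back into a DataFrame. This is a minimal sketch on a hypothetical frame built from the question's sample; it assumes a scikit-learn version in which `StandardScaler` disregards NaN when fitting and keeps it in the transformed output (documented behaviour in current releases). `StandardScaler` uses the population standard deviation, i.e. the equivalent of `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame based on the question's sample rows
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48.0, 43.0, 39.0, 41.0],
    "Risk Factor": [4.0, np.nan, 3.0, np.nan],
})

cols = ["Age", "Risk Factor"]
scaled = StandardScaler().fit_transform(df[cols])

# Reattach column names and the original row index to the numpy array
z = pd.DataFrame(scaled, columns=[f"{c}_zscore" for c in cols], index=df.index)
```

Passing `index=df.index` keeps the rows aligned with the original frame, which matters if `df` has been filtered or re-sorted beforehand.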

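¹ A timeit test shows that, for a DataFrame with 100 columns, calling zscore directly on the DataFrame is about 30 times faster than calling it on each column via apply(). Moreover, the direct computation mentioned in Joe Bathelt's answer is actually the fastest of the options tested.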

import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(np.random.default_rng(0).choice(100, size=(1000, 100))).add_prefix('col')

%timeit df.apply(stats.zscore)
# 105 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit stats.zscore(df)
# 3.63 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.sub(df.mean()).div(df.std(ddof=0))
# 2.86 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit StandardScaler().fit_transform(df)
# 6.89 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

- cottontail

1

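Computing z-scores (or anomalies; not the same thing, but you can easily adapt this code) is more complicated when we are dealing with time series. For example, suppose you have 10 years of temperature data measured weekly. To calculate z-scores for the whole time series, you have to know the mean and standard deviation for each day of the year. So, let's get started:

Assume you have a pandas DataFrame. First of all, you need a DateTime index. If you don't have one yet but, luckily, you do have a column with dates, just make it your index. Pandas will try to guess the date format. The goal here is to have a DatetimeIndex. You can check by trying: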

type(df.index)

If you don't have one, let's create it.

df.index = pd.DatetimeIndex(df[datecolumn])
df = df.drop(datecolumn,axis=1)

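The next step is to calculate the mean and standard deviation for each group of days. For this, we use the groupby method.

# pd.groupby was removed from pandas; call groupby on the DataFrame instead
mean = df.groupby(df.index.dayofyear).mean()      # NaN values are skipped by default
std = df.groupby(df.index.dayofyear).std(ddof=0)  # ddof=0 matches np.nanstd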

Finally, we loop over all the dates, computing (value - mean)/stddev; however, as mentioned, for time series this is not so straightforward.

df2 = df.copy() #keep a copy for future comparisons 
for y in np.unique(df.index.year):
    for d in np.unique(df.index.dayofyear):
        mask = (df.index.year==y) & (df.index.dayofyear==d)
        df2[mask] = (df[mask] - mean.loc[d])/std.loc[d]  # .ix was removed; .loc looks up by label
df2.index.name = 'date' #this is just to look nicer

df2 #this is your z-score dataset.

The logic inside the for loop is: for a given year, we have to match each dayofyear to its mean and standard deviation. We run this for every year in your time series.
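
The same per-day-of-year standardization can probably be written without the explicit loops, using `groupby(...).transform`. The sketch below is one possible way to do it, not the answer's own method; it runs on synthetic daily data (the `temp` column, the date range, and the random values are all assumptions made for illustration).

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering three years; values are random placeholders
idx = pd.date_range("2010-01-01", periods=3 * 365, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"temp": rng.normal(15.0, 5.0, size=len(idx))}, index=idx)

# transform returns a result aligned with the original index, so each value is
# standardized against the mean and std of its own day-of-year group
z = df.groupby(df.index.dayofyear).transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)
z.index.name = "date"
```

Because `transform` keeps the original index, no matching of years to days is needed, and NaN values are skipped by `mean()` and `std()` just as in the column-wise case.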

- Deninhos

0

To calculate a z-score for an entire column quickly, do as follows:

from scipy.stats import zscore
import pandas as pd

df = pd.DataFrame({'num_1': [1,2,3,4,5,6,7,8,9,3,4,6,5,7,3,2,9]})
df['num_1_zscore'] = zscore(df['num_1'])

display(df)

- BGG16

- Manuel · Accepted Answer

Using Scipy's zscore function:

df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
df

|    |   A |   B |   C |
|---:|----:|----:|----:|
|  0 | 163 | 163 | 159 |
|  1 | 120 | 153 | 181 |
|  2 | 130 | 199 | 108 |
|  3 | 108 | 188 | 157 |
|  4 | 109 | 171 | 119 |

from scipy.stats import zscore
df.apply(zscore)

|    |         A |         B |         C |
|---:|----------:|----------:|----------:|
|  0 |  1.83447  | -0.708023 |  0.523362 |
|  1 | -0.297482 | -1.30804  |  1.3342   |
|  2 |  0.198321 |  1.45205  | -1.35632  |
|  3 | -0.892446 |  0.792025 |  0.449649 |
|  4 | -0.842866 | -0.228007 | -0.950897 |

If not all the columns of your data frame are numeric, you can apply the z-score function only to the numeric columns using the select_dtypes function:

# Note that `select_dtypes` returns a data frame. We are selecting only the columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].apply(zscore)

|    |         A |         B |         C |
|---:|----------:|----------:|----------:|
|  0 |  1.83447  | -0.708023 |  0.523362 |
|  1 | -0.297482 | -1.30804  |  1.3342   |
|  2 |  0.198321 |  1.45205  | -1.35632  |
|  3 | -0.892446 |  0.792025 |  0.449649 |
|  4 | -0.842866 | -0.228007 | -0.950897 |
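
For a frame that mixes an identifier column with numeric columns, one way to keep everything together is to copy the frame and overwrite only the numeric columns. This is a small sketch on made-up data (the `ID`, `A`, and `B` columns are hypothetical), assuming there are no NaN values, since `zscore` propagates NaN by default.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up frame with one non-numeric column
df = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "A": [1.0, 2.0, 3.0],
    "B": [10.0, 20.0, 60.0],
})

numeric_cols = df.select_dtypes(include=[np.number]).columns

# Work on a copy so the original values remain available
result = df.copy()
result[numeric_cols] = df[numeric_cols].apply(zscore)
```

The `ID` column passes through unchanged, so `result` can be written out directly.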
