Finding the z-score of data in a test dataframe in Pandas

Question

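I have some grouped data that I have split into a training set and a test set. I want to compute z-scores. On the training set this is easy, because I can use built-in functions to compute the mean and standard deviation.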

Here is an example where I am looking for z-scores grouped by place:

import pandas as pd
import numpy as np

# my example dataframes

train = pd.DataFrame({'place':     ['Winterfell','Winterfell','Winterfell','Winterfell','Dorne', 'Dorne','Dorne'],
                      'temp' : [ 23 , 10 , 0 , -32, 90, 110, 100 ]})
test  = pd.DataFrame({'place': ['Winterfell', 'Winterfell', 'Dorne'],
                      'temp' : [6, -8, 100]})

# get the z-scores by group for the training set
train.loc[: , 'z' ] = train.groupby('place')['temp'].transform(lambda x: (x - x.mean()) / x.std())

The training dataframe now looks like this:

| place      | temp | z      |
|------------|------|--------|
| Winterfell | 23   | 0.969  |
| Winterfell | 10   | 0.415  |
| Winterfell | 0    | -0.011 |
| Winterfell | -32  | -1.374 |
| Dorne      | 90   | -1.000 |
| Dorne      | 110  | 1.000  |
| Dorne      | 100  | 0.000  |

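This is exactly what I want.

The problem is that I now want to compute z-scores on the test set using the means and standard deviations from the training set. I can get those means and standard deviations easily: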

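# string names select pandas' own mean and sample std (ddof=1) and avoid the deprecation warning for np.mean / np.std
summary = train.groupby('place').agg({'temp': ['mean', 'std']}).xs('temp', axis=1, drop_level=True)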

print(summary)

          mean        std
place                        
Dorne       100.00  10.000000
Winterfell    0.25  23.471614

I have some convoluted ways of doing what I want, but since this is a task I have to do often, I am looking for a tidy way to do it. Here is what I have tried so far:

1. Making a dictionary `dict` out of the summary table, from which I can extract the mean and standard deviation as a tuple. Then on the test set, I can do an apply:
```
test.loc[: , 'z'] = test.apply(lambda row: (row.temp - dict[row.place][0]) / dict[row.place][1] ,axis = 1)
```

Why I don't like it:

- The dictionary makes it hard to read; you need to know the structure of `dict`.
- If a place appears in the test set but not in the training set, the code throws an error instead of returning NaN.

2. Using an index
```
test.set_index('place', inplace = True)
test.loc[:, 'z'] = (test['temp'] - summary['mean'])/summary['std']
```

Why I don't like it:

- It looks like it should work, but it actually gives me only NaNs.

Bottom line: is there a standard Pythonic way of doing this sort of combination?

- Damien Martin

This answer might help you: https://dev59.com/rWAf5IYBdhLWcg3wVxcp - walker_4

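Thanks! I saw that one while writing up my solution, though it focuses on computing z-scores from within a dataframe rather than using means from a separate dataframe. The time series example is close to what I am looking for. - Damien Martin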

1 Answer

- piRSquared · Accepted Answer

Option 1
pd.Series.map

test.assign(z=
    (test.temp - test.place.map(summary['mean'])) / test.place.map(summary['std'])
)

        place  temp         z
0  Winterfell     6  0.244977
1  Winterfell    -8 -0.351488
2       Dorne   100  0.000000
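
The question also asks what happens when a place shows up in the test set but not in the training set. As a minimal sketch of how Option 1 handles that: `Series.map` with a Series argument looks each value up in that Series' index and yields NaN when the key is missing, so no error is raised. The place name `'Braavos'` and its temperature are made up for illustration and are not part of the original data.

```python
import pandas as pd

train = pd.DataFrame({
    'place': ['Winterfell', 'Winterfell', 'Winterfell', 'Winterfell',
              'Dorne', 'Dorne', 'Dorne'],
    'temp': [23, 10, 0, -32, 90, 110, 100],
})
# 'Braavos' is a hypothetical place that never appears in the training set
test = pd.DataFrame({'place': ['Winterfell', 'Braavos', 'Dorne'],
                     'temp': [6, 15, 100]})

# per-place mean and sample standard deviation from the training set only
summary = train.groupby('place')['temp'].agg(['mean', 'std'])

# map looks each place up in summary's index; unknown places become NaN
train_mean = test['place'].map(summary['mean'])
train_std = test['place'].map(summary['std'])
result = test.assign(z=(test['temp'] - train_mean) / train_std)
print(result)
```

The `'Braavos'` row ends up with `z` equal to NaN, while the other rows are scaled with the training statistics.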

Option 2
pd.DataFrame.eval

test.assign(z=
    test.join(summary, on='place').eval('(temp - mean) / std')
)

        place  temp         z
0  Winterfell     6  0.244977
1  Winterfell    -8 -0.351488
2       Dorne   100  0.000000
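
Since the question describes this as a recurring task, the join-based idea from Option 2 could be wrapped in a small helper. This is only a sketch: the function name `zscore_from_train` and the temporary column names `_train_mean` and `_train_std` are hypothetical and chosen here for illustration.

```python
import pandas as pd

def zscore_from_train(train, test, group_col, value_col):
    """Return a copy of test with a 'z' column scaled by train's per-group stats."""
    stats = train.groupby(group_col)[value_col].agg(['mean', 'std'])
    # hypothetical temporary names, to avoid clashing with columns already in test
    stats.columns = ['_train_mean', '_train_std']
    # join matches test[group_col] against the index of stats (a left join),
    # so every test row is kept and groups unseen in train get NaN statistics
    joined = test.join(stats, on=group_col)
    z = (joined[value_col] - joined['_train_mean']) / joined['_train_std']
    return test.assign(z=z)

train = pd.DataFrame({
    'place': ['Winterfell', 'Winterfell', 'Winterfell', 'Winterfell',
              'Dorne', 'Dorne', 'Dorne'],
    'temp': [23, 10, 0, -32, 90, 110, 100],
})
test = pd.DataFrame({'place': ['Winterfell', 'Winterfell', 'Dorne'],
                     'temp': [6, -8, 100]})

scored = zscore_from_train(train, test, 'place', 'temp')
print(scored)
```

The helper leaves both input dataframes unchanged and returns a new one, which keeps it safe to call repeatedly on different train/test splits.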