Problem with std() on a Pandas groupby

Question

pandas dataframe std pandas-groupby describe

4

Could this be a bug? I get different results when I call describe() and std() on a groupby object.

import pandas as pd
import numpy as np
import random as rnd

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : 1*(np.random.randn(8)>0.5),
                   'D' : np.random.randn(8)})
df.head()

df[['C','D']].groupby(['C'],as_index=False).describe()
# This line gives the standard deviation of 'C' as 0, 0. Within each group the value of C is constant, so that makes sense.

df[['C','D']].groupby(['C'],as_index=False).std()
# This line gives the standard deviation of 'C' as 0, 1. I think this is wrong.

- OzgunBu

3 Answers

1

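My friend Mukherjees and I experimented further with this and concluded that there is a problem with std(). In the following link you can see how we show that "std() differs from .apply(np.std, ddof=1)". After noticing this, we also found the following related bug report: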

https://github.com/pandas-dev/pandas/issues/10355
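
A comparison of this kind can be sketched on fixed data. The frame below is hypothetical (it is not the data from the question), and the sketch is restricted to the value column D, where the built-in std() and a manual np.std with ddof=1 are expected to agree. It does not attempt to reproduce the discrepancy described above.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data (not from the question) so the numbers are reproducible.
df = pd.DataFrame({"C": [0, 0, 0, 1, 1], "D": [1.0, 2.0, 4.0, 3.0, 7.0]})

# Built-in grouped standard deviation of D (sample std, ddof=1 by default).
builtin = df.groupby("C")["D"].std()

# Manual per-group computation with NumPy, asking explicitly for ddof=1.
manual = df.groupby("C")["D"].apply(lambda s: np.std(s.to_numpy(), ddof=1))

# Show the two results side by side, indexed by the group key C.
print(pd.DataFrame({"builtin": builtin, "manual": manual}))
```

If the two columns match, any difference seen with as_index=False concerns the key column C rather than the aggregated values.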

- OzgunBu

-1

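Even with std(), you still get a standard deviation of zero for C within each group. I only added a seed to your code to make it reproducible. I am not sure what the problem is: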

import pandas as pd
import numpy as np
import random as rnd

np.random.seed(1987)  # call seed(); assigning to np.random.seed would replace the function without seeding
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
     'foo', 'bar', 'foo', 'foo'],
     'B' : ['one', 'one', 'two', 'three',
     'two', 'two', 'one', 'three'],
     'C' : 1*(np.random.randn(8)>0.5),
     'D' : np.random.randn(8)})
df

df[['C','D']].groupby(['C'],as_index=False).describe()

df[['C','D']].groupby(['C'],as_index=False).std()

To dig further, look at the source of describe for groupby, which is inherited from DataFrame.describe:

def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)

The code above shows that describe simply reports the result of std().
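
The claim that describe reports the same number as std() can be checked directly. This is a minimal sketch on hypothetical fixed data (not the data from the question); it compares the 'std' entry of describe() with std() for a plain Series and for column D of a grouped frame.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data (not from the question) so the numbers are reproducible.
df = pd.DataFrame({"C": [0, 0, 0, 1, 1], "D": [1.0, 2.0, 4.0, 3.0, 7.0]})

# Plain Series: the 'std' entry of describe() versus Series.std().
series_describe_std = df["D"].describe()["std"]
series_std = df["D"].std()

# Grouped: the 'std' column of describe() versus GroupBy.std() for column D.
grouped_describe_std = df.groupby("C")["D"].describe()["std"]
grouped_std = df.groupby("C")["D"].std()

print(np.isclose(series_describe_std, series_std))
print(np.allclose(grouped_describe_std, grouped_std))
```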

- Aritesh

1

I don't really see an answer to the question here. - cs95

The second row under column C is exactly what confuses me (0, 1 instead of 0, 0). Thanks for taking the time to turn this into code and run it. - OzgunBu

- cs95 · Accepted Answer

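That makes sense. In the second case, you are only computing the std of column D.

How so? That is just how groupby works.

1. Slice on C and D.
2. Group by C.
3. Call GroupBy.std.

In step 3 you did not specify any columns, so std is assumed to be computed on the column that is not a grouping key... that is, column D.

As for why you see C with 0, 1... that is because you specified as_index=False, so the C column is inserted, with values taken from the original dataframe... in this case 0 and 1.

Running the code makes this clear.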

df[['C','D']].groupby(['C']).std()

          D
C          
0  0.998201
1       NaN

When you specify as_index=False, the index you see above is inserted as a column. Contrast this with:

df[['C','D']].groupby(['C'])[['C', 'D']].std()

     C         D
C               
0  0.0  0.998201
1  NaN       NaN

This is exactly what describe gives you, and it is what you are looking for.
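
The three variants discussed in this answer can be put side by side on fixed data. This is only a sketch: the frame below is hypothetical (it is not the random data from the question), and the printed layout may differ between pandas versions.

```python
import pandas as pd

# Hypothetical fixed data (not from the question) so the results are reproducible.
df = pd.DataFrame({"C": [0, 0, 0, 1, 1], "D": [1.0, 2.0, 4.0, 3.0, 7.0]})

# Key kept as the index: only the non-key column D is aggregated.
by_index = df.groupby("C").std()

# as_index=False: the group labels are inserted as an ordinary column named C.
# That column holds the labels 0 and 1, not a standard deviation.
flat = df.groupby("C", as_index=False).std()

# Selecting C explicitly asks for the std of C within each group,
# which is 0 because C is constant inside every group.
explicit = df.groupby("C")[["C", "D"]].std()

print(by_index)
print(flat)
print(explicit)
```

Read this way, the 0, 1 in the C column of the as_index=False result are the group keys, while the zeros reported by describe are the within-group standard deviations of C.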