Python - Calculate the standard deviation (row level) of DataFrame columns

Question

6

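I have created a Pandas DataFrame and can determine the standard deviation of one or more of its columns (column level). I need to determine the standard deviation across all the rows for a particular column. Below are the commands I have tried so far: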

# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()

salary         8.194421e-01
num_months     3.690081e+05
no_of_hours    2.518869e+02

# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)

# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()

salary         8.194421e-01

# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)

0       4.374107e+12
1       4.377543e+12
2       4.374026e+12
3       4.374046e+12
4       4.374112e+12
5       4.373926e+12

When I run the command below, I get "NaN" for every record. Is there a way to resolve this?

# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN

- JKC

1

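Not sure what "the standard deviation of all rows for one column" means. Wouldn't that just be the standard deviation of that column, which is a single scalar rather than a column? Could you post the code that generates the DataFrame, and say which columns/rows you want the standard deviation of? - Indominus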

2

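You are computing the standard deviation of a single number (one column, row by row)... what result would you expect? It is NaN because it divides by N-1, where N is 1. - filippo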

@filippo Sorry, I didn't know why it was returning NaN. Now I understand. Thanks for your suggestion. - JKC

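@Indominus Exactly. If we take std over only one column, it returns just a scalar. As jezrael explained, I have to combine it with another column to get a proper row-level std value. - JKC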

1

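@JKC No need to apologize ;-) Maybe I sounded too harsh. I meant that it wasn't clear from your question whether you were asking about the NaN or hadn't noticed that you were computing the standard deviation of a single sample. Glad it's solved now! - filippo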

1 Answer

Content provided by Stack Overflow.

- jezrael · Accepted Answer

This is expected, because if you check DataFrame.std:

Normalized by N-1 by default. This can be changed using the ddof argument.

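If you have only one element, you are dividing by 0. So if you have a single column and ask for the sample standard deviation across columns (axis=1), you get all missing values.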
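To make the division by N-1 concrete, here is a minimal sketch in plain Python. It is not how pandas implements std internally; `manual_std` is a hypothetical helper written only to illustrate the divisor N - ddof.

```python
import math

def manual_std(values, ddof=1):
    """Standard deviation with divisor N - ddof (hypothetical helper).

    Returns NaN when the divisor is not positive, which mirrors the
    NaN seen for a single value with the default ddof=1.
    """
    n = len(values)
    if n - ddof <= 0:
        return float("nan")
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - ddof))

single = manual_std([10])              # one value per row: divisor is 1 - 1 = 0
pair = manual_std([10, 2])             # two values per row: divisor is 2 - 1 = 1
population = manual_std([10], ddof=0)  # divisor is 1 - 0 = 1, spread is zero

print(single, pair, population)
```

With one value per row the divisor is zero, so no sample standard deviation is defined; with ddof=0 the same single value has a spread of exactly zero.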

Sample:

import pandas as pd

inp_df = pd.DataFrame({'salary':[10,20,30],
                       'num_months':[1,2,3],
                       'no_of_hours':[2,5,6]})
print (inp_df)
   salary  num_months  no_of_hours
0      10           1            2
1      20           2            5
2      30           3            6

Select one column with single [] to get a Series:

print (inp_df['salary'])
0    10
1    20
2    30
Name: salary, dtype: int64

Get the std of the Series - you get a scalar:

print (inp_df['salary'].std())
10.0

To get a one-column DataFrame, select the column with double []:

print (inp_df[['salary']])
   salary
0      10
1      20
2      30

Get the std of the DataFrame along the index (the default, axis=0) - you get a one-element Series:

print (inp_df[['salary']].std())
#same as
#print (inp_df[['salary']].std(axis=0))
salary    10.0
dtype: float64

Get the std of the DataFrame across columns (axis=1) - you get all NaNs:

print (inp_df[['salary']].std(axis = 1))
0   NaN
1   NaN
2   NaN
dtype: float64

If you change the default ddof=1 to ddof=0:

print (inp_df[['salary']].std(axis = 1, ddof=0))
0    0.0
1    0.0
2    0.0
dtype: float64

If you want the std over two or more columns:

#select 2 columns
print (inp_df[['salary', 'num_months']])
   salary  num_months
0      10           1
1      20           2
2      30           3

#std by index
print (inp_df[['salary','num_months']].std())
salary        10.0
num_months     1.0
dtype: float64

#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0     5.656854
1    10.606602
2    16.970563
dtype: float64