NumPy histogram cumulative density does not sum to 1

Question

Following a suggestion from another thread (@EnricoGiampieri's answer to cumulative distribution plots python), I wrote the following code:

# plot cumulative density function of nearest nbr distances
# evaluate the histogram
values, base = np.histogram(nearest, bins=20, density=1)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, label='data')

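I took density=1 from the np.histogram documentation, which says:

"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function."

Indeed, when plotted, the values do not sum to 1. However, I don't understand what "bins of unity width" means. When I set bins to 1, I of course get an empty plot; when I set it to the population size, the sum is not 1 (more like 0.2). When I use the suggested 40 bins, the values sum to about 0.006.

Can anyone give me some guidance? Thanks!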

- J Kelly

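Does the area sum to 1? - Paul H

I guess so. Sorry, Paul, my statistics knowledge is weak. I'm following an R example where the y-axis values range from 0 to 1 and the CDF tops out at 1. - J Kelly

The curve peaks at 0.2, but over x values from 2000 to 8000, so I think the area would be 1. - J Kelly

For me, when I had bins from np.arange(0, 1005, 10), I just needed to multiply everything by 10. I haven't checked, but it seems you only need to multiply the density by the bin width, which in my case is 10. - A.Ametov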

3 Answers

You need to make sure your bins all have width 1. That is:

np.all(np.diff(base)==1)

To achieve this, you have to specify your bins manually:

# + 1 so the last edge reaches ceil(max) and the largest values are not dropped
bins = np.arange(np.floor(nearest.min()), np.ceil(nearest.max()) + 1)
values, base = np.histogram(nearest, bins=bins, density=1)

You'll get:

In [18]: np.all(np.diff(base)==1)
Out[18]: True

In [19]: np.sum(values)
Out[19]: 0.99999999999999989

- perimosocordiae

Great! Thank you, the curve is now much closer to what I was aiming for. - J Kelly

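According to the docs: if bins is an int, it defines the number of equal-width bins in the given range (10 by default). So the OP's example should work by default, shouldn't it? Looks like a bug. - naught101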

Equal in width to each other, but not necessarily of width 1. - perimosocordiae

Ah, I see: the densities add up to the reciprocal of the bin width, so for equal-width bins you can get unity by multiplying by base[1]-base[0]. - naught101
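
As a minimal sketch of the point raised in this comment thread, the snippet below reruns the question's histogram call and turns the densities into a CDF by multiplying by the bin widths. The `nearest` array here is synthetic data invented for illustration (the asker's actual distances are not shown in the post), so the printed numbers are not the asker's.

```python
import numpy as np

# Synthetic stand-in for the question's `nearest` array (an assumption;
# the original data is not part of the post).
rng = np.random.default_rng(0)
nearest = rng.uniform(2000, 8000, size=500)

values, base = np.histogram(nearest, bins=20, density=True)
widths = np.diff(base)  # 20 equal widths: (max - min) / 20

# With equal-width bins the densities sum to 1 / width, not to 1.
density_sum = values.sum()

# Density times width is the probability of each bin; these sum to 1.
cumulative = np.cumsum(values * widths)

print(density_sum, 1 / widths[0])
print(cumulative[-1])
```

Each entry of `cumulative` is the probability up to the right edge of its bin, so plotting it against `base[1:]` rather than `base[:-1]` is the more natural choice. Using `np.diff(base)` instead of a single `base[1]-base[0]` keeps the same line valid if the bins are later made unequal.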

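Actually, the statement

"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function."

means that the output we get is the value of the probability density function for each bin. In a probability density function, the probability of falling between two values 'a' and 'b' is the area under the curve between 'a' and 'b'. So, to get the probability of each bin, we must multiply the bin's pdf value by its bin width. The resulting sequence of probabilities can then be used directly to compute the cumulative probabilities (since they are now normalized).

Note that the newly computed probabilities sum to 1, as total probability must; in other words, our probabilities are normalized.

See the code below, where I have used bins of different widths, some of width 1 and some of width 2.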

import math
import numpy as np

rng = np.random.RandomState(10)   # deterministic random data
a = np.hstack((rng.normal(size=1000),
               rng.normal(loc=5, scale=2, size=1000)))  # 'a' is our sample data
mini = math.floor(min(a))
maxi = math.ceil(max(a))
print(mini)
print(maxi)
ar1 = np.arange(mini, maxi / 2)                    # edges spaced 1 apart
ar2 = np.arange(math.ceil(maxi / 2), maxi + 2, 2)  # edges spaced 2 apart
ar = np.hstack((ar1, ar2))
print(ar)  # bin edges with unequal widths, passed as `bins` below
counts, bin_edges = np.histogram(a, bins=ar, density=True)
print(counts)     # the pdf value of each bin
print(bin_edges)  # the corresponding bin edges
print(np.sum(counts * np.diff(bin_edges)))     # total probability, equal to 1
print(np.cumsum(counts * np.diff(bin_edges)))  # cumulative probabilities; the last value is 1

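I think the reason the documentation mentions bins of width 1 is that, when the bin width equals 1, the pdf value and the probability of any bin are the same: the area of the bin is its pdf value multiplied by 1, which is just the pdf value again.

So in that case the pdf values equal the probabilities of the respective bins and are already normalized.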
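
The relationship described above can also be written out explicitly. The symbols are notation introduced here for illustration, not taken from the NumPy documentation: $n_i$ is the count in bin $i$, $w_i$ its width, $N$ the number of samples that fall inside the bins, $d_i$ the density returned with density=True, and $p_i$ the probability of the bin.

```latex
\[
d_i = \frac{n_i}{N\,w_i}, \qquad
p_i = d_i\,w_i = \frac{n_i}{N}, \qquad
\sum_i p_i = \frac{1}{N}\sum_i n_i = 1 .
\]
\[
\text{If } w_i = w \text{ for every } i:\qquad
\sum_i d_i = \frac{1}{w}\sum_i p_i = \frac{1}{w} .
\]
```

The second line shows why the raw density values sum to 1 only when $w = 1$, while the products $d_i\,w_i$ sum to 1 for any choice of widths.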

- bhanu pratap

- Paul H · Accepted Answer

You can simply normalize the values variable yourself, like this:

unity_values = values / values.sum()

A complete example would look like this:

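import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(size=37)
# pass only density=True; the old `normed` keyword is deprecated and
# absent from recent NumPy releases
density, bins = np.histogram(x, density=True)
# dividing by the sum gives bin probabilities because the default bins are equal-width
unity_density = density / density.sum()

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(8,4))
widths = np.diff(bins)  # positive bin widths
# align='edge' draws each bar from its left edge across the whole bin
ax1.bar(bins[:-1], density, width=widths, align='edge')
ax2.bar(bins[:-1], density.cumsum(), width=widths, align='edge')

ax3.bar(bins[:-1], unity_density, width=widths, align='edge')
ax4.bar(bins[:-1], unity_density.cumsum(), width=widths, align='edge')

ax1.set_ylabel('Not normalized')
ax3.set_ylabel('Normalized')
ax3.set_xlabel('PDFs')
ax4.set_xlabel('CDFs')
fig.tight_layout()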
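
One caveat worth sketching: dividing the densities by their sum reproduces the bin probabilities only when all bins share one width. The snippet below is an illustrative check, not part of the original answer; the bin edges are hypothetical values chosen to be unequal. It compares that rescaling with two routes that stay correct for any widths: normalizing the raw counts, and multiplying the density by the bin widths.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=37)

# Hypothetical, deliberately unequal bin edges (chosen for illustration).
edges = np.array([-5.0, -1.0, -0.5, 0.0, 0.5, 1.0, 5.0])

counts, _ = np.histogram(x, bins=edges)                 # raw counts per bin
density, _ = np.histogram(x, bins=edges, density=True)  # pdf value per bin

prob_from_counts = counts / counts.sum()      # fraction of samples per bin
prob_from_density = density * np.diff(edges)  # pdf value times bin width
rescaled_density = density / density.sum()    # sums to 1, but is not the bin probability here

print(prob_from_counts)
print(prob_from_density)
print(rescaled_density)
```

The first two arrays agree and can be passed to `cumsum` to obtain a CDF. The third also sums to 1, yet it weights narrow bins too heavily, so its cumulative sum is not the empirical CDF when the widths differ.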
